kylefarris / clamscan

A robust ClamAV virus scanning library supporting scanning files, directories, and streams with local sockets, local/remote TCP, and local clamscan/clamdscan binaries (with failover).
MIT License

Chunked passthrough? #99

Closed. BobbyWibowo closed this issue 2 years ago.

BobbyWibowo commented 2 years ago

A WriteStream created with fs.createWriteStream() accepts a flags option, which allows the stream to be opened in append mode (doc). Combined with readStream.pipe(writeStream, { end: false }), this facilitates so-called "chunked uploads": a file that needs to be uploaded is first split on the client's end, each chunk is then sent to the server sequentially, and the server keeps piping each chunk into that single WriteStream as it arrives.
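
For reference, the core of that mechanism looks roughly like this (a minimal sketch; the file path and chunkReadStream are illustrative placeholders, not anything from clamscan):

const fs = require('node:fs');

// flags: 'a' opens the output file in append mode, so repeated writes accumulate
const writeStream = fs.createWriteStream('/uploads/some-file', { flags: 'a' });

// end: false keeps the WriteStream open after this chunk's ReadStream ends,
// so the next chunk's ReadStream can be piped into the same WriteStream later
chunkReadStream.pipe(writeStream, { end: false });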

Assuming such workflow, I was wondering, does it make sense to request that one day clamscan.passthrough() be able to support such split/chunked data, somehow?

I'd assume clamscan would have to withhold the passthrough data until it's all been received, before forwarding it to ClamAV to be scanned? Or somehow hold the stream open until some code manually says that the stream has finished? Honestly, the back-end of all this is a bit beyond me, and I only have some basic knowledge of how to make use of Streams rather than their specifics, so excuse me if I'm not making much sense 😅

kylefarris commented 2 years ago

What you're asking for is already happening. Node handles chunking automatically. You can choose to implement a Writable stream any way you like. Remember, the endpoint is not always a file--it could be another service that accepts streams, for instance. A simple example would be that you could use the passthrough method to pipe a file upload stream through clamscan, then into zlib to compress it, and then on to Amazon S3.

For example:


const { createGzip } = require('node:zlib');
const NodeClam = require('clamscan');
const AWS = require('aws-sdk');

AWS.config.region = '<your region here>';

const s3Config = {
    params: {
        Bucket: '<your bucket name here>',
    },
};
const s3 = new AWS.S3(s3Config);
const s3Stream = require('s3-upload-stream')(s3);

const gzip = createGzip();
const uploadStream = getSomeUploadStream(); // oversimplification
const clamscan = await new NodeClam().init({
    debugMode: true,
    clamdscan: {
        host: 'localhost',
        port: 3310,
        bypassTest: true,
    },
});
const av = clamscan.passthrough();
// s3-upload-stream's upload() returns a writable stream for the destination object
const output = s3Stream.upload({ Bucket: '<your bucket name here>', Key: '<your object key here>' });

// Do some stream piping
uploadStream.pipe(av).pipe(gzip).pipe(output);

// Handle events from passthrough
av
    .on('error', (error) => {
       // Handle errors
    })
    .on('timeout', () => {
        // scan/stream has timed-out
    })
    .on('finish', () => {
        // stream has been fully read and sent to scanner
    })
    .on('end', () => {
        console.log('All data has been scanned and sent on to the destination!');
    })
    .on('scan-complete', (result) => {
        console.log('Scan Complete: Result: ', result);
        if (result.isInfected === true) {
            // stream is a virus
        } else if (result.isInfected === null) {
            // Issue scanning stream
        } else {
            // stream is not a virus
        }
    });

output.on('finish', () => {
    // data has been fully written to the output
    output.destroy();
});

output.on('error', (error) => {
    console.log('Final Output Fail: ', error);
});

BobbyWibowo commented 2 years ago

Hi there,

The deal is that the file is split on the client's end into multiple chunks of bytes, which are then uploaded one by one, each in its own POST request.

So each incoming request's ReadStream is not a complete file, but just a chunk of bytes.


Here's the rough description of how it works:

First, I initiate a WriteStream for the output physical file with the flags: 'a' option for append mode; that stream is shared across all of the requests. Clients generate a unique UUID for each individual file, so the server knows which group of requests to pipe into which shared WriteStream.

Then I just pipe all of the ReadStreams, in sequence, into said WriteStream:

// end set to false so that the shared WriteStream will not be finalized after a single ReadStream's pipe
readStream.pipe(writeStream, { end: false })

After uploading all the chunks of bytes, the client sends another request to the server to finalize the stream, after which I do:

writeStream.end()

The final output physical file ends up byte-identical to the pre-split file on the client's end.
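
Putting those pieces together, the flow looks roughly like this (a minimal sketch with hypothetical Express-style handlers; uploads, uuid, and the /tmp path are illustrative placeholders):

const fs = require('node:fs');
const uploads = new Map();

// Called for each chunk's POST request
function handleChunk(req, res) {
    const { uuid } = req.query; // client-generated ID for the whole file
    if (!uploads.has(uuid)) {
        // flags: 'a' opens the shared output file in append mode
        uploads.set(uuid, fs.createWriteStream(`/tmp/${uuid}`, { flags: 'a' }));
    }
    const writeStream = uploads.get(uuid);

    // end: false keeps the shared WriteStream open after this chunk's request body ends
    req.pipe(writeStream, { end: false });
    req.on('end', () => res.end('chunk received'));
}

// Called once by the client after all chunks have been sent
function finalizeUpload(req, res) {
    const { uuid } = req.query;
    const writeStream = uploads.get(uuid);
    if (writeStream) {
        writeStream.end();
        uploads.delete(uuid);
    }
    res.end('upload finalized');
}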

If you're wondering why such an elaborate system is even a thing people do: in my case it's to work around the 100 MB POST limit of Cloudflare's Free plan, but it's applicable to any other proxy with a limited POST size. It basically lets me immediately have a complete file (fast), instead of having to write multiple "chunk files" which then have to be re-combined later (slow).


But I can't say the same if I first pipe it through clamscan's passthrough stream (assuming I also initiate it once per individual file and then share it across all of that file's chunk requests):

readStream
  .pipe(scanStream, { end: false })
  .pipe(writeStream, { end: false })

For reasons that I can't really understand, the bytes that get piped into the writeStream end up all sorts of funky.


I'm aware that it works as expected when simply dealing with a whole file, though (i.e. something like the example code in your reply).

kylefarris commented 2 years ago

Yeah, unfortunately, clamscan can't really scan files as partials since the bad part could be split across chunks and hence undetectable.

A rudimentary example: suppose the string 'ABC1234' is a known virus.

Chunk 1: 'foobarAB' (looks clean, let it through)
Chunk 2: 'C1234baz' (looks clean, let it through)

Concatenated output: 'foobarABC1234baz' (the virus made it through, not good!)

So, behind the scenes, when using the passthrough method this package is essentially splitting the Readable stream into 2 readable streams, sending one to the clamav socket/IP service and one to the piped output (for example, a writable file stream). It only sends a chunk to the piped output once we can confirm that chunk has been received by clamav. If at any point the clamav service detects a virus in the accumulated data it has received, it will immediately kill the secondary piped output and emit the 'scan-complete' event with isInfected set to true. You should then delete the partially-written-to file immediately.
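
For instance, the cleanup step for an infected stream could look like this (a minimal sketch; av is the passthrough stream from the earlier example and outputPath is a hypothetical path the output was being written to):

const fs = require('node:fs');

av.on('scan-complete', (result) => {
    if (result.isInfected === true) {
        // ClamAV flagged the accumulated data; the piped output has already been
        // stopped, so remove whatever was partially written to disk
        fs.unlink(outputPath, (err) => {
            if (err) console.error('Could not remove partial file:', err);
        });
    }
});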

In other words, it can detect a virus mid-stream but only if it has all the chunks preceding it. This all happens in a socket "session". We open a socket connection to clamav, send some commands to let it know we are going to scan stuff, and then send some chunks. After each chunk it acknowledges receipt and then responds with GOOD/BAD. If any are BAD, we stop. If all are GOOD, we keep sending chunks until we have no more. Once all chunks are sent, we close the socket connection to clamav. If that socket connection is open for too long without anything written to it, ClamAV throws a TIMEOUT. Either way, that socket connection is created and stored in a handle/variable when the passthrough method is called. It can't really be stored "globally".
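
For a rough idea of what such a socket "session" involves, here is a bare-bones clamd INSTREAM exchange over a raw socket (a conceptual sketch only, not how this package implements it; assumes clamd is listening on localhost:3310):

const net = require('node:net');

function scanBuffer(buffer, host = 'localhost', port = 3310) {
    return new Promise((resolve, reject) => {
        const socket = net.connect(port, host, () => {
            // Tell clamd a stream of data is coming
            socket.write('zINSTREAM\0');

            // Each chunk is prefixed with its size as a 4-byte big-endian integer
            const size = Buffer.alloc(4);
            size.writeUInt32BE(buffer.length, 0);
            socket.write(size);
            socket.write(buffer);

            // A zero-length chunk ends the session; clamd then sends its verdict
            socket.write(Buffer.from([0, 0, 0, 0]));
        });

        let reply = '';
        socket.on('data', (data) => { reply += data.toString(); });
        // e.g. 'stream: OK' or 'stream: Eicar-Test-Signature FOUND'
        socket.on('end', () => resolve(reply.replace(/\0/g, '').trim()));
        socket.on('error', reject);
    });
}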

Long story short, I don't think we can have an infinitely-appendable writable stream to the ClamAV socket in the same way that we can with a file. Infinitely-appendable files are easy since there aren't any "timeout" issues with files and they are always written to disk, whereas commands written to a socket session are not.

I hope that makes sense, haha.

BobbyWibowo commented 2 years ago

Thank you for the thorough explanation

I see, that does make sense

The random idea I wrote in the original post:

I'd assume clamscan would have to withhold the passthrough data until it's all been received, before forwarding it to ClamAV to be scanned?

would not really make much sense either, now that I think about it

By that point, you'd either be wastefully keeping extra temporary physical files for the withheld bytes, or ending up with more memory usage if they were withheld in memory instead.

Perhaps, for chunked uploads, I'll just use the passthrough during finalization, in the stage where the server needs to move the completed file from the working directory over to the final storage. The chunks aren't written to the final storage directly, because some server owners who use network drives learned that such a setup apparently doesn't support the WriteStream's append mode, which sounds like yet another deep rabbit hole of technicalities, so my stop-gap measure has always been to just fs.copyFile, haha.
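
As a rough illustration of that finalization step (a minimal sketch; workingPath, finalPath, and the initialized clamscan instance are assumed, the last one set up as in the earlier example):

const fs = require('node:fs');

const input = fs.createReadStream(workingPath);   // complete file assembled from chunks
const output = fs.createWriteStream(finalPath);   // final storage destination
const av = clamscan.passthrough();

input.pipe(av).pipe(output);

av.on('scan-complete', (result) => {
    if (result.isInfected) {
        // the assembled file is infected; remove whatever was copied to final storage
        fs.unlink(finalPath, () => {});
    }
});

output.on('finish', () => {
    // the scanned file has been fully written to final storage
});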

Anyways, thanks again, and also for the awesome work you've put into this library!

I'm satisfied enough with what I've ended up learning from this issue, so I'll be closing it now

kylefarris commented 2 years ago

Okay. Yeah, in your use-case, I think that's gonna be the best way forward. Have a good one!