gildas-lormeau / zip.js

JavaScript library to zip and unzip files supporting multi-core compression, compression streams, zip64, split files and encryption.
https://gildas-lormeau.github.io/zip.js
BSD 3-Clause "New" or "Revised" License

Performance reading/writing large zip files #380

Closed AshleyScirra closed 1 year ago

AshleyScirra commented 1 year ago

Hi, and thanks for making a great library!

I've been looking into the performance of zip.js with very large zip files. In our case we can end up dealing with 100 MB+ zip files containing 75000+ files. I think zip.js could be faster at handling this.

I made a small benchmark here (host the files on a webserver and load index.html): zipjs-benchmark.zip

The benchmark writes a zip with 10000 tiny text files, then reads the zip back, extracting all the files, using default settings. (It reads and writes everything asynchronously in parallel, but throttles to a maximum of 16 concurrent operations; otherwise dumping thousands of simultaneous reads/writes produces some pretty pathological performance characteristics. I don't think the throttling reduces the benchmark time in any significant way.)
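
For illustration, the throttling works along these lines (a simplified sketch with a hypothetical createLimiter helper, not the exact benchmark code):

    // Runs at most `max` async tasks at once; further tasks wait in a queue.
    function createLimiter(max) {
        let active = 0;
        const queue = [];
        const next = () => {
            if (active < max && queue.length) {
                active++;
                const { task, resolve, reject } = queue.shift();
                task().then(resolve, reject).finally(() => {
                    active--;
                    next();
                });
            }
        };
        return task => new Promise((resolve, reject) => {
            queue.push({ task, resolve, reject });
            next();
        });
    }

    // Usage: const limit = createLimiter(16);
    // await Promise.all(entries.map(entry => limit(() => entry.getData(new zip.BlobWriter()))));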

Despite running on a high-end desktop, it takes ~1ms per file. That might sound fast, but it's only about 1000 files per second. So if you have a zip with 60000 files, it will take at least a minute to process (either reading or writing), even on a high-end system, which seems like a long time to wait given how fast modern systems are.

With both reading and writing, a Chrome profile shows performance bottlenecked on the main thread - using web workers doesn't look like it helps here. In both cases around half the main thread CPU time appears to be in "Run Microtasks" and "slice".

I would guess zip.js uses a great deal of promises and slices for thousands of tiny bits of work. Perhaps there is some low-hanging fruit to optimise here? For example, these days most devices have so much memory that loading a 200 MB zip into memory isn't really a problem, so maybe zip.js could provide a way to read/write zips directly to/from one big ArrayBuffer, creating lightweight Uint8Arrays directly on that ArrayBuffer without making any copies. I tried prototyping that for reading zips, but I couldn't get it working without slicing the ArrayBuffer, so I guess there are some internal details I didn't get right. Or maybe it could use chunks of ArrayBuffers as a sort of cache (although that might not work well with random access). Or maybe something else. Anyway, just thought I'd file this in case anyone has some good ideas.

AshleyScirra commented 1 year ago

Just been trying a bit of experimentation myself. I tried making a basic ArrayBufferReader:

    // Zero-copy reader: returns Uint8Array views directly onto the
    // underlying ArrayBuffer instead of slicing out copies.
    class ArrayBufferReader extends Reader {

        constructor(arrayBuffer) {
            super();
            Object.assign(this, {
                arrayBuffer,
                size: arrayBuffer.byteLength
            });
        }

        readUint8Array(offset, length) {
            const reader = this;
            // A view, not a copy: shares memory with arrayBuffer.
            return new Uint8Array(reader.arrayBuffer, offset, length);
        }
    }

However, reading a zip file with this fails with Error: Central directory header not found. I think this is because zip.js uses the pattern new DataView(uint8array.buffer) in some places, which means the DataView reads the entire underlying buffer. I tried replacing those cases with new DataView(uint8array.buffer, uint8array.byteOffset, uint8array.byteLength), and it was even slower! That's weird, I have no idea why... maybe I made some mistake updating the DataView code.
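
For reference, the difference between the two patterns is this (just an illustration, not zip.js's actual code):

    // Suppose we have a Uint8Array view into the middle of a larger buffer:
    const buffer = new ArrayBuffer(16);
    const u8 = new Uint8Array(buffer, 4, 8);

    // Spans the WHOLE 16-byte buffer from offset 0, so every read is
    // shifted by 4 bytes relative to the Uint8Array view:
    const wrong = new DataView(u8.buffer);

    // Covers exactly the same 8 bytes as the Uint8Array view:
    const right = new DataView(u8.buffer, u8.byteOffset, u8.byteLength);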

If I instead fix it by slicing the ArrayBuffer, which means making lots of tiny copies, and leaving the DataView code as-is:

    return new Uint8Array(reader.arrayBuffer.slice(offset, offset + length));

This brings the time to read the zip down from 16.2 seconds with BlobReader to 7.9 seconds with ArrayBufferReader - roughly a 50% improvement for just a few lines of code, even though it's still copying the ArrayBuffer! This suggests to me there is indeed low-hanging fruit here.

It would be nice to see a built-in ArrayBufferReader along these lines - callers could then decide whether to read zips with BlobReader or ArrayBufferReader depending on the zip size and memory requirements. Ideally it could avoid slicing the ArrayBuffer (it shouldn't be necessary for reading) while still being faster, but I'm not sure what was going on with that benchmark. Perhaps another approach could be to create one DataView across the entire ArrayBuffer and always refer to that, applying the offsets to its get methods.
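
Roughly this (an untested sketch of the idea, with illustrative field names):

    // Construct one DataView over the whole buffer once, then pass offsets
    // to its get methods instead of creating a new DataView per read.
    const dataView = new DataView(arrayBuffer);
    const signature = dataView.getUint32(offset, true);  // little-endian read at `offset`
    const flags = dataView.getUint16(offset + 6, true);  // another field, no new DataView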

I'm not sure how this would all apply to writing but perhaps similar gains could be achieved without too much work.

gildas-lormeau commented 1 year ago

I tried your test and I noticed that in this particular case, the call to Blob#slice when creating zip files can be avoided because the size of the data is always smaller than chunkSize (512 KB by default). The fix (see the commit above) improves performance a little.

Regarding the ArrayBufferReader and for small files, what about using Uint8ArrayReader instead?

AshleyScirra commented 1 year ago

what about using Uint8ArrayReader instead?

Ah, don't know how I missed that! Yeah, it works great for reading and improves performance a lot.
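
For reference, reading with it looks like this (inside an async function, with the zip data already in memory):

    const zipReader = new zip.ZipReader(new zip.Uint8ArrayReader(new Uint8Array(arrayBuffer)));
    const entries = await zipReader.getEntries();
    const text = await entries[0].getData(new zip.TextWriter());
    await zipReader.close();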

For writing though, Uint8ArrayWriter is slower than BlobWriter - BlobWriter takes 12.1 seconds, Uint8ArrayWriter takes 14.9 seconds. I think this is because of O(n^2) performance in Uint8ArrayWriter - writes past the end have to copy all the existing content to a new buffer and then add the new content (https://github.com/gildas-lormeau/zip.js/blob/master/lib/core/io.js#L518). It's faster to just queue up all the writes and copy them all to a single buffer in getData(). Here's a quick/naive implementation:

    // Naive writer: just queue up the chunks, then concatenate them all
    // in a single pass in getData(), avoiding the O(n^2) copy-on-grow.
    class Uint8ArrayWriter extends Writer {

        init(initSize = 0) {
            super.init();
            // initSize is ignored here: nothing is allocated until getData().
            Object.assign(this, {
                arrayParts: [],
                totalSize: 0
            });
        }

        writeUint8Array(array) {
            const writer = this;
            // No copying yet - just remember the chunk.
            writer.arrayParts.push(array);
            writer.totalSize += array.length;
        }

        getData() {
            const writer = this;
            // Single allocation, single pass of copies.
            const ret = new Uint8Array(writer.totalSize);

            let i = 0;
            for (const array of writer.arrayParts) {
                ret.set(array, i);
                i += array.length;
            }

            return ret;
        }
    }

For me this takes the zip write time down to about 11 seconds. That's much faster than the previous Uint8ArrayWriter implementation, but still only a little faster than BlobWriter.
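
(For reference, it plugs in exactly like the built-in one, e.g. inside an async function:)

    // Using the custom writer in place of the built-in Uint8ArrayWriter:
    const zipWriter = new zip.ZipWriter(new Uint8ArrayWriter());
    await zipWriter.add("hello.txt", new zip.TextReader("Hello world"));
    const zipData = await zipWriter.close(); // the Uint8Array from getData()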

One possible reason it's still slow is that blob() calls still show up in the profile when writing a zip with Uint8ArrayWriter. This seems to be because getFileEntry() creates a new BlobWriter() regardless of the writer type: https://github.com/gildas-lormeau/zip.js/blob/master/lib/core/zip-writer.js#L354 - perhaps if that was also a Uint8ArrayWriter when writing the zip with a Uint8ArrayWriter, it would be faster?

Also, even with Uint8ArrayReader, 7 seconds still seems pretty slow to read 10000 files - that's still 42 seconds for a zip with 60000 files. Surely it should be able to read 10000 files in ~1 second? A Chrome profile shows it's extremely heavy on posting tasks to the main thread - I suspect this is a bit of a worst-case scenario for task posting. I'm not sure what's ultimately causing that (maybe the use of streams?) or whether a browser issue should be filed so browsers can optimise it.

gildas-lormeau commented 1 year ago

I agree that your implementation is faster, but it also consumes twice as much RAM. Note also that when reading entries in a zip file, init is called with the size of the uncompressed data (via initSize). It means the test here https://github.com/gildas-lormeau/zip.js/blob/master/lib/core/io.js#L515 is never true, so this perf issue should only happen when creating zip files, not when reading them.

Actually, I saw similar issues with BlobWriter before integrating streams into zip.js. As a compromise, that implementation limited the number of pending chunks (see the code below). Maybe that would give acceptable results here. An alternative would be to rely on streams like the current implementation of BlobWriter, but it might also consume twice as much RAM. Another approach could be to create Uint8Array objects larger than necessary (e.g. the size of the uncompressed data when writing an entry) and slice them down to the correct size when calling getData.

https://github.com/gildas-lormeau/zip.js/blob/9b56c151f5213d00490bc2a09c6bf74f955b19aa/lib/core/io.js#L188-L225
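
That over-allocation idea would look roughly like this (an untested sketch, with a hypothetical class name):

    class GrowableUint8ArrayWriter extends Writer {

        init(initSize = 0) {
            super.init();
            this.offset = 0;
            // Over-allocate up front; initSize is the uncompressed-size hint.
            this.array = new Uint8Array(initSize);
        }

        writeUint8Array(array) {
            const writer = this;
            if (writer.offset + array.length > writer.array.length) {
                // Grow geometrically so appends stay amortized O(1).
                const newArray = new Uint8Array(Math.max(writer.array.length * 2, writer.offset + array.length));
                newArray.set(writer.array);
                writer.array = newArray;
            }
            writer.array.set(array, writer.offset);
            writer.offset += array.length;
        }

        getData() {
            // Slice down to the bytes actually written.
            return this.array.slice(0, this.offset);
        }
    }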

Maybe zip.js should provide two sub-types of Reader/Writer classes: one that optimizes RAM usage and another that optimizes CPU usage. Note that it's totally OK (and encouraged) to provide your own Reader/Writer classes when necessary; the library has been designed for that.

Apart from that, I will try to do some profiling to see if I can find some points of improvement, but I feel it will be hard.

gildas-lormeau commented 1 year ago

I forgot to mention that zip.js relies on BlobWriter instead of Uint8ArrayWriter internally when adding entries concurrently to a zip file, because the max. size of Uint8Array objects might be less than 4GB (it's not specified). This is a problem for zip64 support. I agree that Uint8ArrayWriter might be used in some particular cases though, e.g. when zip64 is explicitly set to false or when, as you suggest, the user passes a Uint8ArrayWriter object.

AshleyScirra commented 1 year ago

That's a good point about the RAM usage and maximum supported size - I hadn't thought about those. For our case they're not a huge problem, but there will be other cases where those are important concerns.

You're right that custom reader/writer classes give us scope to make our own tradeoffs - we can switch to a customised writer to improve performance for our specific case. More built-in classes covering these tradeoffs would be nice, but in a sense that is already supported via the custom classes.

I still think the default blob reader/writer classes are strangely slow, though. It's hard to make sense of a profile as it's mostly thousands of tiny tasks, so I'm inclined to suspect there is too much task-posting overhead, but the profile doesn't really offer any clues as to why so many tasks are posted (suspecting the stream API was a bit of a wild guess). Perhaps for this aspect we could escalate it to a browser vendor like Google and see what their engineers say? Optimising a widely-used zip library should be a compelling optimisation target for them.

gildas-lormeau commented 1 year ago

Well, the BlobWriter and BlobReader classes are quite simple since I refactored zip.js to rely entirely on streams under the hood. BlobWriter is basically a wrapper around a TransformStream. BlobReader is a bit more complex because it supports random reads, but it also delegates as much as it can to a ReadableStream.
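
Conceptually, BlobWriter boils down to something like this (a simplified sketch, not the actual code):

    // Written chunks go through an identity TransformStream and are
    // collected into a Blob via the Response API.
    const { readable, writable } = new TransformStream();
    const blobPromise = new Response(readable).blob();
    const writer = writable.getWriter();
    await writer.write(new Uint8Array([80, 75, 3, 4])); // chunks from the deflater
    await writer.close();
    const blob = await blobPromise; // final zip data as a Blob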

Do you know how much slower zip.js is compared to another program/implementation (in any language/lib) when running this test? I didn't do this kind of comparison tests on my end.

gildas-lormeau commented 1 year ago

I'm moving this issue in the "Discussions" tab.