gildas-lormeau / zip.js

JavaScript library to zip and unzip files supporting multi-core compression, compression streams, zip64, split files and encryption.
https://gildas-lormeau.github.io/zip.js
BSD 3-Clause "New" or "Revised" License

What is the fastest way to create large Zip (not an issue) #449

Closed · jobisoft closed this 10 months ago

jobisoft commented 10 months ago

I am a Thunderbird WebExtension developer and I am using your library to import/export messages as a zip file.

For import, I use the FS() method, because it lets me see what is inside the archive without loading the entire file into memory, which is a huge speed boost. I assume that is the fastest way?

For export, I currently also use the FS() method:

    let zipFs = new fs.FS();
    ...
    zipFs.addUint8Array(filename, mboxBytes, { useWebWorkers: false }); // webWorkers cause a CSP violation 
    ...
    let zipDir = zipFs.addDirectory(...);
    ...
    zipDir.addUint8Array(filename, mboxBytes, { useWebWorkers: false }); // webWorkers cause a CSP violation 
    ...
    let blob = await zipFs.exportBlob({ level });

Using addUint8Array() is faster than using addBlob() (probably because of the extra overhead of creating the Blob).

Is this the fastest way? Is using FS() for writes the best option? Asking because the zipFs.exportBlob() also needs a considerable amount of time.

I also would be interested if you have experiences with streaming the zip. What I can see is that downloading large zips to the local filesystem needs a lot of free system memory to hold the entire zip. The WebExtension downloads API however can download data in chunks: large files are not downloaded entirely and then written to the local filesystem, but written as soon as the data chunk is downloaded (there is a "parts" file in your download folder while downloading).

It would be great if we could send chunks of the zip as we add files to it, so exporting 5 GB of messages would not need 5 GB of free system memory. Would you have any pointers for me regarding this topic?

gildas-lormeau commented 10 months ago

The best way to improve CPU performance is to use web workers. In the event of a CSP violation, you should be able to use z-worker.js instead of the embedded script. You'll have to call zip.configure to pass the path to the script to zip.js, like this:

zip.configure({
  workerScripts: {
    deflate: ["./path/to/z-worker.js"],
    inflate: ["./path/to/z-worker.js"]
  }
});
...

I can confirm that the FS#export* methods should always be the most efficient in terms of performance. I also confirm that Uint8Array instances are faster, but they are limited to 4GB (or maybe less) of data.

Internally, zip.js relies entirely on Web Streams (e.g. ReadableStream and WritableStream). If there are APIs available to take advantage of this, I recommend you use them. They should help to consume as little RAM as possible.

Otherwise, you can also write your own Reader and/or Writer classes to process data in blocks. Here is a test that shows how to write such custom classes: https://github.com/gildas-lormeau/zip.js/blob/master/tests/all/test-custom-io.js.
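As a rough sketch of that approach: the custom classes in that test file follow a duck-typed shape (init / writeUint8Array / getData). The class below is hypothetical (not part of zip.js) and simply forwards each chunk to a callback as soon as it is produced, so the whole archive never has to sit in memory at once:

```javascript
// Hypothetical chunk-forwarding writer (not part of zip.js) following the
// duck-typed shape (init / writeUint8Array / getData) used by the custom
// classes in test-custom-io.js.
class ChunkWriter {
  constructor(onChunk) {
    this.onChunk = onChunk; // called once per Uint8Array chunk
    this.size = 0;          // total number of bytes written so far
  }
  init() {
    this.size = 0;
  }
  writeUint8Array(array) {
    this.size += array.length;
    this.onChunk(array);    // e.g. append the chunk to a file on disk
  }
  getData() {
    return this.size;       // nothing is buffered; only the size is kept
  }
}
```

An instance could then be passed where zip.js expects a Writer, with the callback appending each chunk to the download instead of accumulating one big Blob.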

jobisoft commented 10 months ago

Thanks for your swift response! I hope you do not mind me asking follow-up questions. :-)

I tried to use configure:

import { fs, configure } from "./modules/zip.js/index.js";

// Define paths to overcome CSP violations.
// See: https://github.com/gildas-lormeau/zip.js/issues/449#issuecomment-1704312167
configure({
  workerScripts: {
    deflate: ["./modules/zip.js/lib/z-worker.js"],
    inflate: ["./modules/zip.js/lib/z-worker.js"]
  }
});

But calling

zipFs.addUint8Array(filename, mboxBytes, { useWebWorkers: true });

still causes a CSP violation:

Content-Security-Policy: The page’s settings blocked the loading of a resource at blob:moz-extension://46cc048d-e9ae-47b2-890f-557f64f9c6b6/08253163-b1fc-4649-b151-649eeef5f210 (“script-src”).

Did I implement your hint correctly? It could very well be that this is a limitation of the WebExtension framework itself, but before digging into that rabbit hole, I wanted to make sure I used your hint correctly.

I also confirm that Uint8Array instances are faster, but they are limited to 4GB (or maybe less) of data.

Is that a limitation of the Uint8Array itself, or of your addUint8Array function? When creating Blobs, I learned that the constructor accepts a sequence of Uint8Arrays, but each sequence element may not be larger than 2GB, otherwise new Blob() will throw. Honoring that, I was able to create really large blobs, for example a 10GB test file.

So if the "file" I want to add to a zip is larger than 4GB, I should be able to split it up into multiple Uint8Arrays no larger than 2GB, create a blob, and then use your addBlob() to add it? I will try that with code like this:

  let bytes = await getMboxBytes(exportItem); // returns a Uint8Array
  let buffer = bytes.buffer;
  // The Blob constructor accepts a sequence, and each element may not exceed
  // 2GB, so split the data into smaller chunks.
  let pos = 0;
  let chunk = 1024 * 1024 * 1024;
  let sequence = [];
  // Use "<" rather than "<=" so the final push never adds an empty chunk
  // when byteLength is an exact multiple of the chunk size.
  while (pos + chunk < bytes.byteLength) {
    sequence.push(new Uint8Array(buffer, pos, chunk));
    pos += chunk;
  }
  sequence.push(new Uint8Array(buffer, pos));
  let blob = new Blob(sequence, { type: "text/plain" });
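For what it's worth, the splitting logic can be wrapped in a small self-contained helper and sanity-checked with a tiny chunk size (the 1 GiB constant is only swapped out here for illustration):

```javascript
// Split a Uint8Array into sub-views of at most chunkSize bytes, suitable
// for passing as the sequence argument of the Blob constructor. Using
// bytes.byteOffset keeps this correct even when bytes is a view into a
// larger ArrayBuffer.
function splitIntoChunks(bytes, chunkSize) {
  const sequence = [];
  for (let pos = 0; pos < bytes.byteLength; pos += chunkSize) {
    const length = Math.min(chunkSize, bytes.byteLength - pos);
    sequence.push(new Uint8Array(bytes.buffer, bytes.byteOffset + pos, length));
  }
  return sequence;
}

// Example with a tiny chunk size instead of 1024 * 1024 * 1024:
const data = Uint8Array.from({ length: 10 }, (_, i) => i);
const chunks = splitIntoChunks(data, 4);
// chunks have lengths 4, 4 and 2, and concatenate back to the original data
```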

I will analyze the streaming pointers, thanks for those!

gildas-lormeau commented 10 months ago

You have to use the z-worker file located in the /dist/ folder, not the /lib/ folder. To circumvent the CSP issue, you can add the path to the z-worker.js file in the web_accessible_resources array in the manifest.json file.
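For illustration, a Manifest V2-style entry might look like the snippet below; the path is an assumption matching the import paths above (Manifest V3 uses an object form with `resources` and `matches` instead of a plain string array):

```json
{
  "web_accessible_resources": [
    "modules/zip.js/dist/z-worker.js"
  ]
}
```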

Regarding the max. size of Uint8Array instances, this is a limitation of JS engines, AFAIK.

gildas-lormeau commented 10 months ago

I'm moving this issue to the "Discussions" tab.