thejoshwolfe / yauzl

yet another unzip library for node
MIT License

Serialise entry item #111

Closed: popey456963 closed this issue 4 months ago

popey456963 commented 4 years ago

Is there a way to serialise and deserialise an entry in order to retrieve an item at a later date (or from another process / server)?

thejoshwolfe commented 4 years ago

Do you want the file contents also? If not, then the entry metadata may not have much meaning when separated from the zipfile object it came from.

If you do want to serialize the file contents as well as the metadata, then it sounds like you should just store the file on the file system.

I'm not sure I understand the use case.

popey456963 commented 4 years ago

We have some data stored within Amazon's S3 service as a zip file. We retrieve it via a custom RandomAccessReader. Some of our zip files are large enough that we want to parallelise the operation in order to make it faster.

This seems fairly easy to do in theory: one program quickly scans through the directory and outputs a series of job batches. Those batches are then split up and sent to different machines, each of which completes a part of the unzipping.
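As a rough illustration of the batching step, here is a minimal sketch that spreads entries across workers by compressed size. The function name `planBatches` and the shape of the entry descriptors are hypothetical; only the `compressedSize` field is read, and the greedy "largest first into the lightest batch" strategy is one assumption among many possible ways to split the work.

```javascript
// Hypothetical helper: distribute entry descriptors across `workerCount`
// batches so that each batch carries a similar number of compressed bytes.
function planBatches(entries, workerCount) {
  const batches = Array.from({ length: workerCount }, () => ({
    totalBytes: 0,
    entries: [],
  }));
  // Largest entries first; each one goes into the currently lightest batch.
  const sorted = [...entries].sort((a, b) => b.compressedSize - a.compressedSize);
  for (const entry of sorted) {
    const lightest = batches.reduce((min, b) => (b.totalBytes < min.totalBytes ? b : min));
    lightest.entries.push(entry);
    lightest.totalBytes += entry.compressedSize;
  }
  return batches;
}
```

Each batch can then be serialised and handed to a separate machine; the scan itself stays on one process.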

At the moment our implementation is essentially something like this:

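// Entry methods do not survive JSON, so only the data fields go over the wire.
const serialize = entry => JSON.stringify(entry)
const deserialize = json => {
  const entry = JSON.parse(json)
  // Re-attach the two methods that were lost in serialisation.
  entry.isEncrypted = () => (entry.generalPurposeBitFlag & 0x1) !== 0
  entry.isCompressed = () => entry.compressionMethod === 8 // 8 = DEFLATE
  return entry
}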

A serialised 'entry' can be transferred to another machine and handled as normal via zipfile.openReadStream(entry) once it has been deserialised.
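To show why the methods have to be restored after the round trip, here is a small self-contained sketch. The `entry` object is a hypothetical stand-in with a few of the fields mentioned in this thread, not a real yauzl Entry.

```javascript
// Hypothetical stand-in for an entry: plain data fields plus one method.
const entry = {
  fileName: 'data/part-0001.csv',
  compressionMethod: 8,
  generalPurposeBitFlag: 0,
  relativeOffsetOfLocalHeader: 1024,
  isCompressed() { return this.compressionMethod === 8; },
};

// JSON.stringify skips function-valued properties, so the method is dropped.
const wire = JSON.stringify(entry);
const revived = JSON.parse(wire);

console.log(typeof revived.isCompressed);         // "undefined"
console.log(revived.relativeOffsetOfLocalHeader); // 1024
```

Numeric fields come back unchanged. Any Buffer-valued fields would come back as plain `{ type: 'Buffer', data: [...] }` objects rather than Buffers, which is worth checking if a worker relies on them.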

However, our solution, the deserialisation step especially, seems rather 'hacky'. It would be nice for the library to have a built-in way of serialising / deserialising entries.

thejoshwolfe commented 4 years ago

So you're opening a zipfile object, and then passing in a custom entry implementation? That's hacky indeed. From the docs on openReadStream:

> entry must be an Entry object from this ZipFile.

So it sounds like what you're doing is technically undefined behavior, according to the API.

Adding formal support for this use case would be tricky, especially in light of the major implementation changes required for #69 (which has been seeing a trickle of progress off and on for over a year). My first hunch is that this use case would be too hard to support once #69 is solved, since the act of producing entry objects from readEntry() might modify the state of the zipfile in a way necessary for the processing of that and other entries. I'm not sure if that's actually a concern, because the work on #69 is not done yet, but I don't want to make any API guarantees that would tie the hands of the implementation of #69.

This use case may simply be too advanced and obscure for yauzl to support. If the feature were to be implemented somewhere, it should be in yauzl as opposed to in any higher-level code, because this requires knowledge of the internal workings of the zipfile format. But I'm just not sure it's worth formal support.

You are, of course, free to use your hack to achieve your goals, but you'll just need to be careful at every yauzl version change that your hack doesn't break.

I also need to be honest: my enthusiasm for this project is quite low in the face of the daunting complexity of writing a high-quality solution to #69, so I may be biased to be clinical about development on this project. Sorry.

popey456963 commented 4 years ago

That's definitely fair. YAUZL is an exceptional piece of software and the only standards-compliant unzipper that I've been able to find written in JavaScript, but it does mostly just seem to be you contributing to it in your spare time.

> and then passing in a custom entry implementation

We're serialising an existing validated Entry object, but our serialisation process loses any methods attached to it, so we have to add them back in. Since writing this, however, we've implemented a better way, which might be more stable:

const deserialize = entry => Object.assign(new Entry(), JSON.parse(entry))

thejoshwolfe commented 4 months ago

Sorry for the long silence.

I've just released yauzl 3.1.0, which has a solution for your use case! Take a look at the docs for readLocalFileHeader() and openReadStreamLowLevel(), which explicitly support serializing and deserializing the needed parameters for opening a read stream.
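For readers following along, here is a minimal sketch of how those two methods might be combined for the parallel use case above. It assumes the signatures as I understand them from the yauzl 3.1.0 README, namely `readLocalFileHeader(entry, options, callback)` yielding an object with `fileDataStart`, and `openReadStreamLowLevel(fileDataStart, compressedSize, relativeStart, relativeEnd, decompress, uncompressedSize, callback)`; please verify both against the docs before relying on them. The helper names `describeEntry` and `openFromDescriptor` are hypothetical, and the sketch assumes unencrypted entries that are either stored or DEFLATE-compressed.

```javascript
// Coordinator side: reduce an entry to plain values that survive JSON.
// `zipfile` is assumed to be an open yauzl ZipFile (3.1.0 or later).
function describeEntry(zipfile, entry, callback) {
  // Assumption: with { minimal: true } the header object carries fileDataStart.
  zipfile.readLocalFileHeader(entry, { minimal: true }, (err, localFileHeader) => {
    if (err) return callback(err);
    callback(null, {
      fileName: entry.fileName,
      fileDataStart: localFileHeader.fileDataStart,
      compressedSize: entry.compressedSize,
      uncompressedSize: entry.uncompressedSize,
      decompress: entry.compressionMethod === 8, // 8 = DEFLATE
    });
  });
}

// Worker side: open the stream from the deserialised descriptor alone.
// `zipfile` here is the worker's own ZipFile opened on the same archive.
function openFromDescriptor(zipfile, job, callback) {
  zipfile.openReadStreamLowLevel(
    job.fileDataStart, job.compressedSize,
    0, job.compressedSize, // relativeStart, relativeEnd: the whole entry
    job.decompress, job.uncompressedSize,
    callback);
}
```

The descriptor contains only strings, numbers and a boolean, so `JSON.stringify` and `JSON.parse` carry it between machines without any method re-attachment.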

Let me know if that works for you, and please reopen if you have any issues or questions!