thejoshwolfe / yauzl

yet another unzip library for node
MIT License
737 stars 80 forks source link

CRC-32 checks #49

Open xeduardox opened 7 years ago

xeduardox commented 7 years ago

Readme says CRC32 check not performed. Is that an option or do you not plan to support it?

thejoshwolfe commented 7 years ago

I do not currently plan to do any CRC32 checks in yauzl. I know that is a feature of other zip file readers, and I'm not sure how valuable it would be for yauzl to support it.

In my opinion, hash-based error checking should really never be done inside a file format. A file format does not correspond to any situation where data corruption could occur. Rather error checking should happen when errors can happen, such as during network transmission. If you're trying to check for errors in your storage hardware degrading over time, then you can keep checksums of files beside the files and do your error checking whenever you like regardless of the file format. Additionally, if error checking is outside the file format, then you're not limited to a fixed set of hash algorithms, like CRC32.

That being said, it's very popular to include redundancy information in file formats. Even the brand new FLIF image format includes optional CRC32 checksums, despite everything else about the format seeming very progressive. Perhaps checksums in files are not as useless as I think. I'm not sure.

If anyone makes an argument to convince me that it's valuable to do the CRC32 checks in yauzl, I will gladly add optional support for it. Keep in mind that CRC32 computation is not free, so it will slow down unzipping slightly. Currently, yauzl can unzip faster than Info-ZIP's unzip command line program, which is written in C, for some zip files probably because yauzl is skipping the CRC32 checking.

If yauzl added support for CRC32 checking, then it would be an error emitted from the read stream obtained from openReadStream() after the file contents have been piped through before ending the pipeline.

overlookmotel commented 6 years ago

I would like to (belatedly) speak up in favour of CRC32 checking.

I agree with you in principle that hash-based error checking is better done outside of a file format. And there are more appropriate hashing algorithms than CRC32 for many uses.

However, in my use case, I do not control the source of ZIP files I work with, nor the transmission medium by which I receive them. The best indication I have of whether files are corrupted is the CRC32 values in the ZIP file.

I imagine this is not an uncommon situation (I notice some other issues on this repo where @thejoshwolfe has asked how a problematic ZIP file was created, and the answer was "no idea, someone sent it to me").

Would you be willing to reopen this issue?

thejoshwolfe commented 6 years ago

I believe I can simply add documentation to the README explaining how to do the CRC32 checks outside of yauzl. I think it's as simple as piping the readStream from openReadStream through a CRC32 checker, and comparing it to entry.crc32.

I'm reopening the issue to look into it.

overlookmotel commented 6 years ago

I have just published a module on npm yauzl-crc that adds CRC32 checking.

@thejoshwolfe If you have time, would you mind taking a look to see if I've missed anything? Streams are not my strong suit.

sirisian commented 2 years ago

For what it's worth, I am using this library to unzip archives and saw on some hardware it corrupts a file with null bytes. This is somewhat rare and is probably hardware specific. (Might be overheating, but that's just a guess). Decided to dive into how this is possible and saw this issue. On large systems the probability is near zero, but because of scale it happens. Creates some very subtle bugs.