Sorry for the delayed response.
The zip file format allows a central directory header and a local file header to have completely different and conflicting information for the same entry. This is rarely done, and the most common times I've seen it are when the local file header is being written at a time when the program has incomplete information about the entry, which means the central directory header is generally more authoritative. However, there's no bound on how strange and frustrating zip file implementations will be, and it doesn't surprise me that you've somehow gotten zip files with conflicting extra field data. I can only speculate why the lengths and data would be different, but it's not an error; it's just silly.
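For example, here's a minimal sketch (the zip path is a placeholder) that dumps what the central directory side carries, using the documented `entry.extraFields`; comparing that against the raw bytes of the local file header is what usually reveals the mismatch:

```js
// Sketch: print the extra fields yauzl parsed out of the *central directory*
// for each entry. The local file header can carry a different set entirely.
// The zip path below is a placeholder.
const yauzl = require("yauzl");

yauzl.open("/path/to/your/file.zip", { lazyEntries: true }, (err, zipfile) => {
  if (err) throw err;
  zipfile.readEntry();
  zipfile.on("entry", (entry) => {
    const summary = entry.extraFields
      .map((field) => "0x" + field.id.toString(16) + ": " + field.data.length + " bytes")
      .join(", ");
    console.log(entry.fileName + " -> " + (summary || "(no extra fields)"));
    zipfile.readEntry();
  });
});
```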
If yauzl had a more complete and even lower-level API, it would fully expose all the low-level information and let the client deal with it however they like. That's... actually kind of a good idea. yauzl should probably do that. But as it's designed now, yauzl wants to present an API that gives you the file name of the entry, for example, instead of giving you 4 different file names that are all probably the same. (Yes, you can encode the file name 4 times in the zip file, and it's not uncommon.)
As far as why yauzl uses the local file header's `extraFieldLength`, it's because it's the most reliable way to determine how many bytes to ignore in that area of the file. You're right that it's technically not completely ignored, but it's only used to measure the size of the data structure for the purpose of skipping over it.
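To make the skipping concrete: the compressed data begins right after the 30-byte fixed portion of the local file header plus that header's own `fileNameLength` and `extraFieldLength`. Here's a rough sketch that recomputes the range by hand with a raw `fs` read (placeholder path, minimal error handling):

```js
// Sketch: compute each entry's raw data byte range from the *local* file header.
// In the local file header, fileNameLength lives at byte offset 26 and
// extraFieldLength at byte offset 28; the fixed portion is 30 bytes long.
const fs = require("fs");
const yauzl = require("yauzl");

const zipPath = "/path/to/your/file.zip"; // placeholder
yauzl.open(zipPath, { lazyEntries: true }, (err, zipfile) => {
  if (err) throw err;
  const fd = fs.openSync(zipPath, "r");
  zipfile.readEntry();
  zipfile.on("entry", (entry) => {
    const fixed = Buffer.alloc(30);
    fs.readSync(fd, fixed, 0, 30, entry.relativeOffsetOfLocalHeader);
    const dataStart = entry.relativeOffsetOfLocalHeader + 30 +
      fixed.readUInt16LE(26) + // fileNameLength from the local header
      fixed.readUInt16LE(28);  // extraFieldLength from the local header
    const dataEnd = dataStart + entry.compressedSize; // exclusive
    console.log(entry.fileName + ": raw data bytes [" + dataStart + ", " + dataEnd + ")");
    zipfile.readEntry();
  });
  zipfile.on("end", () => fs.closeSync(fd));
});
```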
It sounds like you might like to have a full readout of all the low-level fields from both the central and local headers. Let me know if that sounds helpful, and sorry again for the delayed response.
I've just released yauzl 3.1.0, which has several features that might help you.
Try running `node examples/compareCentralAndLocalHeaders.js /path/to/your/file.zip`. It will show a comparison of local file header info and central directory info for each item in the zip file. In my experience, the extra fields that encode timestamps and other fs-related metadata are almost never the same between the two.
For extracting the true data ranges, see `readLocalFileHeader()`, and you probably want `{minimal: true}`. Then also see `openReadStreamLowLevel()` for more related commentary.
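A rough sketch of how that can fit together, assuming `readLocalFileHeader()` is called on the ZipFile with the entry and that the `{minimal: true}` result exposes the offset where the entry's data begins as `fileDataStart` (the callback shape and that property name are assumptions of this sketch; the README has the authoritative signatures):

```js
// Rough sketch only: verify the exact readLocalFileHeader() signature and the
// shape of its result against the yauzl 3.1.0 README before relying on this.
const yauzl = require("yauzl");

yauzl.open("/path/to/your/file.zip", { lazyEntries: true }, (err, zipfile) => {
  if (err) throw err;
  zipfile.readEntry();
  zipfile.on("entry", (entry) => {
    zipfile.readLocalFileHeader(entry, { minimal: true }, (err, localFileHeader) => {
      if (err) throw err;
      // fileDataStart is assumed here to be the offset of the entry's raw data.
      const start = localFileHeader.fileDataStart;
      const end = start + entry.compressedSize; // exclusive end of the compressed data
      console.log(entry.fileName + ": raw data bytes [" + start + ", " + end + ")");
      zipfile.readEntry();
    });
  });
});
```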
Let me know if that solves your problem, and please feel free to reopen if you have any further issues or questions!
I'm trying to extract the data offset byte ranges for each entry in a zipfile. The files I'm working with seem to have different values for the `extraFieldLength` in the central directory file header vs the local file header. I've noticed that in the readme, you state that the local file headers are ignored except for checking the signature, but that doesn't seem exactly right. When creating a `readStream` for an entry, this library (correctly) uses the values of `extraFieldLength` and `fileNameLength` from the local file header to calculate the `localFileHeaderEnd`.

Do you know why the value for `extraFieldLength` would differ between these two locations, and why the local file header would be the correct value? What was your reason for using the value from the local file header? I also need to use the correct value to extract the true data ranges for each entry, but it seems I can't rely on the `extraFieldLength` that is emitted for the entries generated by `readEntry()`. I'm considering forking your excellent lib so I can add an extra function to get at the correct offsets, but if you've got a better idea I'd love to hear it!

Thank you!