mhr3 / unzip-stream

node.js cross-platform unzip using streams
MIT License

Unexpected signature in zip file error #50

Open neilmayhew opened 2 months ago

neilmayhew commented 2 months ago

The full message (with debug logging) is:

Unexpected signature in zip file: 0x16d4b50 "PKm", skipped 1 bytes

This zip file is one that was created by the upload-artifact GitHub Action. The download-artifact action is unable to download the artifact because it's using unzip-stream which is unable to extract it.

I can provide the zip, but it's fairly large (938M) so I'm not sure if I should attach it here.

rmunn commented 2 months ago

If you haven't changed your artifact retention settings, the zip file causing the issue will be retained for 90 days by default. As long as your build is in a public repo, then you could just link to the specific build page and the artifact .zip file would be available to download from there.

neilmayhew commented 2 months ago

Unfortunately, even though it's an open-source project, we have our retention time set to 1 day. We use these artifacts solely for the purpose of moving build results from the build jobs to the test jobs, so there's no need to keep them.

rmunn commented 2 months ago

https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/attaching-files says the maximum file size for issue attachments is 10 MB for images and videos (100 MB for videos if the uploader has a paid plan) and 25 MB for all other file types, so it's a moot point. If you want to share the failing .zip file, you'll need to upload it to a service like Google Drive or Mega Share.

neilmayhew commented 2 months ago

I've put the zip archive here.

The archive can be opened and the contents extracted by every other zip utility I've tried. It was produced by actions/upload-artifact, which uses the archiver package to create it. The version of archiver being used is currently 7.0.1, but it was previously 5.3.1. I've been able to work around the problem in our CI by pinning the version of actions/upload-artifact to 4.2.0, which uses the older version of archiver.

I've been using the following script to run unzip-stream on it:

#!/usr/bin/env node

'use strict'
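// Usage: unzip.js <archive.zip> <output-dir>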
const fs = require('fs');
const unzip = require('unzip-stream');

fs.createReadStream(process.argv[2])
  .pipe(unzip.Extract({
    path: process.argv[3],
    debug: true
  }));

The results are like this:

$ scripts/unzip.js ~/Downloads/state-8.10.7-ubuntu-latest.zip output
decoded LOCAL_FILE_HEADER: {
  "versionsNeededToExtract": 20,
  "flags": "0x8",
  "compressionMethod": 8,
  "lastModifiedTime": 29644,
  "lastModifiedDate": 22738,
  "crc32": 0,
  "compressedSize": 0,
  "uncompressedSize": 0,
  "fileNameLength": 9,
  "extraFieldLength": 0,
  "extra": {},
  "path": "state.tar",
  "extraFields": []
}
Skipped 1 bytes
Unexpected signature in zip file: 0x16d4b50 "PKm", skipped 1 bytes
node:events:496
      throw er; // Unhandled 'error' event
      ^

Error: Invalid signature in zip file
    at UnzipStream.processDataChunk (~/src/actions/toolkit/node_modules/unzip-stream/lib/unzip-stream.js:155:40)
    at UnzipStream._parseOrOutput (~/src/actions/toolkit/node_modules/unzip-stream/lib/unzip-stream.js:651:28)
    at Array.done (~/src/actions/toolkit/node_modules/unzip-stream/lib/unzip-stream.js:713:18)
    at callFinishedCallbacks (node:internal/streams/writable:973:25)
    at finish (node:internal/streams/writable:944:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
Emitted 'error' event on Extract instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

Node.js v20.12.2

package.json specifies "unzip-stream": "^0.3.1"

rmunn commented 2 months ago

The offending 0x50 0x4b 0x6d 0x01 sequence appears right at the end of the file, about where the central directory file header (0x50 0x4b 0x01 0x02) would be expected. I was able to look at @neilmayhew's project yesterday, and downloaded a couple of zip files produced by a different run of the GitHub Actions workflow before the artifacts expired, so I have another bad .zip file with the exact same 0x50 0x4b 0x6d 0x01 signature, and a good .zip file produced by the same build run. Here are some hexdumps from the ends of those three zip files:

The .zip file that @neilmayhew linked to in https://github.com/mhr3/unzip-stream/issues/50#issuecomment-2179207590:

3a91e0c0: 74f8 a094 e091 3a00 0000 0000 504b 6d01  t.....:.....PKm.
3a91e0d0: 0000 0050 4b01 022d 032d 0008 0008 00cc  ...PK..-.-......
3a91e0e0: 73d2 5835 74f8 a0ff ffff ffff ffff ff09  s.X5t...........
3a91e0f0: 001c 0000 0000 0000 0020 00a4 81ff ffff  ......... ......
3a91e100: ff73 7461 7465 2e74 6172 0100 1800 0050  .state.tar.....P
3a91e110: 4b6d 0100 0000 94e0 913a 0000 0000 0000  Km.......:......
3a91e120: 0000 0000 0000 504b 0506 0000 0000 0100  ......PK........
3a91e130: 0100 5300 0000 d3e0 913a 0000            ..S......:..

Another .zip file that failed, from a different run of the same build process:

3a90ce30: 4ccf ba04 ce90 3a00 0000 0000 504b 6d01  L.....:.....PKm.
3a90ce40: 0000 0050 4b01 022d 032d 0008 0008 0096  ...PK..-.-......
3a90ce50: 6ad2 58c9 4ccf baff ffff ffff ffff ff09  j.X.L...........
3a90ce60: 001c 0000 0000 0000 0020 00a4 81ff ffff  ......... ......
3a90ce70: ff73 7461 7465 2e74 6172 0100 1800 0050  .state.tar.....P
3a90ce80: 4b6d 0100 0000 04ce 903a 0000 0000 0000  Km.......:......
3a90ce90: 0000 0000 0000 504b 0506 0000 0000 0100  ......PK........
3a90cea0: 0100 5300 0000 43ce 903a 0000            ..S...C..:..

A valid .zip file from the same build process, but a different combination of matrix flags so it didn't trigger whatever is producing these 0x50 0x4b 0x6d 0x01 signatures:

385c9820: 0000 0040 b545 0100 0000 504b 0102 2d03  ...@.E....PK..-.
385c9830: 2d00 0800 0800 6664 d258 e4dc 6ba0 ffff  -.....fd.X..k...
385c9840: ffff ffff ffff 0900 1c00 0000 0000 0000  ................
385c9850: 2000 a481 ffff ffff 7374 6174 652e 7461   .......state.ta
385c9860: 7201 0018 0000 40b5 4501 0000 00eb 975c  r.....@.E......\
385c9870: 3800 0000 0000 0000 0000 0000 0050 4b05  8............PK.
385c9880: 0600 0000 0001 0001 0053 0000 002a 985c  .........S...*.\
385c9890: 3800 00                                  8..

neilmayhew commented 2 months ago

My assumption, after looking at the code for a bit, is that it's legitimate for archive creators to insert padding bytes between segments, and that it's difficult for a streaming consumer of the archive to navigate these correctly if they happen to contain garbage. A non-streaming consumer can use the directory entries to seek to the right offset in the file and thus never sees the garbage.

As the README says,

Please note that the zip file format isn't really meant to be processed by streaming, though this library should succeed in most cases

It seems that my project has been hitting those cases in which it doesn't succeed, perhaps because of the size of the files being archived. This would explain why it fails for some of the matrix jobs and not others, because the older compiler produces slightly larger build results. We found that gzipping the tar archive that's placed in the zip archive bypassed the problem, perhaps because the files are smaller. However, this double compression adds about 30min to our CI time, so the downgrade approach is preferable.

neilmayhew commented 2 months ago

The fact that the "bad signature" appears near the end of the file is consistent with the behaviour I've observed, which is that it takes a while for the Extract operation to encounter the error, presumably because it's been reading all of the content of the large single archive member before it hits the bad bytes.

rmunn commented 2 months ago

I think I've figured out what's going on. The file size of the uncompressed state.tar file is 6128619520 bytes. If you convert that to hex, 6128619520 is 0x16d4b5000. Does that hex value look familiar? Write it out in little-endian byte order, skip the first byte, and the next four bytes read as 0x16d4b50, the very value that unzip-stream was interpreting as a signature, since it contains the bytes 0x50 0x4b.
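
A quick way to see the coincidence in plain Node (an illustration only, not unzip-stream code):

    // The uncompressed size, written as a little-endian 64-bit value,
    // happens to contain the bytes 0x50 0x4b ("PK").
    const buf = Buffer.alloc(8);
    buf.writeBigUInt64LE(6128619520n);
    console.log(buf); // <Buffer 00 50 4b 6d 01 00 00 00>

    // Skip the leading 0x00, and the next four bytes, read as a little-endian
    // uint32, give exactly the bogus "signature" from the error message.
    console.log('0x' + buf.readUInt32LE(1).toString(16)); // 0x16d4b50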

In fact, the hexdumps I posted in my previous comment were just one line too short. I should have started with the previous line:

3a91e0b0: 430d 35d4 5043 0dbb e07f 0150 4b07 0835  C.5.PC.....PK..5
3a91e0c0: 74f8 a094 e091 3a00 0000 0000 504b 6d01  t.....:.....PKm.
3a91e0d0: 0000 0050 4b01 022d 032d 0008 0008 00cc  ...PK..-.-......
3a91e0e0: 73d2 5835 74f8 a0ff ffff ffff ffff ff09  s.X5t...........
3a91e0f0: 001c 0000 0000 0000 0020 00a4 81ff ffff  ......... ......
3a91e100: ff73 7461 7465 2e74 6172 0100 1800 0050  .state.tar.....P
3a91e110: 4b6d 0100 0000 94e0 913a 0000 0000 0000  Km.......:......
3a91e120: 0000 0000 0000 504b 0506 0000 0000 0100  ......PK........
3a91e130: 0100 5300 0000 d3e0 913a 0000            ..S......:..

Here we see a 0x50 0x4b 0x07 0x08 signature, which marks the start of a data descriptor as follows:

Signature:          504b 0708
Crc-32:             3574 f8a0
Compressed size:    94e0 913a 0000 0000 = 0x3a91e094 = 982638740 decimal
Uncompressed size:  0050 4b6d 0100 0000 = 0x16d4b5000 = 6128619520 decimal

So what's happening is that unzip-stream is reading the data descriptor header, but incorrectly thinking that it will have 4-byte sizes rather than 8-byte sizes. This results in unzip-stream starting to look for the next data descriptor header right at the start of the uncompressed size bytes. And they just so happen to have 0x50 0x4b in them, so unzip-stream says "Oh, a signature marker" and processes it as such, then throws an error when the signature marker turns out to be invalid. If instead unzip-stream had skipped the invalid signature marker and kept on looking, it would have found a valid 0x50 0x4b 0x01 0x02 marker just seven bytes later.
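
To see both readings side by side, here's a small illustration in plain Node (not unzip-stream's code), using the data descriptor bytes from the hexdump above:

    // sig(4) + crc(4) + compressed(8) + uncompressed(8) = 24 bytes.
    const dd = Buffer.from(
        '504b07083574f8a094e0913a0000000000504b6d01000000', 'hex');

    // 8-byte (zip64) interpretation -- the correct one for this archive:
    console.log(dd.readBigUInt64LE(8));  // 982638740n  = 0x3a91e094 (compressed)
    console.log(dd.readBigUInt64LE(16)); // 6128619520n = 0x16d4b5000 (uncompressed)

    // 4-byte interpretation: sig(4) + crc(4) + 4 + 4 = 16 bytes consumed, so
    // the parser resumes its signature search at offset 16, the start of the
    // real 8-byte uncompressed size. It skips the leading 0x00 and misreads:
    console.log(dd.readUInt32LE(12));              // 0 (bogus "uncompressed size")
    console.log(dd.readUInt32LE(17).toString(16)); // '16d4b50' -- the bad signature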

But wait: why did unzip-stream think the data descriptor would have 4-byte file sizes? Because here's the header of the file and what it means:

00000000: 504b 0304 1400 0800 0800 cc73 d258 0000  PK.........s.X..
00000010: 0000 0000 0000 0000 0000 0900 0000 7374  ..............st
00000020: 6174 652e 7461 72ec bdfd 72db 46b6 2f3a  ate.tar...r.F./:

Signature:          50 4b 03 04 (local file header)
Version:            1400
Flags:              0800 (data descriptor present)
Compression:        0800 (deflate)
Mod time:           cc73
Mod date:           d258
Crc-32:             0000 0000 (in data descriptor instead)
Compressed size:    0000 0000 (in data descriptor instead)
Uncompressed size:  0000 0000 (in data descriptor instead)
File name len:      0900
Extra field len:    0000
Filename:           state.tar
Extra field:        (empty)

Now, the .zip format documentation says this in section 4.4.9:

If bit 3 of the general purpose bit flag is set, these fields are set to zero in the local header and the correct values are put in the data descriptor and in the central directory. If an archive is in ZIP64 format and the value in this field is 0xFFFFFFFF, the size will be in the corresponding 8 byte ZIP64 extended information extra field.

Here there was no extra field, and unzip-stream's assumption is that sizes will be 8 bytes if and only if there's a zip64 extra field. E.g., this code that looks for the data descriptor in _prepareOutStream:

    var fileSizeKnown = !(vars.flags & 0x08);
    if (fileSizeKnown) {
        entry.size = vars.uncompressedSize;
    }
    // ...
    if (!fileSizeKnown) {
        var pattern = new Buffer(4);
        pattern.writeUInt32LE(SIG_DATA_DESCRIPTOR, 0);
        var zip64Mode = vars.extra.zip64Mode;
        var extraSize = zip64Mode ? 20 : 12;
        var searchPattern = {
            pattern: pattern,
            requiredExtraSize: extraSize
        }
    // ...
    }

If you search for zip64Mode, you'll find that it's only set if unzip-stream encounters an extra field with type 0x0001, the marker that indicates zip64 format.
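
For reference, extra fields are a sequence of (2-byte header ID, 2-byte data size, data) records, and zip64 mode is signalled by header ID 0x0001. A minimal sketch of that kind of scan (an illustration, not unzip-stream's actual implementation):

    // Scan a local file header's extra field for the zip64
    // extended-information record (header ID 0x0001).
    function hasZip64ExtraField(extraField) {
        let offset = 0;
        while (offset + 4 <= extraField.length) {
            const headerId = extraField.readUInt16LE(offset);
            const dataSize = extraField.readUInt16LE(offset + 2);
            if (headerId === 0x0001) return true; // zip64 extended information
            offset += 4 + dataSize;
        }
        return false;
    }

In the failing archive the local header's extraFieldLength is 0, so a scan like this finds nothing and the data descriptor sizes are assumed to be 4 bytes each.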

However, this zip file proves that some programs that create .zip files do not set zip64 mode. Is this a bug in the creating program? Well, yes: section 4.3.9 of the zip file standard says:

4.3.9  Data descriptor:

    crc-32                          4 bytes
    compressed size                 4 bytes
    uncompressed size               4 bytes

  4.3.9.1 This descriptor MUST exist if bit 3 of the general
  purpose bit flag is set (see below).  It is byte aligned
  and immediately follows the last byte of compressed data.
  This descriptor SHOULD be used only when it was not possible to
  seek in the output .ZIP file, e.g., when the output .ZIP file
  was standard output or a non-seekable device.  For ZIP64(tm) format
  archives, the compressed and uncompressed sizes are 8 bytes each.

  4.3.9.2 When compressing files, compressed and uncompressed sizes 
  SHOULD be stored in ZIP64 format (as 8 byte values) when a 
  file's size exceeds 0xFFFFFFFF.   However ZIP64 format MAY be 
  used regardless of the size of a file.  When extracting, if 
  the zip64 extended information extra field is present for 
  the file the compressed and uncompressed sizes will be 8
  byte values.  

So 8-byte sizes in the absence of a zip64 marker are technically in violation of the spec. However, due to the fact that this .zip file was produced by GitHub Actions' upload-artifact action (which streams the .zip to its output), it was not actually possible for the compressed file size to be known ahead of time. And since the spec requires the compressed file size to be written into the zip64 extra field, the upload-artifact action chose not to write one because its compressed file size would be invalid anyway.

Which means that unzip-stream probably needs to try both the 4-byte version (total 4+4+4 = 12 bytes) and the 8-byte version (total 4+8+8 = 20 bytes) of the data descriptor, and check whether it finds a valid signature immediately following each. If a valid signature is found after either one, that tells you whether the sizes are 4 bytes or 8 bytes. (It is not possible for a valid signature to be found after both: if the data descriptor was supposed to be 12 bytes and you read 20 bytes, you'll be 8 bytes into the next block, which will put you in either the "flags" or "compression method" field. Either way, 0x504b would be a totally invalid value for flags or compression method, and will not appear.)
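
A minimal sketch of that probing idea (a hypothetical helper, not proposed as-is for unzip-stream; it assumes buf is positioned at the descriptor's 0x504b0708 signature with enough bytes buffered to look ahead):

    // Record signatures that can legitimately follow a data descriptor.
    const KNOWN_SIGS = new Set([
        0x04034b50, // local file header
        0x02014b50, // central directory file header
        0x06054b50, // end of central directory
    ]);

    // Returns the descriptor size after the signature (12 or 20), or null.
    function probeDescriptorSize(buf) {
        if (KNOWN_SIGS.has(buf.readUInt32LE(4 + 12))) return 12;
        if (KNOWN_SIGS.has(buf.readUInt32LE(4 + 20))) return 20;
        return null; // neither looks valid; fall back to scanning
    }

On the failing archive this probe would read 0x6d4b5000 after 12 bytes (not a signature) and 0x02014b50 (central directory) after 20 bytes, and so correctly pick the 8-byte sizes.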

Update: Actually, a better approach would be to read the compressed file size as either a 4-byte value or an 8-byte value, and see whether either one corresponds to the actual compressed size encountered thus far (which will be known at that point). If both are valid (and thus the same value), then the correct answer is to use 8-byte values. Why? Well, if the compressed file size is in fact less than 4GiB (let's say it's 0x12345678) but is written as an 8-byte value, then it will look like 78 56 34 12 00 00 00 00, and reading it as either a 4-byte value or an 8-byte value will be correct. BUT if the compressed file size was actually written as a 4-byte value, then it will be followed by an uncompressed size, e.g. 78 56 34 12 f3 c8 a0 9a, and reading that as an 8-byte value will be totally wrong. Therefore if both a 4-byte read and an 8-byte read of the compressed file size produce the same value, the 8-byte read is the correct one. (Besides which, if it was actually the 4-byte version, it would mean that the uncompressed file size was 0 bytes, and a file of 0 bytes would have been stored with 0 bytes of compressed size, so there wouldn't be any 4-byte vs. 8-byte confusion.)
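
A sketch of that updated approach (again hypothetical; it assumes bytesSeen is the count of compressed bytes already streamed for the entry, and buf is positioned at the descriptor signature):

    // Decide whether the data descriptor carries 4-byte or 8-byte sizes by
    // checking which reading of the compressed size matches what was streamed.
    function sizeFieldWidth(buf, bytesSeen) {
        const size4 = buf.readUInt32LE(8);     // after sig(4) + crc(4)
        const size8 = buf.readBigUInt64LE(8);
        // If both readings match (the upper four bytes are zero), the
        // 8-byte form is the correct one, per the reasoning above.
        if (size8 === BigInt(bytesSeen)) return 8;
        if (size4 === bytesSeen) return 4;
        throw new Error('data descriptor does not match streamed byte count');
    }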

neilmayhew commented 2 months ago

@rmunn Amazing! You've done some excellent detective work there.

So is your conclusion that the archive is invalid, because it doesn't indicate Zip64 format correctly? Or that unzip-stream's logic is incorrect?

rmunn commented 2 months ago

@neilmayhew - I think I answered that with the edit I was making while you commented. :-) But basically, it's both. The zip file is technically invalid due to the absence of a zip64 extra field, if I'm understanding the spec correctly. (Update: Nope. The zip file was valid; it does contain a zip64 extra field. See https://github.com/mhr3/unzip-stream/issues/50#issuecomment-2179905464 for more). But because it was produced by streaming the zip data directly to the output pipe (which was an upload to a storage area on Azure), it was impossible to know the compressed size ahead of time, so it was actually impossible to produce a valid zip64 extra field (which requires compressed and uncompressed sizes to be recorded). (Update: Actually, the valid zip64 extra field was produced, it was just at the bottom of the file because that's the only place you can put it when you're streaming, as you don't know the compressed file size ahead of time. So the zip file was valid, albeit tricky to handle in streaming mode.)

Therefore, my conclusion (at the bottom of my now-edited previous comment) is that unzip-stream should, on encountering a data descriptor header, try reading it in both 12-byte and 20-byte sizes, and try parsing the file sizes as either 4 bytes or 8 bytes. If the 8-byte size parsing produces incorrect values (the compressed file size is known at that point so that's an easy one to compare with because it occupies bytes 5 through 9 or bytes 5 through 12, so either way the bytes you're interpreting are entirely from the data descriptor header), then you know the 12-byte version (with 4-byte file sizes) was the correct one, and you have read 8 bytes too far into the stream. Push those 8 bytes back into your parser because they almost certainly start with another marker (probably a 0x50 0x4b 0x01 0x02 marker) and parse them.

Alternately, just read the 12-byte version and try to parse it. If compressed file size matches the known file size you received and is non-zero, but the uncompressed file size (the next 4 bytes) is zero, then you know that you actually are dealing with an 8-byte file size, and you need to read 8 more bytes to get the uncompressed file size. (You can know this because the spec says that a zero-byte file must be stored with no file data in the .zip file, and therefore the compressed file size should also have been zero. So if you see a 4-byte compressed file size that is non-zero, followed by four zero bytes, then you can know for certain that you're really looking at an 8-byte file size that is less than 4 GiB).

rmunn commented 2 months ago

So this bug could only trigger under the following conditions:

  1. The zip file does not contain an extra field of type 0x0001 (zip64).
  2. It does contain a data descriptor.
  3. The file size of the uncompressed file is greater than 4 gigabytes.
  4. The file size of the uncompressed file, expressed in little-endian order, happens to contain the bytes 0x50 0x4b in that order (so the hex size, written out in standard notation, would have the digits 4b50 somewhere in it).

The chances of all that happening are extremely small, but here we are. Congratulations, @neilmayhew - you have won the bug lottery. :grinning:

rmunn commented 2 months ago

Correction: the zip file is valid. There is, in fact, a zip64 extra field, but it's at the end of the file (in the central directory record), not immediately after the local file header. Non-streaming unzip software would therefore look at the central directory record, see that there's a zip64 extra field, and know that the file sizes involved would be 8 bytes. So that software, if it even looked at the data descriptor at all (which wouldn't really be necessary, given that that info is all duplicated in the end-of-file central directory record), would know that it should be 20 bytes in size, not 12 bytes.

But unzip-stream can't see the end-of-file directory while streaming, so it actually has no way at all to tell whether the data descriptor is going to be 12 bytes or 20 bytes, because it can't count on having seen a zip64 extra field before reaching the end of the zip file. The only reliable way to tell is to read the compressed file size as an 8-byte value, then do one of the two things described in my previous comments.

neilmayhew commented 2 months ago

Thanks, @rmunn, for this very comprehensive analysis.

Stepping back a bit, I see two things. First, the zip archive format is messy because extending it to support 64 bits was tricky. Second, it really isn't friendly to being read in a streaming fashion.

There have been calls to switch the format to tar, and this would be a great solution from a technical perspective, not least because as a "tape archiver" it was designed for streaming. As I'm sure you'll have seen, we're already using tar for the content of our artifact, because zip doesn't preserve file permissions. We're also using the --format pax option with tar because the default format has only a one-second resolution for timestamps, and this causes unnecessary rebuilding when we use the content.

I'm not sure what the logistical implications of supporting tar would be, though. Presumably there could be an input that specifies the format, and it could default to zip. Another approach would be to have a fork of the actions code that uses tar format. Given the presumed lack of resources in the official GH team, perhaps the latter is the best way to go. I'd be interested in collaborating on that if you were up for it, and I'm guessing there would be others too.

There's a tar package on NPM that seems to be widely used and actively maintained. However, forking the artifact actions repos really means forking the toolkit repo, which contains a lot of other functionality that would have to be maintained in parallel with the upstream one.

neilmayhew commented 2 months ago

Another alternative has just occurred to me. If the upload action could arrange for the Zip64 marker to be put at the beginning of the file instead of at the end, this would make it possible to read the file in a streaming fashion, and avoid having to make changes to unzip-stream. Also, in this particular use case (GitHub artifacts), if not all use cases, using Zip64 format by default makes sense, too. The space savings from using Zip32 are small and probably not worth the trouble.

rmunn commented 2 months ago

    Another alternative has just occurred to me. If the upload action could
    arrange for the Zip64 marker to be put at the beginning of the file instead
    of at the end, this would make it possible to read the file in a streaming
    fashion, and avoid having to make changes to unzip-stream. Also, in this
    particular use case (GitHub artifacts), if not all use cases, using Zip64
    format by default makes sense, too. The space savings from using Zip32 are
    small and probably not worth the trouble.

The upload library streams its contents to the destination as it zips them, which is why it can't know the compressed size ahead of time. And the zip spec says that the marker at the beginning of the file, if present, is supposed to include the compressed size. The only way to implement that suggestion would be to change the upload library to buffer each file as it's compressed, then write the compressed version to the stream. Which could consume a lot of memory, especially since people are working around the file-permissions bug by tarring the files beforehand and uploading a single tar file. So in cases like yours, basically the entire zip output would end up cached in RAM before being streamed out all at once, thereby entirely defeating the point of streaming it.

Streaming in tar format, though, would be a much easier change to make, as the library that upload-artifact is using already supports tar format.

neilmayhew commented 2 months ago

Downgrading upload-artifact to 4.2.0 so that it uses archiver version 5.3.1 instead of 7.0.1 avoids triggering this bug, though. Something about the way the archive is generated allows unzip-stream to handle it. Perhaps this is just chance, but it could be that some small change to archiver made it less easy to detect Zip64 format.

mhr3 commented 3 weeks ago

@rmunn That was an excellent analysis; it seems like unzip-stream could try to handle these cases. Care to write a patch for this scenario?