uktrade / stream-unzip

Python function to stream unzip all the files in a ZIP archive on the fly
https://stream-unzip.docs.trade.gov.uk/
MIT License
279 stars 14 forks source link

CRC32IntegrityError while extracting #105

Closed jmottier-niryo closed 1 month ago

jmottier-niryo commented 1 month ago

Description

I'm encountering issues when unzipping a specific zip file. It’s a GitLab artifact generated by our CI/CD pipeline. The issue occurs only with this particular artifact, and I receive a CRC32IntegrityError during the unzip process.

Context

Code to reproduce

with open('PATH/TO/ZIPFILE', 'rb') as zip_file:
    for file_name, file_size, unzipped_chunks in stream_unzip(zip_file):
        try:
            for _ in unzipped_chunks:
                pass
        except CRC32IntegrityError as e:
            print(f'CRC32 integrity error on file {file_name}')
            raise e

File to reproduce

Unfortunately, I can't provide a minimal reproducible file, as the issue only appears with this specific artifact. The archive is about 300MB and contains a build of a ROS (Robot Operating System) stack.

Link to the file: Gitlab Artifact

michalc commented 1 month ago

Hi @jmottier-niryo,

Thanks for the report - will have a nose around. I probably won't be able to download the file unless I can figure out some sort of sandbox for it... but the rest of the information you've given allows me to at least nose around Python's zipfile for example to see if it handles symlinks differently somehow.

Thanks!

Michal

michalc commented 1 month ago

Actually @jmottier-niryo, are you able to share the file name that errors, its file size as reported by stream_unzip, and also the bytes that it manages to extract before it hits the exception? If it's a symlink then it shouldn't be that many bytes

michalc commented 1 month ago

Ah: and what package/program made the zipfile?

jmottier-niryo commented 1 month ago

While trying to gather the information you requested, I encountered some strange behaviors.

Although directories have a file_size of 0, all the files show a file_size of None. To calculate the total size of the extracted bytes, I used the length of each extracted chunk instead.

Here is the code I used:

with open('PATH/TO/ZIPFILE', 'rb') as zip_file:
    total_size = 0
    for file_name, file_size, unzipped_chunks in stream_unzip(zip_file):
        try:
            real_file_size = sum(len(chunk) for chunk in unzipped_chunks)
            total_size += real_file_size
            print(f'{file_name=}, {file_size=}, {real_file_size=}')
        except CRC32IntegrityError:
            print('###### ERROR ######')
            print(f'CRC32 Integrity Error: {file_name=}, {file_size=}')
            print(f'n bytes extracted: {total_size}')
            return

And here are the logs produced by the script: out_1.log

Here are the last 10 lines of the output in case you don't want to download the file:

out_1.log ``` file_name='devel/release/.private/ttl_driver/share/roseus/ros/ttl_driver/srv/WriteVelocityProfile.l', file_size=None, real_file_size=7230 file_name='devel/release/.private/ttl_driver/share/ttl_driver/', file_size=0, real_file_size=0 file_name='devel/release/.private/ttl_driver/share/ttl_driver/cmake/', file_size=0, real_file_size=0 file_name='devel/release/.private/ttl_driver/share/ttl_driver/cmake/ttl_driver-msg-extras.cmake', file_size=None, real_file_size=961 file_name='devel/release/.private/ttl_driver/share/ttl_driver/cmake/ttl_driver-msg-paths.cmake', file_size=None, real_file_size=227 file_name='devel/release/.private/ttl_driver/share/ttl_driver/cmake/ttl_driverConfig-version.cmake', file_size=None, real_file_size=426 file_name='devel/release/.private/ttl_driver/share/ttl_driver/cmake/ttl_driverConfig.cmake', file_size=None, real_file_size=9687 ###### ERROR ###### CRC32 Integrity Error: file_name='devel/release/_setup_util.py', file_size=0 n bytes extracted: 62218274 ```

As you can see the script managed to extract 19,210 files for a total of 62,218,274 bytes before raising the exception for the file devel/release/_setup_util.py. What's very interesting is that its file_size is 0, and not None like all the other regular files. I can't get its real size since I would have to iterate over its chunks to get it.

Concerning the way the zipfile is created, it's just the artifacts keyword in our .gitlab-ci.yml file:

  artifacts:
    expire_in: 1 week
    paths:
      - build
      - devel
      - install
      - src

We have zero control over how the zipfile is generated

Regarding concerns about safely extracting the zip file, wouldn't running the extraction in a Docker container provide the necessary sandbox environment ?

michalc commented 1 month ago

What's very interesting is that its file_size is 0, and not None like all the other regular files.

Ah this is interesting. So, the None/0 thing: if it's None, then in the ZIP file the CRC32 and size of the file is stored just after the data of the member file, in its so-called "data descriptor". If it's 0, then both the CRC32 and file size are stored just before. And the size being 0 is also essentially "wrong" it sounds like... if this is a symbolic link, then the path to the target of the link should be bytes of the member file, and there would be more than one byte here.

All this hints at why unzip and Python's zipfile can unzip the file: both of them are not stream unzippers, and use metadata in the "central directory" way at the end of the ZIP, not the metadata just before/after the data of each member file like stream unzip does.

It's sounding a bit like the ZIP is actually invalid... the local metadata in this particular file is actually wrong...

Concerning the way the zipfile is created, it's just the artifacts keyword in our .gitlab-ci.yml file:

... So I'm wondering if there is an issue with GitLab here? We might be hitting a limitation of how a stream unzipper can behave and so there might not realistically be any action to take in this project.

Regarding concerns about safely extracting the zip file, wouldn't running the extraction in a Docker container provide the necessary sandbox environment ?

Possibly, but to download from a Google Drive link I think would have to initially download outside of Docker?

michalc commented 1 month ago

Navigating the GitLab source code, specifically its runner that I think creates the archives, is https://github.com/saracen/fastzip ultimately the library that's used under the hood? I wonder if there is some symlink issue there...

You mentioned:

The symlink points to a file that doesn’t exist on the system where it’s being extracted (this is intentional).

Does the target of the symlink exist in the runner itself when the zip is created?

jmottier-niryo commented 1 month ago

I managed to reproduce the case with a minimal reproducible file with this job:

test_job:
  stage: build
  script:
    - mkdir mydir
    - echo "Hello, World!" > mydir/hello.txt
    - ln -s hello.txt mydir/hello.link.txt
  artifacts:
    paths:
      - mydir

Here is the artifact encoded in base64, like this you have a handy way to test it into any sandbox:

artifacts.zip.base64 ``` UEsDBBQAAAgAAGlERFkAAAAAAAAAAAAAAAAGACEAbXlkaXIvdXgLAAEEAAAAAAQAAAAAVVQFAAHH qP9mVVQFAAHHqP9mUEsDBBQACAgAAGlERFkAAAAAAAAAAAAAAAAUACEAbXlkaXIvaGVsbG8ubGlu ay50eHR1eAsAAQQAAAAABAAAAABVVAUAAceo/2ZVVAUAAceo/2ZoZWxsby50eHRQSwcIu85gEgkA AAAJAAAAUEsDBBQACAgIAGlERFkAAAAAAAAAAAAAAAAPACEAbXlkaXIvaGVsbG8udHh0dXgLAAEE AAAAAAQAAAAAVVQFAAHHqP9mVVQFAAHHqP9m8kjNycnXUQjPL8pJUeQCBAAA//9QSwcIhJ7otBQA AAAOAAAAUEsBAhQDFAAACAAAaUREWQAAAAAAAAAAAAAAAAYAIQAAAAAAAAAQAO1BAAAAAG15ZGly L3V4CwABBAAAAAAEAAAAAFVUBQABx6j/ZlVUBQABx6j/ZlBLAQIUAxQACAgAAGlERFm7zmASCQAA AAkAAAAUACEAAAAAAAAAAAD/oUUAAABteWRpci9oZWxsby5saW5rLnR4dHV4CwABBAAAAAAEAAAA AFVUBQABx6j/ZlVUBQABx6j/ZlBLAQIUAxQACAgIAGlERFmEnui0FAAAAA4AAAAPACEAAAAAAAAA AACkgbEAAABteWRpci9oZWxsby50eHR1eAsAAQQAAAAABAAAAABVVAUAAceo/2ZVVAUAAceo/2ZQ SwUGAAAAAAMAAwAWAQAAIwEAAAAA ``` To create the zipfile you just have to do `base64 -d artifacts.zip.base64 > artifacts.zip`

With the same script as above I get the following output:

file_name=b'mydir/', file_size=0, real_file_size=0
###### ERROR ######
CRC32 Integrity Error: file_name=b'mydir/hello.link.txt', file_size=0
n bytes extracted: 0

Whether the symlink points to a valid file or not does not change anything

So I guess you're right, the problem comes from the way symlinks are handled by the GitLab zip lib.

But I'm wondering if it's really a bug on their side or if it's just a way to zip files which is incompatible with stream-unzip

michalc commented 1 month ago

So I have now recreated the issue with the minimal file

But I'm wondering if it's really a bug on their side or if it's just a way to zip files which is incompatible with stream-unzip

So nosing around the file and stream-unzip... I think the file is valid, and you're right, it is just a way to zip files that is incompatible with stream-unzip (or indeed other stream unzippers like https://github.com/madler/sunzip)

I think stream-unzip could be more helpful in this case though - the CRC32 error is just confusing (and some of what I said above about the meaning of the 0 as the file size is just wrong)

That being said, I would argue there is an improvement to be made with how GitLab (or whichever library it uses) makes ZIP files: it unnecessarily makes the file non-stream-unzippable when symlinks are used, because it stores the symlink as non-compressed data ("STORED") but also with a data descriptor which means the length of the member file is after the data, not before. A stream unzipper can handle a data descriptor with compressed data, because the compressed data stream itself indicates when it ends, but that isn't possible with arbitrary uncompressed data.

So I think the upstream library should either store symlinks as compressed, or without using a data descriptor, and I would guess that no data descriptor would be the better option.

michalc commented 1 month ago

From working on the error at https://github.com/uktrade/stream-unzip/pull/106 realised there are actually 3 things that upstream could do:

  1. Store the symlinks as compressed
  2. Store the symlinks as non-compressed but without a data descriptor
  3. Store the symlinks as non-compressed, with a data descriptor, but also giving their lengths in the local header as long as they are non-zero (which I think must be the case since I suspect you can't have a zero-length target)

However, 3. results in a file that is technically against the ZIP spec, but infozip in at least some cases makes such a ZIP (which is how I found the case), so there is precedent. (For completeness the case specifically is a non-compressed file that is password protected. Maybe there are others)

michalc commented 1 month ago

https://github.com/uktrade/stream-unzip/pull/106 now released in v0.0.95 (which doesn't give your case much other than a clearer error: NotStreamUnzippable)

michalc commented 1 month ago

Have also now made it clearer that some ZIPs cannot be stream-unzipped in the docs: https://stream-unzip.docs.trade.gov.uk/limitations/

I think there isn't really more to be done in this project on this issue. I think the only way to really address this issue is via the library that makes ZIPs in GitLab, and the issue is now raised there https://github.com/saracen/fastzip/issues/48. So closing here (I am semi-guessing that's the library... a chance it could be something else...)