Closed jmottier-niryo closed 1 month ago
Hi @jmottier-niryo,
Thanks for the report - will have a nose around. I probably won't be able to download the file unless I can figure out some sort of sandbox for it... but the rest of the information you've given allows me to at least nose around Python's zipfile for example to see if it handles symlinks differently somehow.
Thanks!
Michal
Actually @jmottier-niryo, are you able to share the file name that errors, its file size as reported by stream_unzip, and also the bytes that it manages to extract before it hits the exception? If it's a symlink then it shouldn't be that many bytes
Ah: and what package/program made the zipfile?
While trying to gather the information you requested, I encountered some strange behaviors.
Although directories have a file_size
of 0, all the files show a file_size
of None. To calculate the total size of the extracted bytes, I used the length of each extracted chunk instead.
Here is the code I used:
with open('PATH/TO/ZIPFILE', 'rb') as zip_file:
total_size = 0
for file_name, file_size, unzipped_chunks in stream_unzip(zip_file):
try:
real_file_size = sum(len(chunk) for chunk in unzipped_chunks)
total_size += real_file_size
print(f'{file_name=}, {file_size=}, {real_file_size=}')
except CRC32IntegrityError:
print('###### ERROR ######')
print(f'CRC32 Integrity Error: {file_name=}, {file_size=}')
print(f'n bytes extracted: {total_size}')
return
And here are the logs produced by the script: out_1.log
Here are the last 10 lines of the output in case you don't want to download the file:
As you can see the script managed to extract 19,210 files for a total of 62,218,274 bytes before raising the exception for the file devel/release/_setup_util.py
.
What's very interesting is that its file_size
is 0, and not None like all the other regular files. I can't get its real size since I would have to iterate over its chunks to get it.
Concerning the way the zipfile is created, it's just the artifacts
keyword in our .gitlab-ci.yml
file:
artifacts:
expire_in: 1 week
paths:
- build
- devel
- install
- src
We have zero control over how the zipfile is generated
Regarding concerns about safely extracting the zip file, wouldn't running the extraction in a Docker container provide the necessary sandbox environment ?
What's very interesting is that its file_size is 0, and not None like all the other regular files.
Ah this is interesting. So, the None/0 thing: if it's None, then in the ZIP file the CRC32 and size of the file is stored just after the data of the member file, in its so-called "data descriptor". If it's 0, then both the CRC32 and file size are stored just before. And the size being 0 is also essentially "wrong" it sounds like... if this is a symbolic link, then the path to the target of the link should be bytes of the member file, and there would be more than one byte here.
All this hints at why unzip and Python's zipfile can unzip the file: both of them are not stream unzippers, and use metadata in the "central directory" way at the end of the ZIP, not the metadata just before/after the data of each member file like stream unzip does.
It's sounding a bit like the ZIP is actually invalid... the local metadata in this particular file is actually wrong...
Concerning the way the zipfile is created, it's just the artifacts keyword in our .gitlab-ci.yml file:
... So I'm wondering if there is an issue with GitLab here? We might be hitting a limitation of how a stream unzipper can behave and so there might not realistically be any action to take in this project.
Regarding concerns about safely extracting the zip file, wouldn't running the extraction in a Docker container provide the necessary sandbox environment ?
Possibly, but to download from a Google Drive link I think would have to initially download outside of Docker?
Navigating the GitLab source code, specifically its runner that I think creates the archives, is https://github.com/saracen/fastzip ultimately the library that's used under the hood? I wonder if there is some symlink issue there...
You mentioned:
The symlink points to a file that doesn’t exist on the system where it’s being extracted (this is intentional).
Does the target of the symlink exist in the runner itself when the zip is created?
I managed to reproduce the case with a minimal reproducible file with this job:
test_job:
stage: build
script:
- mkdir mydir
- echo "Hello, World!" > mydir/hello.txt
- ln -s hello.txt mydir/hello.link.txt
artifacts:
paths:
- mydir
Here is the artifact encoded in base64, like this you have a handy way to test it into any sandbox:
With the same script as above I get the following output:
file_name=b'mydir/', file_size=0, real_file_size=0
###### ERROR ######
CRC32 Integrity Error: file_name=b'mydir/hello.link.txt', file_size=0
n bytes extracted: 0
Whether the symlink points to a valid file or not does not change anything
So I guess you're right, the problem comes from the way symlinks are handled by the GitLab zip lib.
But I'm wondering if it's really a bug on their side or if it's just a way to zip files which is incompatible with stream-unzip
So I have now recreated the issue with the minimal file
But I'm wondering if it's really a bug on their side or if it's just a way to zip files which is incompatible with stream-unzip
So nosing around the file and stream-unzip... I think the file is valid, and you're right, it is just a way to zip files that is incompatible with stream-unzip (or indeed other stream unzippers like https://github.com/madler/sunzip)
I think stream-unzip could be more helpful in this case though - the CRC32 error is just confusing (and some of what I said above about the meaning of the 0
as the file size is just wrong)
That being said, I would argue there is an improvement to be made with how GitLab (or whichever library it uses) makes ZIP files: it unnecessarily makes the file non-stream-unzippable when symlinks are used, because it stores the symlink as non-compressed data ("STORED") but also with a data descriptor which means the length of the member file is after the data, not before. A stream unzipper can handle a data descriptor with compressed data, because the compressed data stream itself indicates when it ends, but that isn't possible with arbitrary uncompressed data.
So I think the upstream library should either store symlinks as compressed, or without using a data descriptor, and I would guess that no data descriptor would be the better option.
From working on the error at https://github.com/uktrade/stream-unzip/pull/106 realised there are actually 3 things that upstream could do:
However, 3. results in a file that is technically against the ZIP spec, but infozip in at least some cases makes such a ZIP (which is how I found the case), so there is precedent. (For completeness the case specifically is a non-compressed file that is password protected. Maybe there are others)
https://github.com/uktrade/stream-unzip/pull/106 now released in v0.0.95 (which doesn't give your case much other than a clearer error: NotStreamUnzippable)
Have also now made it clearer that some ZIPs cannot be stream-unzipped in the docs: https://stream-unzip.docs.trade.gov.uk/limitations/
I think there isn't really more to be done in this project on this issue. I think the only way to really address this issue is via the library that makes ZIPs in GitLab, and the issue is now raised there https://github.com/saracen/fastzip/issues/48. So closing here (I am semi-guessing that's the library... a chance it could be something else...)
Description
I'm encountering issues when unzipping a specific zip file. It’s a GitLab artifact generated by our CI/CD pipeline. The issue occurs only with this particular artifact, and I receive a
CRC32IntegrityError
during the unzip process.Context
unzip
CLI or Python'szipfile
library.Code to reproduce
File to reproduce
Unfortunately, I can't provide a minimal reproducible file, as the issue only appears with this specific artifact. The archive is about 300MB and contains a build of a ROS (Robot Operating System) stack.
Link to the file: Gitlab Artifact