python / cpython

The Python programming language
https://www.python.org/
Other
61.36k stars 29.56k forks source link

Wrong extract version set in local headers for files located beyond zip64 limit #121171

Open dtheyer opened 4 weeks ago

dtheyer commented 4 weeks ago

Bug report

Bug description:

When creating a zip file which contains a file large enough to push additional files beyond the ZIP64_LIMIT, the extract version is correctly incremented to 4.5 in the Central Header, but not in the Local Header. This mismatch is usually ignored, but more picky implementations fail when validating the zip file.

When the Local Header is being generated, the extract version that is used does not account for the fact that if the header is located farther than the ZIP64_LIMIT, the ZIP64_VERSION is needed since to reference the header offset the Central Header needs to store the offset in the Extra Bits.

Repro

Create a directory with a file over the ZIP64_LIMIT and a file after it, then zip the directory:

>tree source_dir/ -h
source_dir/
├── [4.7G]  5_gig_file.dat
└── [ 229]  t
# Zip the test directory
out_file = "outputs/test.zip"
source_dir = "source_dir"

with zipfile.ZipFile(out_file, "w") as zipf:
    for root, _dirs, files in os.walk(source_dir):
        for file in files:
            file_path = os.path.join(root, file)
            arc_path = pathlib.Path(file_path).relative_to(source_dir)
            zipf.write(file_path, arcname=arc_path)

print(f"generated {out_file}")

Using zipdetails we can see the extract_version in the local header does not match the central header:

> zipdetails outputs/test.zip

...

12A05F240 LOCAL HEADER #2       04034B50
12A05F244 Extract Zip Spec      14 '2.0'
12A05F245 Extract OS            00 'MS-DOS'
12A05F246 General Purpose Flag  0000
12A05F248 Compression Method    0000 'Stored'
12A05F24A Last Mod Time         58DCB5CE 'Fri Jun 28 22:46:28 2024'
12A05F24E CRC                   BB63C172
12A05F252 Compressed Length     000000E5
12A05F256 Uncompressed Length   000000E5
12A05F25A Filename Length       0008
12A05F25C Extra Length          0000
12A05F25E Filename              'test.TXT'
...

12A05F39B CENTRAL HEADER #2     02014B50
12A05F39F Created Zip Spec      2D '4.5'
12A05F3A0 Created OS            00 'MS-DOS'
12A05F3A1 Extract Zip Spec      2D '4.5'
12A05F3A2 Extract OS            00 'MS-DOS'
12A05F3A3 General Purpose Flag  0000
12A05F3A5 Compression Method    0000 'Stored'
12A05F3A7 Last Mod Time         58DCB5CE 'Fri Jun 28 22:46:28 2024'
12A05F3AB CRC                   BB63C172
12A05F3AF Compressed Length     000000E5
12A05F3B3 Uncompressed Length   000000E5
12A05F3B7 Filename Length       0008
12A05F3B9 Extra Length          000C
12A05F3BB Comment Length        0000
12A05F3BD Disk Start            0000
12A05F3BF Int File Attributes   0000
          [Bit 0]               0 'Binary Data'
12A05F3C1 Ext File Attributes   81B60000
12A05F3C5 Local Header Offset   FFFFFFFF
12A05F3C9 Filename              'test.TXT'
12A05F3D1 Extra ID #0001        0001 'ZIP64'
12A05F3D3   Length              0008
12A05F3D5   Offset to Central   000000012A05F240
          Dir

12A05F3DD ZIP64 END CENTRAL DIR 06064B50
          RECORD
12A05F3E1 Size of record        000000000000002C
12A05F3E9 Created Zip Spec      2D '4.5'
12A05F3EA Created OS            00 'MS-DOS'
12A05F3EB Extract Zip Spec      2D '4.5'
12A05F3EC Extract OS            00 'MS-DOS'
12A05F3ED Number of this disk   00000000
12A05F3F1 Central Dir Disk no   00000000
12A05F3F5 Entries in this disk  0000000000000002
12A05F3FD Total Entries         0000000000000002
12A05F405 Size of Central Dir   0000000000000092
12A05F40D Offset to Central dir 000000012A05F34B

12A05F415 ZIP64 END CENTRAL DIR 07064B50
          LOCATOR
12A05F419 Central Dir Disk no   00000000
12A05F41D Offset to Central dir 000000012A05F3DD
12A05F425 Total no of Disks     00000001

12A05F429 END CENTRAL HEADER    06054B50
12A05F42D Number of this disk   0000
12A05F42F Central Dir Disk no   0000
12A05F431 Entries in this disk  0002
12A05F433 Total Entries         0002
12A05F435 Size of Central Dir   00000092
12A05F439 Offset to Central Dir FFFFFFFF
12A05F43D Comment Length        0000

The zip -F command also warns about the mismatch:

zip --version
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
...
> zip -F outputs/test.zip  --out out2.zip
Fix archive (-F) - assume mostly intact archive
Zip entry offsets do not need adjusting
 copying: 5_gig_file.dat
 copying: test.TXT
        zip warning: Local Version Needed To Extract does not match CD: test.TXT

CPython versions tested on:

3.10, CPython main branch

Operating systems tested on:

Linux, Windows

Linked PRs

dtheyer commented 4 weeks ago

@AlexMaynesPointon and I also looked into a fix in https://github.com/python/cpython/pull/121172

danifus commented 4 weeks ago

Ah, sorry! I missed that you were creating a PR too. Since you did the hard work of finding and triaging I'm happy for your PR to go ahead of mine. Are you able to copy the tests from mine into your pull request?

This issue also affected the zip64 file count limit threshold which you could fix at the same time (there's a test in my branch for that). Also, the FileHeader(zip64) call takes a zip64 argument that indicates when the zip64 stuff needs to be written (currently just used if the compressed or uncompressed file size is larger than the individual file threshold is exceeded). I think using that existing machinery would be more straightforward than modifying the ZipInfo directly.

Let me know when you're ready and I'll do a review :)

AlexMaynesPointon commented 4 weeks ago

No worries, your changes look much more complete, I am just happy to see this fixed. Sorry for not correctly tagging the pull request, it is my first time trying to submit any pull requests. Since your changes cover everything I did and more, it makes more sense to just merge your pull request.

danifus commented 4 weeks ago

Sorry for the confusion. Thanks for the detailed bug report!