python / cpython

The Python programming language
https://www.python.org

inconsistent handling of duplicate ZipFile entries #117779

Open obfusk opened 5 months ago

obfusk commented 5 months ago

Bug report

Bug description:

Create a ZIP file with duplicate central directory entries pointing to the same local file header. Such files can be found in the wild (see e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1068705); the snippet below is just an easy way to create one for testing.

>>> import zipfile
>>> with zipfile.ZipFile("foo.zip", "w") as zf:
...     info = zipfile.ZipInfo(filename="foo")
...     zf.writestr(info, "FOO")
...     zf.filelist.append(info)

Opening the duplicate entry fails if using the name or the later entry in infolist(), but works using the earlier entry (since the later one is considered to overlap with the earlier one, but the earlier one isn't considered to overlap with another entry or the central directory).

>>> import zipfile
>>> zf = zipfile.ZipFile("foo.zip")
>>> zf.infolist()[0]
<ZipInfo filename='foo' filemode='?rw-------' file_size=3>
>>> zf.infolist()[1]
<ZipInfo filename='foo' filemode='?rw-------' file_size=3>
>>> zf.open("foo") # fails
zipfile.BadZipFile: Overlapped entries: 'foo' (possible zip bomb)
>>> zf.open(zf.infolist()[1]) # fails
zipfile.BadZipFile: Overlapped entries: 'foo' (possible zip bomb)
>>> zf.open(zf.infolist()[0]) # works fine
<zipfile.ZipExtFile name='foo' mode='r'>
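
A caller affected by this can sidestep the name lookup by selecting the earliest matching entry from infolist() directly. The following is only a minimal sketch of that workaround: open_first_entry and make_duplicate_zip are hypothetical helper names, not part of zipfile, and the sketch assumes the earlier entry stays readable, as observed above.

```python
import io
import zipfile


def open_first_entry(zf, name):
    """Open the earliest central directory entry called `name`.

    Hypothetical helper, not part of zipfile. It avoids zf.open(name),
    which resolves the name to the last duplicate entry.
    """
    for info in zf.infolist():
        if info.filename == name:
            return zf.open(info)
    raise KeyError(f"There is no item named {name!r} in the archive")


def make_duplicate_zip():
    """Build an in-memory ZIP with two central directory entries for 'foo'."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo(filename="foo")
        zf.writestr(info, "FOO")
        zf.filelist.append(info)  # same trick as in the report above
    buf.seek(0)
    return buf


if __name__ == "__main__":
    with zipfile.ZipFile(make_duplicate_zip()) as zf:
        with open_first_entry(zf, "foo") as fp:
            print(fp.read())
```

This leaves the overlap check in place for every other entry; it only changes which of the duplicate entries is opened.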

If I modify NameToInfo to contain the earlier entry instead, zf.open("foo") works fine. On the one hand, these ZIP files are broken. On the other hand, it would be easy to simply not overwrite existing entries in NameToInfo, allowing these files to be opened. And this affects real-world programs trying to open real-world files, so it could be considered a regression caused by #110016. Perhaps a warning would be in order when duplicates are detected; e.g. unzip shows an error but does extract the files.
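
To illustrate the kind of warning meant here, duplicates can be detected from user code by counting names in infolist(). This is a rough sketch under that assumption, not a proposed patch; duplicate_names and warn_on_duplicates are hypothetical names.

```python
import collections
import io
import warnings
import zipfile


def duplicate_names(zf):
    """Return the names that occur more than once in the central directory."""
    counts = collections.Counter(info.filename for info in zf.infolist())
    return sorted(name for name, count in counts.items() if count > 1)


def warn_on_duplicates(zf):
    """Emit one UserWarning per duplicated name and return those names."""
    names = duplicate_names(zf)
    for name in names:
        warnings.warn(f"Duplicate entry in archive: {name!r}", stacklevel=2)
    return names


if __name__ == "__main__":
    # Build a test archive the same way as in the report above.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo(filename="foo")
        zf.writestr(info, "FOO")
        zf.filelist.append(info)
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        print(warn_on_duplicates(zf))
```

A check like this only reports repeated names; it does not tell whether the duplicates point to the same local file header or to different ones.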

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux

h01ger commented 5 months ago

obfusk: thank you for also filing a bug here!

raininja commented 2 weeks ago

This is still evident, and rather annoying!

[ 18/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   3
[ 19/291] Writing tensor blk.1.ffn_norm.weight                  | size   4096           | type F32  | T+   3
[ 20/291] Writing tensor blk.2.attn_q.weight                    | size   4096 x   4096  | type F32  | T+   3
[ 21/291] Writing tensor blk.2.attn_k.weight                    | size   4096 x   4096  | type F32  | T+   3
[ 22/291] Writing tensor blk.2.attn_v.weight                    | size   4096 x   4096  | type F32  | T+   3
[ 23/291] Writing tensor blk.2.attn_output.weight               | size   4096 x   4096  | type F32  | T+   3
Traceback (most recent call last):
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 1219, in <module>
    main()
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 1214, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab, concurrency = args.concurrency, endianess=endianess)
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 941, in write_all
    for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 781, in bounded_parallel_map
    result = futures.pop(0).result()
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 905, in do_item
    tensor = lazy_tensor.load().to_ggml()
             ^^^^^^^^^^^^^^^^^^
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 510, in load
    ret = self._load()
          ^^^^^^^^^^^^
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 520, in load
    return self.load().astype(data_type)
           ^^^^^^^^^^^
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 510, in load
    ret = self._load()
          ^^^^^^^^^^^^
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 668, in load
    return UnquantizedTensor(storage.load(storage_offset, elm_count).reshape(size))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raijin/aur/powerinfer/PowerInfer/convert-dense.py", line 652, in load
    fp = self.zip_file.open(info)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/zipfile/__init__.py", line 1652, in open
    raise BadZipFile(f"Overlapped entries: {zinfo.orig_filename!r} (possible zip bomb)")
zipfile.BadZipFile: Overlapped entries: 'pytorch_model-00001-of-00003/data/23' (possible zip bomb)

This conversion works with Python 3.10.4:

[290/291] Writing tensor output_norm.weight                     | size   4096           | type F32  | T+ 110
[291/291] Writing tensor output.weight                          | size  32000 x   4096  | type F32  | T+ 110
Wrote /home/raijin/workspace/model_repo/ReluLLaMA-7B-PowerInfer-GGUF/ReluLLaMA-7B.powerinfer.gguf