pypa / twine

Utilities for interacting with PyPI
https://twine.readthedocs.io/
Apache License 2.0
1.6k stars 307 forks source link

Twine check tries to open a .tar.gz file as a zip leading to errors #1160

Open mxmlnkn opened 3 days ago

mxmlnkn commented 3 days ago

Is there an existing issue for this?

What keywords did you use to search existing issues?

.tar.gz zip tarball is_zipfile

What operating system are you using?

Linux

If you selected 'Other', describe your Operating System here

No response

What version of Python are you running?

$ python3 --versino
Python 3.12.3

How did you install twine? Did you use your operating system's package manager or pip or something else?

$ pip install twine

What version of twine do you have installed (include the complete output)

$ twine --version
twine version 5.1.1 (importlib-metadata: 4.12.0, keyring: 24.3.1, pkginfo: 1.10.0, requests: 2.31.0, requests-toolbelt: 1.0.0, urllib3: 2.0.7)

Which package repository are you using?

pypi.org

Please describe the issue that you are experiencing

twine check fails for a tarball generated with build. This happens randomly.

Checking dist/ratarmount-0.15.2-py3-none-any.whl: PASSED
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.20/x64/bin/twine", line 8, in <module>
    sys.exit(main())
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/twine/__main__.py", line 33, in main
    error = cli.dispatch(sys.argv[1:])
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/twine/cli.py", line 123, in dispatch
    return main(args.args)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/twine/commands/check.py", line 194, in main
    return check(parsed_args.dists, strict=parsed_args.strict)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/twine/commands/check.py", line 140, in check
    warnings, is_ok = _check_file(filename, render_warning_stream)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/twine/commands/check.py", line 84, in _check_file
    package = package_file.PackageFile.from_filename(filename, comment=None)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/twine/package.py", line 98, in from_filename
    meta = DIST_TYPES[dtype](filename)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pkginfo/sdist.py", line 13, in __init__
    self.extractMetadata()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pkginfo/distribution.py", line 126, in extractMetadata
    data = self.read()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pkginfo/sdist.py", line 40, in read
    archive, names, read_file = self._get_archive(fqn)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pkginfo/sdist.py", line 21, in _get_archive
    archive = zipfile.ZipFile(fqn)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/zipfile.py", line 1268, in __init__
    self._RealGetContents()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/zipfile.py", line 1353, in _RealGetContents
    fp.seek(self.start_dir, 0)
OSError: [Errno 22] Invalid argument
Checking dist/ratarmount-0.15.2.tar.gz: 
Checking dist/ratarmount-0.15.2-py3-none-any.whl: PASSED
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/bin/twine", line 8, in <module>
    sys.exit(main())
             ~~~~^^
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/twine/__main__.py", line 33, in main
    error = cli.dispatch(sys.argv[1:])
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/twine/cli.py", line 123, in dispatch
    return main(args.args)
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/twine/commands/check.py", line 194, in main
    return check(parsed_args.dists, strict=parsed_args.strict)
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/twine/commands/check.py", line 140, in check
    warnings, is_ok = _check_file(filename, render_warning_stream)
                      ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/twine/commands/check.py", line 84, in _check_file
    package = package_file.PackageFile.from_filename(filename, comment=None)
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/twine/package.py", line 98, in from_filename
    meta = DIST_TYPES[dtype](filename)
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/pkginfo/sdist.py", line 13, in __init__
    self.extractMetadata()
    ~~~~~~~~~~~~~~~~~~~~^^
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/pkginfo/distribution.py", line 126, in extractMetadata
    data = self.read()
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/pkginfo/sdist.py", line 40, in read
    archive, names, read_file = self._get_archive(fqn)
                                ~~~~~~~~~~~~~~~~~^^^^^
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/site-packages/pkginfo/sdist.py", line 21, in _get_archive
    archive = zipfile.ZipFile(fqn)
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/zipfile/__init__.py", line 1380, in __init__
    self._RealGetContents()
    ~~~~~~~~~~~~~~~~~~~~~^^
  File "/opt/hostedtoolcache/Python/3.13.0-rc.3/x64/lib/python3.13/zipfile/__init__.py", line 1466, in _RealGetContents
    raise BadZipFile("Bad offset for central directory")
zipfile.BadZipFile: Bad offset for central directory
Checking dist/ratarmount-0.15.2.tar.gz:
##[error]Process completed with exit code 1.

Please list the steps required to reproduce this behaviour

Rune twine check on the attached broken-tarball.gz.

broken-tarball.tar.gz

The Python version does not seem to matter. Platform also does not seem to matter. I have observed these errors in my CI on MacOS, Ubuntu, Python 3.8, Python 3.9, Python 3.12, Python 3.13.

If you really want it fully reproducible, see my CI setup, e.g., in this failing commit.

Anything else you'd like to mention?

I have found out that the underlying problem is with CPython's zipfile.is_zipfile being to lax and my cosmic bad luck to suddenly end up with more or less reproducible tarballs that make zipfile.is_zipfile return True even though it is not a valid ZIP file.

However, I doubt that this gets "fixed" in CPython therefore it would be nice to see this being worked around in twine.

Currently, I have this ugly hack to work around it in my CI:

tarball=$( find dist -name '*.tar.gz' )
if python3 -c 'import sys, zipfile; sys.exit(0 if zipfile.is_zipfile(sys.argv[1]) else 1)' "$tarball"; then
    gzip -c -d "$tarball" > "$tarball.tar"
    for (( i=9; i>0; --i )); do
        cat "$tarball.tar" | gzip -$i > "$tarball"
        if ! python3 -c 'import sys, zipfile; sys.exit(0 if zipfile.is_zipfile(sys.argv[1]) else 1)' "$tarball"
        then break; fi
    done
fi

One possible solution would be to make twine smarter about ZIP detection, e.g., it could use try-except while trying to read one member form the ZIP, which should trigger the errors I am seeing in my CI. Or, it could also use the file suffix for arbitration in case that both "is tarball" and "is zip file" return true.

Personally, I still see this as a structural problem in CPython or even in the ZIP format itself, which probably does not get fixed any time soon:

This makes me fear that there might be other tooling that might fail with similar errors, e.g., during the pip install process and therefore it might not be a good idea to upload such tarballs to PyPI (if it even accepts such a tarball. maybe it gets rejected because it also runs twine check on it). Therefore, my first instinct would to make pypa/build work around this issue somehow by producing tarballs that do not return true for zipfile.is_zipfile, but it seems that they do not want to do this.

It might also be a satisfiable solution to simple print a better error message with suggestions how to work around this issue. That would have saved me 3+ hours of debugging my CI.

layday commented 3 days ago

Therefore, my first instinct would to make pypa/build work around this issue somehow by producing tarballs that do not return true for zipfile.is_zipfile, but it seems that they do not want to do this.

To clarify, build does not produce tarballs at all. You can think of build as something of an orchestrator - it calls out to the build system in your pyproject.toml using a well-defined interface. The build system generates the sdist (setuptools in this case).

mxmlnkn commented 3 days ago

Therefore, my first instinct would to make pypa/build work around this issue somehow by producing tarballs that do not return true for zipfile.is_zipfile, but it seems that they do not want to do this.

To clarify, build does not produce tarballs at all. You can think of build as something of an orchestrator - it calls out to the build system in your pyproject.toml using a well-defined interface. The build system generates the sdist (setuptools in this case).

Thanks for the clarification. Unfortunately, I was not fully aware of this because Python build systems feel fractured and confusing to me. I guess I could also try to ask at pypa/setuptools about this then...

layday commented 3 days ago

Sdists have been standardised as tarballs and no modern builder supports producing zip sdists. I don't know if PyPI still allows the upload of zip sdists or if Twine would like to drop support for them in check. Perhaps if is_tarfile is more reliable, the order of operations could be reversed in pkginfo.

I'm not sure how easy it'd be for setuptools to guarantee that tarballs aren't identifiable as zipfiles, and unless setuptools is doing something super exotic, all builders are probably susceptible to producing misidentifiable tarballs.

sigmavirus24 commented 1 day ago

Potentially not any longer but when Twine started it allowed a source distribution (assistsdist) and a source archive which were two very different kinds of artifacts. They were effectively mutually exclusive at some point. The zip archive handling is a result of that behavior. I haven't checked to see if the source archive is still supported but it's okay to drop that support if pypi no longer allows it

layday commented 1 day ago

:wave: @di, would you know if PyPI still allows zip archives?

sigmavirus24 commented 22 hours ago

:wave: @di, would you know if PyPI still allows zip archives?

PyPI is one repository that Twine attempts to support. Others may still require this using repositories other than PyPI that attempt to behave like PyPI. We have no way of knowing if those folks are also using twine check or expecting it to check the archive metadata.

My recollection is that even 6 or so years ago (I don't remember when I started working on twine) source archives were very rare but still used.

I would hope some PyPI metrics could help close the door here but again it's likely a change needing a major version to let others know that support is being dropped.

mxmlnkn commented 1 hour ago

I tried to create a pull request that simply swaps the is_tarfile / is_zipfile order, only to notice that this is even further down in the software stack in pkginfo. I opened an issue with an attached patch here.