pypa / sampleproject

A sample project that exists for PyPUG's "Tutorial on Packaging and Distributing Projects"
https://packaging.python.org/tutorials/packaging-projects/
MIT License
5.13k stars 1.73k forks source link

Sample build breaks if LICENSE.txt file contains GPL2 license text (containing 0x0c) #209

Open osamuaoki opened 9 months ago

osamuaoki commented 9 months ago

Hi,

Thank you for providing very useful resource.

NOTE: I am not raising issue of the license terms of this sampleproject like #17 since this is MIT licensed.

ISSUE

Issue is supportive technical information to adopt this example to the packaging of existing GPL programs.

If I simply replace LICENSE.txt content with the LICENSE file content of GPL license, the resulting wheel package failed to upload to testpypi.org.

 $ twine upload --verbose --repository testpypi dist/*
...
ERROR    HTTPError: 400 Bad Request from https://test.pypi.org/legacy/
         The description failed to render in the default format of reStructuredText. See https://test.pypi.org/help/#description-content-type for more information.

The local test result of such package goes:

 $ python3 -m build;twine check dist/*
 ...
Successfully built sampleproject_osamu_debian_org-3.2.0.tar.gz and sampleproject_osamu_debian_org-3.2.0-py3-none-any.whl
Checking dist/sampleproject_osamu_debian_org-3.2.0-py3-none-any.whl: FAILED
ERROR    `long_description` has syntax errors in markup and would not be rendered on PyPI.                                                                                   
         line 2: Warning: Block quote ends without a blank line; unexpected unindent.                                                                                        
WARNING  `long_description_content_type` missing. defaulting to `text/x-rst`.                                                                                                
Checking dist/sampleproject_osamu_debian_org-3.2.0.tar.gz: PASSED

WORKAROUND

In order to avoid this ERROR to successfully transition to the new pyproject.toml, I need to do the followings on LICENSE.txt file

and live with WARNING or use the last option of commenting out license = { file = "LICENSE.txt"} in pyproject.toml to stop including LICENSE.txt into METADATA file in the wheel package.

I am not sure what is the best solution here.

Desired situation

At least some pointer to the best solution for resolving this situation needs to be easily available to the prospectus users.

I think the build tool should be smart enough to wrap/escape problematic character sequence in the LICENSE.txt to avoid causing issues. But that is not the case now. Requiring to edit LICENSE.txt manually to avoid ERROR seems to be bad idea.

Dropping license = "LICENSE.txt" in pyproject.toml seems to be most practical solution for now. You still get to include unmodified LICENSE into wheel package.

Maybe, it may be a good idea to add link to the root cause bug until it is fixed.

Additional note

As I checked https://github.com/pypa/build , its pyproject.toml has license.file = "LICENSE" but the resulting METADATA doesn't contain it. This build is built by flint-core while the sampleproject uses requires = ["setuptools>=43.0.0", "wheel"] as the build backend.

FYI: I post my packaging experience around this at: https://osamuaoki.github.io/en/2024/02/03/python-pypi/

abravalheri commented 9 months ago

~Hi @osamuaoki , have you tried setting the readme.content-type to text/plain?~

Sorry, wrong hint: this does not seem to solve the problem...

It is probably a "folding" issue with the syntax of the METADATA file. The effect of license.file is to embedded all the text from the file into METADATA. Omitting it relies on Setuptools using the standard rules to add the license file into the wheel as a separated file.

osamuaoki commented 9 months ago

Yah, I was thinking along the same line. Despite the upload ERROR message initially gave me an impression, the real issue seems to be inclusion of LICENCE and its data conversion to fit it into METADATA.

setuptool seems to accept only ReST compatible file. (Just impression. No serious test is done yet.)

flint seems to ignore this license entry in pyproject.toml.

According to my vague memory, I tried something along:

license = { file="LICENSE.txt", content-type="text/plane" }

It didn't fix this issue.

I don't know if this is bug on build tool side or not. Important thing is to inform user for practical workaround for the moment and someone who understand situation to ask other packages to work with this situation etc.

jeanas commented 9 months ago

I couldn't reproduce. After replacing the contents of LICENSE.txt with the GPL text, I still get

$ twine check *
Checking sampleproject-3.0.0-py3-none-any.whl: PASSED
Checking sampleproject-3.0.0.tar.gz: PASSED
abravalheri commented 9 months ago

I cannot reproduce it either... This is what I did in detail:

# docker run --rm -it python:3.12-bookworm /bin/bash
cd /tmp
git clone https://github.com/pypa/sampleproject
cd sampleproject
wget https://www.gnu.org/licenses/gpl-3.0.txt -O LICENSE.txt
sed -i 's/name = "sampleproject"/name = "sampleproject-abc"/g' pyproject.toml

python -m pip install -U pip build twine

python -m build
twine check dist/*
# Checking dist/sampleproject-3.0.0-py3-none-any.whl: PASSED
# Checking dist/sampleproject-3.0.0.tar.gz: PASSED

twine upload --verbose --repository testpypi dist/*
# ...
# INFO     Response from https://test.pypi.org/legacy/:
#         200 OK
# ...
# View at:
# https://test.pypi.org/project/sampleproject-abc/3.0.0/

unzip -p dist/sampleproject_abc-3.0.0-py3-none-any.whl sampleproject_abc-3.0.0.dist-info/METADATA > /tmp/METADATA.orig
head -n 20 /tmp/METADATA.orig
# Metadata-Version: 2.1
# Name: sampleproject-abc
# Version: 3.0.0
# Summary: A sample Python project
# Author-email: "A. Random Developer" <author@example.com>
# Maintainer-email: "A. Great Maintainer" <maintainer@example.com>
# License: GNU GENERAL PUBLIC LICENSE
#                                Version 3, 29 June 2007
#
#          Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
#          Everyone is permitted to copy and distribute verbatim copies
#          of this license document, but changing it is not allowed.
#
#                                     Preamble
#
#           The GNU General Public License is a free, copyleft license for
#         software and other kinds of works.
#
#           The licenses for most software and other practical works are designed
#         to take away your freedom to share and change the works.  By contrast,

wget https://test-files.pythonhosted.org/packages/c0/c0/7b07efedcdcfab3c2700ce7c23d7a737b4b2419f5917f43993c8635f320e/sampleproject_abc-3.0.0-py3-none-any.whl -O /tmp/sampleproject_abc-3.0.0-py3-none-any.whl
unzip -p /tmp/sampleproject_abc-3.0.0-py3-none-any.whl sampleproject_abc-3.0.0.dist-info/METADATA > /tmp/METADATA.download
diff /tmp/METADATA.*
# => no difference

# Further investigate if the escaping works well
# for the "email message" format used in METADATA:
cat <<EOF > /tmp/test-metadata.py
import sys
import email

with open(sys.argv[1], "rb") as fp:
    message = email.message_from_binary_file(fp)

for field, text in (
    ("license", message["License"]),
    ("readme", message.get_payload()),
):
    print(f"------- {field} --------")
    lines = text.splitlines(keepends=True)
    print("".join(lines[:3]))
    print("...")
    print("".join(lines[-3:]))
EOF

python /tmp/test-metadata.py /tmp/METADATA.orig
# ------- license --------
# GNU GENERAL PUBLIC LICENSE
#                                Version 3, 29 June 2007
#
#
# ...
#         Public License instead of this License.  But first, please read
#         <https://www.gnu.org/licenses/why-not-lgpl.html>.
#
# ------- readme --------
# # A sample Python project
#
# ![Python Logo](https://www.python.org/static/community_logos/python-logo.png "Sample inline image")
#
# ...
# [rst]: http://docutils.sourceforge.net/rst.html
# [md]: https://tools.ietf.org/html/rfc7764#section-3.5 "CommonMark variant"
# [md use]: https://packaging.python.org/specifications/core-metadata/#description-content-type-optional

# ---- Eventhing seems to work well in the METADATA file ----

Anyway, I think this repository is not the best place to be discussing this issue. Once the author has a complete minimal reproducer, the following possibilities might be investigated:

  1. Escaping problem, to be reported/investigated in the pypa/setuptools project (later, after the investigation is complete, it might also involve pypa/distutils and/or pypa/wheel).
  2. Validation problem, to be reported/investigated in the pypi/warehouse project.
  3. Difference between twine check and the checks in pypi, to be discussed in the pypa/twine and/or pypi/warehouse projects.

However it is important to have a clear idea what item(s) in the list above is actually happening, before opening any new issue; and verify them with solid reproducers.

Please feel free to use the quick script in the example above, to check if the METADATA file is malformed... It is by no means an exhaustive check though but maybe when added to twine check is enough?

osamuaoki commented 9 months ago

I should have been more explicit which LICENSE text I was using. I am using good old GPL2 text as https://github.com/osamuaoki/imediff/blob/main/LICENSE and renamed it to LICENSE.txt . This file is causing problem.

With your test case with GPL3 text, I also get no problem on Debian/stable:

$ wget https://www.gnu.org/licenses/gpl-3.0.txt -O LICENSE.txt
$ python3 -m build
$ twine check dist/*
Checking dist/sampleproject-3.0.0-py3-none-any.whl: PASSED
Checking dist/sampleproject-3.0.0.tar.gz: PASSED

So the problem is specific to the GPL2 text. Quotation styles are different.

My GPL2 file uses `foo'
Your GPL3 file uses "foo"

FYI I am using Debian 12/ stable as the base. This means Python3.11. (I also used the latest PIP packages under venv to be current)

abravalheri commented 9 months ago

Thank you for the reproducer. This seems to be the same as https://github.com/pypa/setuptools/issues/4033. I would say the root of the problem is the \x0c characters in the license text. Isn't it?

Quoting Paul Moore in the thread I linked above:

This is because neither the metadata spec nor the email standards really say what to do with control characters in header values, and the stdlib email package appears to handle it inconsistently.

The escaping goes wrong when pypa/wheel tries to regenerated the metadata file by ingesting the file that pypa/setuptools generates. Per se, the file generated in pypa/setuptools parses correctly, but after being re-generated by pypa/wheel, it does not.

Recently I submitted a PR to pypa/distutils that will make the life of pypa/wheel easier, and prevent this round-trip problem from happening all togheter. However pypa/distutils has not been merged into pypa/setuptools yet.

If the problem is indeed \x0c, I recommend closing this issue since the problem is already being tracked in https://github.com/pypa/setuptools/issues/4033.

As a workaround you can replace all the \x0c characters in your license file with a \n, or simply remove out the licence = ... in your pyproject.toml as you have previously suggested (the license file will still be included in the wheel as a standalone file, so you are still distributing the license).

pfmoore commented 9 months ago

I'm confused. The reported error is:

ERROR    `long_description` has syntax errors in markup and would not be rendered on PyPI.                                                                                   
         line 2: Warning: Block quote ends without a blank line; unexpected unindent.                                                                                        
WARNING  `long_description_content_type` missing. defaulting to `text/x-rst`.

That's about the README, not the license, and it's saying that the README isn't valid Restructured text. It looks like whatever changes the OP made to the readme (or to the filename of the readme, because if it were still a .md file it would be detected as Markdown) are the problem here.

osamuaoki commented 9 months ago

I can confirm the problematic content is in the LICENSE text not in README.

Good old FormFeed aka. FF (0x0c) in LICENSE was the real root cause. By removing them, I get clean result with no ERROR nor WARNING. So the problem is in setuptools.

As for the unhelpful ERROR message which seems to imply the problem reside in README.md or content-type setting... I don't know how exactly it happens to say this but this error report message has some room to be improved to lead people to the root cause.

abravalheri commented 9 months ago

I'm confused. The reported error is:

Hi @pfmoore, I am trying to summarise my thought process below. Hopefully it will improve the understanding 🤞.

When the stdlib's email functions are used to read the METADATA files, they don't care if there is a \xc0 character... It seems to deal with it very well[^1]. So that is why the PKG-INFO inside tar.gz sdist parses correctly.

However, when pypa/wheel regenerates the METADATA file from PKG-INFO it replaces the \xc0 character with a \n. If you happen to have in your license file a \n\xc0 or \xc0\n sequence, this means that pypa/wheel will generate a METADATA file with \n\n.

Now, in the email message format, the \n\n sequence is used to mark the end of the headers section and the beginning of the body/payload. In the packaging standard, we use the body/payload of the email message for the README/long_description.

So the error message is complaining about the README/long_description, because part of the LICENSE file is incorrectly being considered as the email message body/payload.

This is the same error discussed in pypa/setuptools#4033, for which pypa/distutils#213 should provide a good workaround (we have to wait until it lands on setuptools).

[^1]: And/or at least the functions used for parsing in twine/PyPI.

pfmoore commented 9 months ago

Ah, I see. Thanks - that's a lot more subtle than I realised.

osamuaoki commented 9 months ago

Hi, looking at https://github.com/pypa/setuptools/issues/4033 , problematic control codes are:

 '0x0b': ('\x0b', ['', '']), 
 '0x0c': ('\x0c', ['', '']),
 '0x1c': ('\x1c', ['', '']),
 '0x1d': ('\x1d', ['', '']),
 '0x1e': ('\x1e', ['', '']),

In other words

Oct   Dec   Hex   Char                     
─────────────────────────────
013   11    0B    VT  '\v' (vertical tab)  Ctrl-K
014   12    0C    FF  '\f' (form feed)     Ctrl-L
034   28    1C    FS  (file separator)     Ctrl-\
035   29    1D    GS  (group separator)    Ctrl-]
036   30    1E    RS  (record separator)   Ctrl-^

These needs to be removed from the original LICENSE file to make source code robust and portable to older platforms.

As for the following text in pyproject.toml:

# This is either text indicating the license for the distribution, or a file
# that contains the license
# https://packaging.python.org/en/latest/specifications/core-metadata/#license
license = {file = "LICENSE.txt"}

I suggest updating this section with:

# This is either text indicating the license for the distribution, or a file
# that contains the license
# For the better compatibility with the older tool chain, please remove control
# characters (0x0b=^K, 0x0c=^L, 0x1c=^\, 0x1d=^], 0x1e=^^) in the` LICENSE.txt`
# file if they exist.  See https://github.com/pypa/setuptools/issues/4033  .
# https://packaging.python.org/en/latest/specifications/core-metadata/#license
license = {file = "LICENSE.txt"}
jeanas commented 9 months ago

IMHO, we should just fix that bug and not document the workaround, because pip and build use build isolation by default and most people never change that setting, which should fix the problem without people needing to upgrade their toolchain as soon as a new release of wheel is out.

osamuaoki commented 9 months ago

I see your point and agree.

Maybe leave this issue open until the new release of tool chain.