Open yurinnick opened 1 year ago
What's the output when using a valid build-system section, and what warnings get shown when using build to make the wheel?
Add build-system
section:
...
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"
Install and build the wheel with broken license file:
$ pip install -e .
Obtaining file:///home/yurinnick/Documents/test
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: unicode-bug
Building editable for unicode-bug (pyproject.toml) ... done
Created wheel for unicode-bug: filename=unicode_bug-0.1.0-0.editable-py3-none-any.whl size=2922 sha256=4277b75c0ebbf7978dc5a79089f9d5b8cce84d2179e10aa837c2069139ff977d
Stored in directory: /tmp/pip-ephem-wheel-cache-6sqe4pu4/wheels/5e/bb/78/14fd4ec013baed754dd266aec43868a061011260e628caafc4
Successfully built unicode-bug
Installing collected packages: unicode-bug
Attempting uninstall: unicode-bug
Found existing installation: unicode-bug 0.1.0
Uninstalling unicode-bug-0.1.0:
Successfully uninstalled unicode-bug-0.1.0
Successfully installed unicode-bug-0.1.0
$ pip wheel .
Processing /home/yurinnick/Documents/test
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unicode-bug
Building wheel for unicode-bug (pyproject.toml) ... done
Created wheel for unicode-bug: filename=unicode_bug-0.1.0-py3-none-any.whl size=1425 sha256=ff6fb74aa5124491036b8ab8f8788910f7a6ba9b5d65285db8568f45e99975a9
Stored in directory: /tmp/pip-ephem-wheel-cache-mk_5fqz9/wheels/5e/bb/78/14fd4ec013baed754dd266aec43868a061011260e628caafc4
Successfully built unicode-bug
So no new warnings or errors as far as I can tell.
I meant the build package,which has easier opt out of build output hiding
python -m build
yields this output: https://gist.github.com/yurinnick/bf06c9d0f97f9c361b38038f5fe8e072
FWIW, omitting build-system
is definitely valid, and passing -v
to pip will present the build output. There is no need to use build
for it.
Works fine for me on Windows:
❯ pip install -vvv -e ".[dev]"
Using pip 23.2.1 from C:\Work\Scratch\foof\.venv\Lib\site-packages\pip (python 3.11)
Non-user install because user site-packages disabled
...
Successfully installed certifi-2023.7.22 charset-normalizer-3.2.0 colorama-0.4.6 idna-3.4 iniconfig-2.0.0 packaging-23.1 pluggy-1.3.0 pytest-7.4.0 requests-2.31.0 unicode-bug-0.1.0 urllib3-2.0.4
Remote version of pip: 23.2.1
Local version of pip: 23.2.1
Was pip installed by pip? True
Removed build tracker: 'C:\\Users\\xxx\\AppData\\Local\\Temp\\pip-build-tracker-7gbkreko'
❯ pip list
Package Version Editable project location
------------------ --------- -------------------------
certifi 2023.7.22
charset-normalizer 3.2.0
colorama 0.4.6
idna 3.4
iniconfig 2.0.0
packaging 23.1
pip 23.2.1
pluggy 1.3.0
pytest 7.4.0
requests 2.31.0
unicode-bug 0.1.0 C:\Work\Scratch\foof
urllib3 2.0.4
(note pytest is installed, so the dev
extra was recognised).
Verbose pip install output: https://gist.github.com/yurinnick/9128d68e5f9be9b0b50362122d79346c
I did some research and my findings so far:
pip._internal.metadata.importlib._dists.Distribution
classDistribution
loaded from unicode_bug.egg_info
directory has valid metadata [1]Discribution
loaded from temporary /tmp/pip-modern-metadata-0q88s1hj/unicode_bug-0.1.0.dist-info
has invalid metadata [2][1]
>>> d.metadata.values()
['2.1', 'unicode-bug', '0.1.0', '\n ', '>=3.10', 'dev', 'LICENSE', 'requests', 'pytest ; extra == "dev"']
>>> d.metadata.keys()
['Metadata-Version', 'Name', 'Version', 'License', 'Requires-Python', 'Provides-Extra', 'License-File', 'Requires-Dist', 'Requires-Dist']
[2]
[('Metadata-Version', '2.1'), ('Name', 'unicode-bug'), ('Version', '0.1.0'), ('License', ''), ('Description', " \nRequires-Python: >=3.10\nLicense-File: LICENSE\nRequires-Dist: requests\nProvides-Extra: dev\nRequires-Dist: pytest ; extra == 'dev'\n\n")]
This sounds like a setuptools bug to me. The dist-info directory in the temporary directory is generated by the build backend (setuptools) and pip only reads the result.
This is definitely a setuptools bug. Transferring the issue over. :)
Hi @yurinnick thank you very much for reporting, I will have a look on this.
Distribution loaded from unicode_bug.egg_info directory has valid metadata [1]\nDiscribution loaded from temporary /tmp/pip-modern-metadata-0q88s1hj/unicode_bug-0.1.0.dist-info has invalid metadata [2]
This is interesting. Is the PKG-INFO file in the top of the sdist OK? I had a quick view in the code and it seems that we do use UTF-8 to read the file pointed out in pyproject.toml
. And then we use UTF-8 again to write PKG-INFO.
One thing that might be related is that setuptools produces PKG-INFO
files and those are later transformed into METADATA
by pypa/wheel
. But then pypa/wheel
also seems to deal with the files using UTF-8.
The code contains some places that we use email.message_from_file
(via distutils) to read PKG-INFO
when building from the sdist
, this might be where UTF-8 is not explicitly used. But I am not sure this is used, I need to investigate better.
If you build the project skipping the sdist
step (e.g. python -m build --wheel
) is the metadata file still ill-formed?
PKG-INFO
in egg-info
directory has the unicode character, but gets parsed correctly.
LICENSE
file from dist-info
directory also contains unicode, but METADATA
file doesn't.
Here are the files in the hex form: https://gist.github.com/yurinnick/94fc3d9b19e0bf6270f7dd332f4b0e61
One thing confuses me. You keep referring to a "unicode character", but the reproducer you gave (which, as I said, works fine for me on Windows) uses the character \U000b
, which is a perfectly ordinary ASCII character - 'LINE TABULATION' (U+000B).
Is the problem here caused by the fact that this character is whitespace, and as such is getting stripped at some point due to something removing trailing whitespace from lines?
The one in the license I believe is \U000c
, I may mixed something up during my experiments though. I am not knowledgeable in encodings, so I assumed it's a unicode symbol: https://www.fileformat.info/info/unicode/char/000c/index.htm
I added some debug statements to see when the transformation is happening and it seems to come from https://github.com/pypa/wheel/blob/0.41.2/src/wheel/bdist_wheel.py#L587:
...
serialization_policy = EmailPolicy( # email.policy.EmailPolicy
utf8=True,
mangle_from_=False,
max_line_length=0,
)
with open(pkg_info_path, "w", encoding="utf-8") as out:
Generator(out, policy=serialization_policy).flatten(pkg_info) # email.generator.Generator
...
This is how I obtained the info:
And this is the output (when I run python -m build --no-isolation
[^2]):
...
*** __file__='/tmp/myproj/.venv/lib64/python3.10/site-packages/wheel/bdist_wheel.py' sys._getframe().f_code.co_name='egg2dist' ***
---------------------PKG-INFO-------------------------->>
b'Metadata-Version: 2.1\nName: unicode-bug\nVersion: 0.1.0\nLicense: \x0b\n \nRequires-Python: >=3.10\nProvides-Extra: dev\nLicense-File: LICENSE\n'
<<---------------------PKG-INFO--------------------------
---------------------LICENSE-------------------------->>
metadata['License']='\x0b\n '
<<---------------------LICENSE--------------------------
---------------------METADATA-------------------------->>
b"Metadata-Version: 2.1\nName: unicode-bug\nVersion: 0.1.0\nLicense: \n\n \nRequires-Python: >=3.10\nLicense-File: LICENSE\nRequires-Dist: requests\nProvides-Extra: dev\nRequires-Dist: pytest ; extra == 'dev'\n\n"
<<---------------------METADATA--------------------------
...
Now, I am not sure why stdlib's email.policy.EmailPolicy
is replacing the UTF-8 character if utf8=True
... Maybe that is part of the spec in the email message RFC? I am not sure how to tackle this.
Maybe it is worth to discuss with pypa/wheel
?
Do the pip
maintainers have any suggestions?
[^2]: I use --no-isolation
to force the "patched" version of wheel
to be used instead of build
downloading a new one.
Something very similar happens if you replace \U000b
with \U000c
.
Hang on. License-Path
isn't valid metadata. So presumably it's a setuptools-specific thing, and bdist_wheel is trying to make it into a valid value for the License
field?
It looks like bdist_wheel is failing to "properly"[^1] escape that value, resulting in a metadata file with an unescaped \n\n
in the license field, which is then treated as the end of the headers with everything following being dumped into the description (because it's now the message body).
[^1]: I have no idea what proper escaping would be for a multi-line value, it's not at all well-defined if I recall correctly.
The License-File
is left unprocessed (which is unspecified, but was implemented in setuptools by a contribution in the context of PEP 639 a few years ago[^1]).
The License
field is already originally present in the PKG-INFO
file (and it contains the UTF-8 char).
When project.licence.file
is given in pyproject.toml
, setuptools
will read the file (using UTF-8) and populate the License
core-metadata (that seems to be what is meant to happen by PEP 621).
[^1]: I initially thought it was a contribution of the original author of PEP 639, but I was mistaken.
@agronholm do you have any thoughts in the matter? (see https://github.com/pypa/setuptools/issues/4033#issuecomment-1698853282).
The escaping behaviour seems to be inherited by bdist_wheel
from the standard library, but I am not sure if it is intentional or accidental (in the stdlib, since utf8=True
is passed to the EmailPolicy
)
Ah. I missed that. Looks like it's translating \x0b to
\n` in that case.
Yes:
from email.policy import EmailPolicy
policy = EmailPolicy(utf8=True, mangle_from_=False, max_line_length=0)
print(repr(policy.fold("License", "a\x0bb\n")))
produces
❯ py .\em.py
'License: a\nb\n'
So that policy doesn't handle arbitrary control characters correctly...
It's worth noting that the specification is (deliberately) vague on how to serialise metadata to a file:
However, email formats have been revised several times, and exactly which email RFC applies to packaging metadata is not specified. In the absence of a precise definition, the practical standard is set by what the standard library email.parser module can parse using the compat32 policy.
So there's no formal answer to the question "how do we handle data like this?" - other than "don't do that, then" 🙁
The following is worth noting:
>>> m = email.message_from_string("Foo: a\x0b\x0bc\nBar: 12\n\nBody")
>>> m.as_string()
'Foo: a\nc\nBar: 12\n\nBody'
>>> m["Foo"]
'a\x0b\x0bc'
The email module can parse data where header values contain values like \x0b
, but it can't produce them. So it doesn't round-trip correctly.
One minor comment is that the compat32
policy will do a different escaping that could potentially work (I haven't tested)[^1]. The compat32
is however a bit problematic because it does not support UTF-8.
If we replace the EmailPolicy(utf8=True, ...)
with compat32
we are likely to start receiving reports of people's names/readmes not rendering properly on PyPI...
[^1]: It still removes \x0b
but it does not seem to add a second \n
:
print(repr(compat32.fold("License", '\x0b\n '))) => 'License: \n'
Like it or not, compat32 (for reading) is what the standard says. I omitted an additional part of the spec:
Whenever metadata is serialised to a byte stream (for example, to save to a file), strings must be serialised using the UTF-8 encoding.
And what does (mostly) work is to read the file using UTF-8 into a string, and then parse with email.message_from_string
.
>>> m = email.message_from_string("Name: André")
>>> m["Name"]
'André'
You can't round-trip like this, and the email module (and the spec!) gives you little or no help in writing such a metadata file, so I agree that bdist_wheel
has a problem with creating a file like this, but if you can create one, it's readable.
IMO, the temporary solution here is probably for setuptools to just refuse to allow control characters in a license file. The UTF-8 question is something of a red herring here, as the current behaviour with UTF-8 is fine, it's only control character handling that seems broken.
Longer term, license file handling will have to be dealt with as part of PEP 639, and if the problem of writing arbitrary user-entered text into metadata values other than license and description becomes a problem, someone will need to bite the bullet and propose a standard for how to encode such data (or we switch to a different metadata serialisation format, like JSON...)
So is there some metadata getting used by wheel that should be omitted?
@agronholm The problem is basically that if PKG-INFO
contains text like b"License: \x0b\n \n"
, bdist_wheel
isn't encoding it correctly (i.e., in a way that can be read back).
This is because neither the metadata spec nor the email standards really say what to do with control characters in header values, and the stdlib email
package appears to handle it inconsistently.
My recommendation is to reject characters like \x0b
- either just fail with a helpful error, or strip them out somehow.
Are newlines still allowed though?
This is because neither the metadata spec nor the email standards really say what to do with control characters in header values,
The current stdlib implementation uses .splitlines()
in the header value and then re-joins it. Some control characters will be interpreted as a newline:
>>> import itertools, pprint
>>> pprint.pprint({f"0x{i:02x}": (chr(i), f"{chr(i)}\n".splitlines()) for i in itertools.chain(range(0x1f), [0x7f])})
{'0x00': ('\x00', ['\x00']),
'0x01': ('\x01', ['\x01']),
'0x02': ('\x02', ['\x02']),
'0x03': ('\x03', ['\x03']),
'0x04': ('\x04', ['\x04']),
'0x05': ('\x05', ['\x05']),
'0x06': ('\x06', ['\x06']),
'0x07': ('\x07', ['\x07']),
'0x08': ('\x08', ['\x08']),
'0x09': ('\t', ['\t']),
'0x0a': ('\n', ['', '']),
'0x0b': ('\x0b', ['', '']),
'0x0c': ('\x0c', ['', '']),
'0x0d': ('\r', ['']),
'0x0e': ('\x0e', ['\x0e']),
'0x0f': ('\x0f', ['\x0f']),
'0x10': ('\x10', ['\x10']),
'0x11': ('\x11', ['\x11']),
'0x12': ('\x12', ['\x12']),
'0x13': ('\x13', ['\x13']),
'0x14': ('\x14', ['\x14']),
'0x15': ('\x15', ['\x15']),
'0x16': ('\x16', ['\x16']),
'0x17': ('\x17', ['\x17']),
'0x18': ('\x18', ['\x18']),
'0x19': ('\x19', ['\x19']),
'0x1a': ('\x1a', ['\x1a']),
'0x1b': ('\x1b', ['\x1b']),
'0x1c': ('\x1c', ['', '']),
'0x1d': ('\x1d', ['', '']),
'0x1e': ('\x1e', ['', '']),
'0x7f': ('\x7f', ['\x7f'])}
Ran into this issue; it's of note that some standard LICENSE texts will contain characters that end up being incorrectly escaped thus generating a PKG-INFO that's 'broken'.
Description
Some licenses, in this specific example, LGPL-2.1, contains Unicode characters. In case of LGPL 2.1 it is
U+000c
- Form Feed character.In
pyproject.toml
having a license file with those characters silently breaks package dependencies installation.Expected behavior
pip install -e .
installs package dependenciespip install -e .[dev]
installs optionaldev
dependencies grouppip version
23.2.1
Python version
3.10, 3.11
OS
Fedora Linux 38
How to Reproduce
Create
pyproject.toml
Create
LICENSE
file with unicode characterTry to install the package and optional dependencies - doesn't install dependencies
$ pip install -e .[dev] ... WARNING: unicode-bug 0.1.0 does not provide the extra 'dev' ...
$ echo "" > LICENSE
$ pip install -e . ... Successfully built unicode-bug Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests, unicode-bug Attempting uninstall: unicode-bug Found existing installation: unicode-bug 0.1.0 Uninstalling unicode-bug-0.1.0: Successfully uninstalled unicode-bug-0.1.0 Successfully installed certifi-2023.7.22 charset-normalizer-3.2.0 idna-3.4 requests-2.31.0 unicode-bug-0.1.0 urllib3-2.0.4
$ pip install -e .[dev] Successfully built unicode-bug Installing collected packages: pluggy, packaging, iniconfig, unicode-bug, pytest Attempting uninstall: unicode-bug Found existing installation: unicode-bug 0.1.0 Uninstalling unicode-bug-0.1.0: Successfully uninstalled unicode-bug-0.1.0 Successfully installed iniconfig-2.0.0 packaging-23.1 pluggy-1.3.0 pytest-7.4.0 unicode-bug-0.1.0