[BUG] `pyproject.toml`: Unicode character in a license file prevents dependencies from being installed

yurinnick commented 1 year ago

Description

Some licenses, in this specific example, LGPL-2.1, contains Unicode characters. In case of LGPL 2.1 it is U+000c - Form Feed character.

In pyproject.toml having a license file with those characters silently breaks package dependencies installation.

Expected behavior

pip install -e . installs package dependencies
pip install -e .[dev] installs optional dev dependencies group

pip version

23.2.1

Python version

3.10, 3.11

OS

Fedora Linux 38

How to Reproduce

Create pyproject.toml

[project]
name = "unicode-bug"
version = "0.1.0"
requires-python = ">=3.10"
license = {file = "LICENSE"}
dependencies = [
"requests"
]
[project.optional-dependencies]
dev = [
"pytest"
]

Create LICENSE file with unicode character
```
$ echo -e '\U000b' > LICENSE
```

Try to install the package and optional dependencies - doesn't install dependencies


$ pip install -e .
...
Successfully built unicode-bug
Installing collected packages: unicode-bug
Attempting uninstall: unicode-bug
Found existing installation: unicode-bug 0.1.0
Uninstalling unicode-bug-0.1.0:
  Successfully uninstalled unicode-bug-0.1.0
Successfully installed unicode-bug-0.1.0

$ pip install -e .[dev] ... WARNING: unicode-bug 0.1.0 does not provide the extra 'dev' ...


4. Remove unicode character

$ echo "" > LICENSE


3. Try to install the package  and optional dependencies - installation works

$ pip install -e . ... Successfully built unicode-bug Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests, unicode-bug Attempting uninstall: unicode-bug Found existing installation: unicode-bug 0.1.0 Uninstalling unicode-bug-0.1.0: Successfully uninstalled unicode-bug-0.1.0 Successfully installed certifi-2023.7.22 charset-normalizer-3.2.0 idna-3.4 requests-2.31.0 unicode-bug-0.1.0 urllib3-2.0.4

$ pip install -e .[dev] Successfully built unicode-bug Installing collected packages: pluggy, packaging, iniconfig, unicode-bug, pytest Attempting uninstall: unicode-bug Found existing installation: unicode-bug 0.1.0 Uninstalling unicode-bug-0.1.0: Successfully uninstalled unicode-bug-0.1.0 Successfully installed iniconfig-2.0.0 packaging-23.1 pluggy-1.3.0 pytest-7.4.0 unicode-bug-0.1.0



### Output

_No response_

### Code of Conduct

- [X] I agree to follow the [PSF Code of Conduct](https://www.python.org/psf/conduct/).

RonnyPfannschmidt commented 1 year ago

What's the output when using a valid build-system section, and what warnings get shown when using build to make the wheel?

yurinnick commented 1 year ago

Add build-system section:

...
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

Install and build the wheel with broken license file:

$ pip install -e .
Obtaining file:///home/yurinnick/Documents/test
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: unicode-bug
  Building editable for unicode-bug (pyproject.toml) ... done
  Created wheel for unicode-bug: filename=unicode_bug-0.1.0-0.editable-py3-none-any.whl size=2922 sha256=4277b75c0ebbf7978dc5a79089f9d5b8cce84d2179e10aa837c2069139ff977d
  Stored in directory: /tmp/pip-ephem-wheel-cache-6sqe4pu4/wheels/5e/bb/78/14fd4ec013baed754dd266aec43868a061011260e628caafc4
Successfully built unicode-bug
Installing collected packages: unicode-bug
  Attempting uninstall: unicode-bug
    Found existing installation: unicode-bug 0.1.0
    Uninstalling unicode-bug-0.1.0:
      Successfully uninstalled unicode-bug-0.1.0
Successfully installed unicode-bug-0.1.0

$ pip wheel .
Processing /home/yurinnick/Documents/test
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unicode-bug
  Building wheel for unicode-bug (pyproject.toml) ... done
  Created wheel for unicode-bug: filename=unicode_bug-0.1.0-py3-none-any.whl size=1425 sha256=ff6fb74aa5124491036b8ab8f8788910f7a6ba9b5d65285db8568f45e99975a9
  Stored in directory: /tmp/pip-ephem-wheel-cache-mk_5fqz9/wheels/5e/bb/78/14fd4ec013baed754dd266aec43868a061011260e628caafc4
Successfully built unicode-bug

So no new warnings or errors as far as I can tell.

RonnyPfannschmidt commented 1 year ago

I meant the build package,which has easier opt out of build output hiding

yurinnick commented 1 year ago

python -m build yields this output: https://gist.github.com/yurinnick/bf06c9d0f97f9c361b38038f5fe8e072

pradyunsg commented 1 year ago

FWIW, omitting build-system is definitely valid, and passing -v to pip will present the build output. There is no need to use build for it.

pfmoore commented 1 year ago

Works fine for me on Windows:

❯ pip install -vvv -e ".[dev]"
Using pip 23.2.1 from C:\Work\Scratch\foof\.venv\Lib\site-packages\pip (python 3.11)
Non-user install because user site-packages disabled
...
Successfully installed certifi-2023.7.22 charset-normalizer-3.2.0 colorama-0.4.6 idna-3.4 iniconfig-2.0.0 packaging-23.1 pluggy-1.3.0 pytest-7.4.0 requests-2.31.0 unicode-bug-0.1.0 urllib3-2.0.4
Remote version of pip: 23.2.1
Local version of pip:  23.2.1
Was pip installed by pip? True
Removed build tracker: 'C:\\Users\\xxx\\AppData\\Local\\Temp\\pip-build-tracker-7gbkreko'

❯ pip list
Package            Version   Editable project location
------------------ --------- -------------------------
certifi            2023.7.22
charset-normalizer 3.2.0
colorama           0.4.6
idna               3.4
iniconfig          2.0.0
packaging          23.1
pip                23.2.1
pluggy             1.3.0
pytest             7.4.0
requests           2.31.0
unicode-bug        0.1.0     C:\Work\Scratch\foof
urllib3            2.0.4

(note pytest is installed, so the dev extra was recognised).

yurinnick commented 1 year ago

Verbose pip install output: https://gist.github.com/yurinnick/9128d68e5f9be9b0b50362122d79346c

yurinnick commented 1 year ago

I did some research and my findings so far:

The problem should be somewhere in the metadata parsing logic in pip._internal.metadata.importlib._dists.Distribution class
Distribution loaded from unicode_bug.egg_info directory has valid metadata [1]
Discribution loaded from temporary /tmp/pip-modern-metadata-0q88s1hj/unicode_bug-0.1.0.dist-info has invalid metadata [2]

[1]

>>> d.metadata.values()
['2.1', 'unicode-bug', '0.1.0', '\n        ', '>=3.10', 'dev', 'LICENSE', 'requests', 'pytest ; extra == "dev"']
>>> d.metadata.keys()
['Metadata-Version', 'Name', 'Version', 'License', 'Requires-Python', 'Provides-Extra', 'License-File', 'Requires-Dist', 'Requires-Dist']

[2]

[('Metadata-Version', '2.1'), ('Name', 'unicode-bug'), ('Version', '0.1.0'), ('License', ''), ('Description', "        \nRequires-Python: >=3.10\nLicense-File: LICENSE\nRequires-Dist: requests\nProvides-Extra: dev\nRequires-Dist: pytest ; extra == 'dev'\n\n")]

uranusjr commented 1 year ago

This sounds like a setuptools bug to me. The dist-info directory in the temporary directory is generated by the build backend (setuptools) and pip only reads the result.

pradyunsg commented 1 year ago

This is definitely a setuptools bug. Transferring the issue over. :)

abravalheri commented 1 year ago

Hi @yurinnick thank you very much for reporting, I will have a look on this.

Distribution loaded from unicode_bug.egg_info directory has valid metadata [1]\nDiscribution loaded from temporary /tmp/pip-modern-metadata-0q88s1hj/unicode_bug-0.1.0.dist-info has invalid metadata [2]

This is interesting. Is the PKG-INFO file in the top of the sdist OK? I had a quick view in the code and it seems that we do use UTF-8 to read the file pointed out in pyproject.toml. And then we use UTF-8 again to write PKG-INFO.

One thing that might be related is that setuptools produces PKG-INFO files and those are later transformed into METADATA by pypa/wheel. But then pypa/wheel also seems to deal with the files using UTF-8.

The code contains some places that we use email.message_from_file (via distutils) to read PKG-INFO when building from the sdist, this might be where UTF-8 is not explicitly used. But I am not sure this is used, I need to investigate better.

If you build the project skipping the sdist step (e.g. python -m build --wheel) is the metadata file still ill-formed?

yurinnick commented 1 year ago

PKG-INFO in egg-info directory has the unicode character, but gets parsed correctly. LICENSE file from dist-info directory also contains unicode, but METADATA file doesn't.

Here are the files in the hex form: https://gist.github.com/yurinnick/94fc3d9b19e0bf6270f7dd332f4b0e61

pfmoore commented 1 year ago

One thing confuses me. You keep referring to a "unicode character", but the reproducer you gave (which, as I said, works fine for me on Windows) uses the character \U000b, which is a perfectly ordinary ASCII character - 'LINE TABULATION' (U+000B).

Is the problem here caused by the fact that this character is whitespace, and as such is getting stripped at some point due to something removing trailing whitespace from lines?

yurinnick commented 1 year ago

The one in the license I believe is \U000c, I may mixed something up during my experiments though. I am not knowledgeable in encodings, so I assumed it's a unicode symbol: https://www.fileformat.info/info/unicode/char/000c/index.htm

abravalheri commented 1 year ago

I added some debug statements to see when the transformation is happening and it seems to come from https://github.com/pypa/wheel/blob/0.41.2/src/wheel/bdist_wheel.py#L587:

...
       serialization_policy = EmailPolicy(  # email.policy.EmailPolicy
            utf8=True,
            mangle_from_=False,
            max_line_length=0,
        )
        with open(pkg_info_path, "w", encoding="utf-8") as out:
            Generator(out, policy=serialization_policy).flatten(pkg_info)  # email.generator.Generator
...

This is how I obtained the info:

(patch format)

```diff diff .venv/lib/python3.10/site-packages/wheel/orig_bdist_wheel.py .venv/lib/python3.10/site-packages/wheel/bdist_wheel.py 558a559,563 > with open(pkginfo_path, "rb") as fp: > print(f"\n\n*** {__file__=} {sys._getframe().f_code.co_name=} ***") > print("---------------------PKG-INFO-------------------------->>") > print(repr(fp.read())) > print("<<---------------------PKG-INFO--------------------------") 559a565,569 > metadata = pkg_info > > print("\n\n---------------------LICENSE-------------------------->>") > print(f"{metadata['License']=!r}") > print("<<---------------------LICENSE--------------------------") 587a598,602 > > with open(pkg_info_path, "rb") as fp: > print("\n\n---------------------METADATA-------------------------->>") > print(repr(fp.read())) > print("<<---------------------METADATA--------------------------\n\n") ```

And this is the output (when I run python -m build --no-isolation[^2]):

...

*** __file__='/tmp/myproj/.venv/lib64/python3.10/site-packages/wheel/bdist_wheel.py' sys._getframe().f_code.co_name='egg2dist' ***
---------------------PKG-INFO-------------------------->>
b'Metadata-Version: 2.1\nName: unicode-bug\nVersion: 0.1.0\nLicense: \x0b\n        \nRequires-Python: >=3.10\nProvides-Extra: dev\nLicense-File: LICENSE\n'
<<---------------------PKG-INFO--------------------------

---------------------LICENSE-------------------------->>
metadata['License']='\x0b\n        '
<<---------------------LICENSE--------------------------

---------------------METADATA-------------------------->>
b"Metadata-Version: 2.1\nName: unicode-bug\nVersion: 0.1.0\nLicense: \n\n        \nRequires-Python: >=3.10\nLicense-File: LICENSE\nRequires-Dist: requests\nProvides-Extra: dev\nRequires-Dist: pytest ; extra == 'dev'\n\n"
<<---------------------METADATA--------------------------

...

Now, I am not sure why stdlib's email.policy.EmailPolicy is replacing the UTF-8 character if utf8=True... Maybe that is part of the spec in the email message RFC? I am not sure how to tackle this.

Maybe it is worth to discuss with pypa/wheel? Do the pip maintainers have any suggestions?

[^2]: I use --no-isolation to force the "patched" version of wheel to be used instead of build downloading a new one.

Something very similar happens if you replace \U000b with \U000c.

pfmoore commented 1 year ago

Hang on. License-Path isn't valid metadata. So presumably it's a setuptools-specific thing, and bdist_wheel is trying to make it into a valid value for the License field?

It looks like bdist_wheel is failing to "properly"[^1] escape that value, resulting in a metadata file with an unescaped \n\n in the license field, which is then treated as the end of the headers with everything following being dumped into the description (because it's now the message body).

[^1]: I have no idea what proper escaping would be for a multi-line value, it's not at all well-defined if I recall correctly.

abravalheri commented 1 year ago

The License-File is left unprocessed (which is unspecified, but was implemented in setuptools by a contribution in the context of PEP 639 a few years ago[^1]).

The License field is already originally present in the PKG-INFO file (and it contains the UTF-8 char). When project.licence.file is given in pyproject.toml, setuptools will read the file (using UTF-8) and populate the License core-metadata (that seems to be what is meant to happen by PEP 621).

[^1]: I initially thought it was a contribution of the original author of PEP 639, but I was mistaken.

abravalheri commented 1 year ago

@agronholm do you have any thoughts in the matter? (see https://github.com/pypa/setuptools/issues/4033#issuecomment-1698853282).

The escaping behaviour seems to be inherited by bdist_wheel from the standard library, but I am not sure if it is intentional or accidental (in the stdlib, since utf8=True is passed to the EmailPolicy)

pfmoore commented 1 year ago

Ah. I missed that. Looks like it's translating \x0b to\n` in that case.

Yes:

from email.policy import EmailPolicy
policy = EmailPolicy(utf8=True, mangle_from_=False, max_line_length=0)
print(repr(policy.fold("License", "a\x0bb\n")))

produces

❯ py .\em.py
'License: a\nb\n'

So that policy doesn't handle arbitrary control characters correctly...

It's worth noting that the specification is (deliberately) vague on how to serialise metadata to a file:

However, email formats have been revised several times, and exactly which email RFC applies to packaging metadata is not specified. In the absence of a precise definition, the practical standard is set by what the standard library email.parser module can parse using the compat32 policy.

So there's no formal answer to the question "how do we handle data like this?" - other than "don't do that, then" 🙁

pfmoore commented 1 year ago

The following is worth noting:

>>> m = email.message_from_string("Foo: a\x0b\x0bc\nBar: 12\n\nBody") 
>>> m.as_string()
'Foo: a\nc\nBar: 12\n\nBody'
>>> m["Foo"]
'a\x0b\x0bc'

The email module can parse data where header values contain values like \x0b, but it can't produce them. So it doesn't round-trip correctly.

abravalheri commented 1 year ago

One minor comment is that the compat32 policy will do a different escaping that could potentially work (I haven't tested)[^1]. The compat32 is however a bit problematic because it does not support UTF-8.

If we replace the EmailPolicy(utf8=True, ...) with compat32 we are likely to start receiving reports of people's names/readmes not rendering properly on PyPI...

[^1]: It still removes \x0b but it does not seem to add a second \n: print(repr(compat32.fold("License", '\x0b\n '))) => 'License: \n'

pfmoore commented 1 year ago

Like it or not, compat32 (for reading) is what the standard says. I omitted an additional part of the spec:

Whenever metadata is serialised to a byte stream (for example, to save to a file), strings must be serialised using the UTF-8 encoding.

And what does (mostly) work is to read the file using UTF-8 into a string, and then parse with email.message_from_string.

>>> m = email.message_from_string("Name: André")                      
>>> m["Name"]
'André'

You can't round-trip like this, and the email module (and the spec!) gives you little or no help in writing such a metadata file, so I agree that bdist_wheel has a problem with creating a file like this, but if you can create one, it's readable.

IMO, the temporary solution here is probably for setuptools to just refuse to allow control characters in a license file. The UTF-8 question is something of a red herring here, as the current behaviour with UTF-8 is fine, it's only control character handling that seems broken.

Longer term, license file handling will have to be dealt with as part of PEP 639, and if the problem of writing arbitrary user-entered text into metadata values other than license and description becomes a problem, someone will need to bite the bullet and propose a standard for how to encode such data (or we switch to a different metadata serialisation format, like JSON...)

agronholm commented 1 year ago

So is there some metadata getting used by wheel that should be omitted?

pfmoore commented 1 year ago

@agronholm The problem is basically that if PKG-INFO contains text like b"License: \x0b\n \n", bdist_wheel isn't encoding it correctly (i.e., in a way that can be read back).

This is because neither the metadata spec nor the email standards really say what to do with control characters in header values, and the stdlib email package appears to handle it inconsistently.

My recommendation is to reject characters like \x0b - either just fail with a helpful error, or strip them out somehow.

agronholm commented 1 year ago

Are newlines still allowed though?

abravalheri commented 1 year ago

This is because neither the metadata spec nor the email standards really say what to do with control characters in header values,

The current stdlib implementation uses .splitlines() in the header value and then re-joins it. Some control characters will be interpreted as a newline:

>>> import itertools, pprint
>>> pprint.pprint({f"0x{i:02x}": (chr(i), f"{chr(i)}\n".splitlines()) for i in itertools.chain(range(0x1f), [0x7f])})
{'0x00': ('\x00', ['\x00']),
 '0x01': ('\x01', ['\x01']),
 '0x02': ('\x02', ['\x02']),
 '0x03': ('\x03', ['\x03']),
 '0x04': ('\x04', ['\x04']),
 '0x05': ('\x05', ['\x05']),
 '0x06': ('\x06', ['\x06']),
 '0x07': ('\x07', ['\x07']),
 '0x08': ('\x08', ['\x08']),
 '0x09': ('\t', ['\t']),
 '0x0a': ('\n', ['', '']),
 '0x0b': ('\x0b', ['', '']),
 '0x0c': ('\x0c', ['', '']),
 '0x0d': ('\r', ['']),
 '0x0e': ('\x0e', ['\x0e']),
 '0x0f': ('\x0f', ['\x0f']),
 '0x10': ('\x10', ['\x10']),
 '0x11': ('\x11', ['\x11']),
 '0x12': ('\x12', ['\x12']),
 '0x13': ('\x13', ['\x13']),
 '0x14': ('\x14', ['\x14']),
 '0x15': ('\x15', ['\x15']),
 '0x16': ('\x16', ['\x16']),
 '0x17': ('\x17', ['\x17']),
 '0x18': ('\x18', ['\x18']),
 '0x19': ('\x19', ['\x19']),
 '0x1a': ('\x1a', ['\x1a']),
 '0x1b': ('\x1b', ['\x1b']),
 '0x1c': ('\x1c', ['', '']),
 '0x1d': ('\x1d', ['', '']),
 '0x1e': ('\x1e', ['', '']),
 '0x7f': ('\x7f', ['\x7f'])}

supakeen commented 11 months ago

Ran into this issue; it's of note that some standard LICENSE texts will contain characters that end up being incorrectly escaped thus generating a PKG-INFO that's 'broken'.

pypa / setuptools