pypa / setuptools

Official project repository for the Setuptools build system
https://pypi.org/project/setuptools/
MIT License

[BUG] Wheel naming is not following PEP 491 convention #3777

Open bastienqb opened 1 year ago

bastienqb commented 1 year ago

setuptools version

setuptools==66.0.0

Python version

python 3.8

OS

macOS

Additional environment information

No response

Description

I am building a wheel for a python package using setuptools and it seems that the naming of my wheel file is not respecting the PEP 491 convention.

For external reasons, I need to name my package with the structure {namespace}.{package-name}. If I follow the convention, I would expect that my wheel file is named: namespace_package_name-0.1.0-py2.py3-none-any.whl.

However, I get this name for my wheel: namespace.package_name-0.1.0-py2.py3-none-any.whl, which is not respecting the convention.

Expected behavior

I expect the "." in the package name to be replaced with a "_" in the wheel name.

How to Reproduce

  1. create a hello_world package structure:
    hello_world
    |- my_package
    |    |- __init__.py
    |    |- hello_world.py
    |- pyproject.toml
    |- setup.cfg
  2. in setup.cfg, write:

    [metadata]
    name = namespace.my-package
    version = 0.1.0
    
    [options]
    zip_safe = False
    include_package_data = True
    packages = find:
  3. in pyproject.toml, write:
    [build-system]
    requires = ["setuptools==66.0.0", "wheel"]
    build-backend = "setuptools.build_meta"
  4. run pipx run build at the root of your hello_world package
  5. inspect the dist directory which was created
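
Step 5 can also be checked with a short script instead of reading the listing by eye. This is a minimal sketch, not part of the original report: the helper name `has_unescaped_name` is hypothetical, and it assumes the wheel filename layout {distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl, with no hyphen inside the distribution component.

```python
import re
from pathlib import Path


def has_unescaped_name(wheel_filename: str) -> bool:
    """Return True if the distribution component of a wheel filename contains
    anything other than letters, digits, or underscores (for example a '.')."""
    # Assumption: the distribution name is everything before the first '-'.
    distribution = wheel_filename.split("-", 1)[0]
    return re.fullmatch(r"[A-Za-z0-9_]+", distribution) is None


if __name__ == "__main__":
    # Assumes the script is run from the project root after the build.
    dist_dir = Path("dist")
    if dist_dir.is_dir():
        for wheel in sorted(dist_dir.glob("*.whl")):
            status = "NOT escaped" if has_unescaped_name(wheel.name) else "ok"
            print(f"{wheel.name}: {status}")
```

With the two filenames quoted in the description, the function flags the one containing "." and accepts the one using "_".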

Output

in the dist folder, you find:

jaraco commented 1 year ago

As much as I dislike unnecessary name mangling, this report does appear to be correct. I'm unsure if Setuptools is responsible for the naming or if the wheel package is. Regardless, it probably should be fixed.

@dholth is there any chance the PEP could be updated to allow . characters in the wheel name? What was the motivation for mangling them? They're an intrinsic, important character in the Python package name.

abravalheri commented 1 year ago

@jaraco I believe that the PEP originally allowed for ., but the living spec was changed as a result of a discussion:

https://github.com/pypa/packaging.python.org/pull/844

(Although after a quick look at the Discourse thread, it looks to me that the . character, specifically, was not really debated and ended up being changed accidentally)

mgorny commented 1 year ago

The current spec seems to say that full name normalization should happen (i.e. lower case + runs of special chars to underscore), and from my quick test "newer" backends all follow that.
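
The rule described in this comment can be sketched in a few lines. This only illustrates the normalisation as stated here (lower case, runs of ".", "-" and "_" collapsed into a single underscore); it is not the code of any particular backend, and the function name `escape_for_filename` is hypothetical.

```python
import re


def escape_for_filename(name: str) -> str:
    """Lower-case the name and collapse runs of '-', '_' and '.' into one '_'."""
    return re.sub(r"[-_.]+", "_", name).lower()


print(escape_for_filename("namespace.my-package"))  # namespace_my_package
print(escape_for_filename("zope.interface"))        # zope_interface
```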

From distribution viewpoint, we'd also prefer setuptools following, as otherwise we end up with unpredictable filenames (with some backends producing normalized names with others not).

mgorny commented 1 year ago

I'm unsure if Setuptools is responsible for the naming or if the wheel package is.

Apparently wheel is, with the wheel_dist_name() function in bdist_wheel.py. There's https://github.com/pypa/wheel/issues/440 which seems to tackle this, though the bug title talks of .dist-info naming.

abravalheri commented 1 year ago

Hi @jaraco, there was a discussion recently on the Python discourse about the normalisation of the distribution file names https://discuss.python.org/t/change-in-pypi-upload-behavior-intentional-accidental-pebkac/27707. I will try to summarise the key takeaways I found on why most of the community seems to be in favour of the normalisation. Hopefully this answers the question "What was the motivation for mangling them?":

  1. PyPI (as a public package index) has very strong reasons for enforcing strict uniqueness checks (security reasons, competition between publishers that might confuse users, etc…). Therefore it is not viable to differentiate between distributions named after “normal packages” and namespace packages on PyPI.
  2. pip, whose primary use case is to download from PyPI, prefers to rule out the possibility of treating distributions named after namespace packages and “normal packages” as two different distributions. This is compatible with PyPI and also helps users to fix unintentional typing errors and avoid downloading wrong/malicious distributions.
  3. Members of the community argue that having one normalisation rule applied everywhere would be simpler.
  4. There is some advantage in normalising the .dist-info/.egg-info directory (faster lookup), and if I understood correctly this would also help to optimise the checks for conflicting distributions already installed (since .dist-info serves as a database).
  5. Private indexes have to follow PEP 503 and do name normalisation. So it is not possible for distributions named a.b and a_b to coexist in the same private index.
  6. There is some level of agreement that Name in the PKG-INFO/METADATA files should not be normalised and reflect the user's input.

So it seems that the name change unlocks optimisations and simplifications.

@bastienqb, if you would like, you can chip in to the discussion on https://discuss.python.org/t/change-in-pypi-upload-behavior-intentional-accidental-pebkac/27707 to explain why keeping the names in the format {namespace}.{package-name} is important. Otherwise there seems to be a push in the community for a strict standard that normalises the file name (as a means to unlock the optimisations and simplifications I mentioned before).

jaraco commented 1 year ago

Unfortunately, I don't think that answers my question - "why is . normalized to _?". They're very clearly different separators and have very different semantic meaning in Python. That is, if a Python user can't tell the difference between those characters, they're already headed for disaster.

Moreover, if the goal is to collapse any characters that a user might find confusing, it suggests that other normalization should occur. By this logic, PyPI should probably also normalize "I" and "l", maybe "j" and "i", "3" and "e", and probably others.

Since there's a strong push toward PyPI names being valid Python identifiers and since "jaraco.collections" and "jaraco_collections" are very much different Python identifiers, I feel strongly that either or both names should be allowed and should be different packages.

I'm very much in support of normalizing for security and to limit the diversity of the namespace and to do that in a way that's largely transparent to the user. What I'd really like to avoid is users seeing "downloading zope_interface" when the package they're downloading is "zope.interface" and the Python package that's installed is zope.interface.

The most important factor here is not to give namespace packages a second-class experience, and that's exactly what they'll get if they follow the convention of naming the package by mapping the Python package to the Distribution package name and the . gets replaced by _ in user-visible locations.

dstufft commented 1 year ago

There's some confusion happening here.

Regardless of what happens, PyPI (and everyone else) is going to treat ., -, and _ as equal characters. This behavior has existed since basically the dawn of time in PyPI, setuptools, pip, etc. This isn't any different from the fact that we treat F as equal to f. This has been the status quo for ~20 years, and it isn't likely to change.

Some confusion came out of the specs: some people interpreted them as saying that the Name field inside METADATA should be normalized. I don't believe there is widespread support for that, and PyPI does not require it. I think the people who believe that have essentially just misread the specs, and I'm preparing a PEP that will clarify that the Name field (and thus ultimately the "canonical" name) should be used in any user-visible locations. So when someone looks at the project on PyPI, or wherever, it should use the name as it exists in the Name field.

On PyPI we normalize the name in the Simple API URLs only. So for zope.interface the simple API URL is /simple/zope-interface/. We do not consider this a user visible location, it's part of the API contract between an installer and PyPI. From a practical standpoint pip has to be able to take a user entered name and get the URL, and if we didn't do this normalization in the url, then pip install django would fail (because it's Django not django), etc.
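
The URL normalisation described above can be sketched as follows. The regular expression is the one given in PEP 503; the helper `simple_api_path` is a hypothetical name used only for illustration and is not PyPI's or pip's actual code.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalisation: runs of '-', '_' and '.' become '-', lower-cased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_api_path(name: str) -> str:
    """Hypothetical helper: build the Simple API path for a user-entered name."""
    return f"/simple/{normalize(name)}/"


print(simple_api_path("zope.interface"))  # /simple/zope-interface/
print(simple_api_path("Django"))          # /simple/django/
# Names differing only in '.', '-', '_' or case map to the same project.
print(normalize("jaraco.collections") == normalize("jaraco_collections"))  # True
```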

The question is largely around filenames. Does zope.interface need to produce a wheel named zope.interface-1.0.whatever.whl, or can it produce a wheel named zope_interface-1.0.whatever.whl? Note, of course, that no matter what we choose, a package named foo-bar is never going to have its name represented exactly in the wheel.

The specs as they're currently written decide that the filename is not a user-facing value, and treat filenames much like the URLs in the Simple API: an interchange format between computer systems. Of course, filenames are also a little more visible than Simple API URLs; they do appear (as filenames) in the PyPI UI, etc.

So ultimately the question is:

  1. Given that zope.interface and zope-interface and zope_interface are all the same name as far as packaging is concerned.
  2. Given that the project's "canonical" name for display is zope.interface.
  3. Given that the PyPI index url for the project is going to be /simple/zope-interface/.

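Is it OK for the filenames to be normalized (zope_interface-1.0.whatever.whl), or MUST they preserve the name as written (zope.interface-1.0.whatever.whl)?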

CAM-Gerlach commented 1 year ago

Unfortunately, I don't think that answers my question - "why is . normalized to _?".

@abravalheri shared your question here on the linked thread, and ended up being convinced by the chorus of responses, so I'll try to summarize the main reasons other PyPA maintainers cited:

However, there was equally strong support for only applying normalization to the identifiers that are not primarily user-facing, i.e. the artifact filenames and the .dist-info, and mandating that the METADATA Name field not be normalized, and that tools should always use that value whenever presenting the project name in a user-facing context (or if they do happen to rely on the distinction). This seems to address your main overriding concern—that the project name be presented to the user as the author intended.

Therefore, it seems a PEP formally declaring that Name MUST NOT be normalized and SHOULD always be what is presented to the user, while also stating that it MUST be normalized in new archive filenames and .dist-info, would come closest to giving everyone most of what they want here without regressing on the de facto status quo for either. As Dustin summarizes on the thread, that status quo is a mess for everyone involved, especially maintainers with . in their project names, which was actually what kick-started that discussion in the first place.

jaraco commented 1 year ago

I concede. It doesn't matter what the motivation was to consider . and _ equivalent, but they are now by consensus.

CAM-Gerlach commented 1 year ago

Just to be clear, this was only the case for sdist and wheel filenames, dist-info directories and when requesting a package by name from an index. There was also strong consensus that they should not be considered equivalent in the canonical project name, the Name field of pyproject.toml, PKG-INFO and METADATA, and the display name for user consumption, and that these should be kept exactly as originally written by the project author.