pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Make bdist filename verification more strict #14602

Open wayphinder opened 1 year ago

wayphinder commented 1 year ago

What's the problem this feature will solve?

It’s currently possible to upload duplicate distributions like the following:

Requests-1.0.0-py3-none-any.whl
requests-1.0.0-py3-none-any.whl
requests-01.0.0-py3-none-any.whl

When installing requests==1.0.0, Poetry would select the first distribution and pip the last.

Describe the solution you'd like

For sdists this will be fixed by https://github.com/pypi/warehouse/issues/12245, and the same type of enforcement can be made for bdists: enforce the normalization rules for the filenames, and verify that the version in the filename matches the version in the metadata.

Build numbers would still enable bdist filenames that are otherwise duplicates. I have not looked closely enough at tags to know if they could also be used to make duplicate distributions.

It could be possible to implement a check for duplicate distributions before the normalization rules are enforced for both bdists and sdists.
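
A minimal sketch of such a duplicate check, assuming wheel filenames of the form name-version[-build]-python-abi-platform.whl. The helper names are hypothetical and not part of Warehouse, and the version handling is a deliberate simplification that only strips leading zeros from purely numeric releases rather than implementing PEP 440:

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    # Collapse runs of "-", "_" and "." into "_" and lowercase the result,
    # as wheel filenames are expected to spell the distribution name.
    return re.sub(r"[-_.]+", "_", name).lower()


def normalize_simple_version(version: str) -> str:
    # Simplification (assumption): only purely numeric releases such as
    # "01.0.0" are normalized; anything else is just lowercased.
    if re.fullmatch(r"\d+(\.\d+)*", version):
        return ".".join(str(int(part)) for part in version.split("."))
    return version.lower()


def duplicate_key(filename: str) -> tuple:
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    # The optional build tag stays in the key, so wheels that differ only
    # by build number are not reported as duplicates.
    build = parts[2] if len(parts) == 6 else ""
    return (
        normalize_name(parts[0]),
        normalize_simple_version(parts[1]),
        build,
        tuple(parts[-3:]),
    )


def find_duplicates(filenames):
    groups = defaultdict(list)
    for filename in filenames:
        groups[duplicate_key(filename)].append(filename)
    return {key: names for key, names in groups.items() if len(names) > 1}
```

With this sketch, the three filenames listed above all map to the same key, while a file that differs only by build number gets a key of its own.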

Additional context

Related meta issue: https://github.com/pypi/warehouse/issues/12316

di commented 1 year ago

Marking this as a bug because this seems to be in violation of the current binary distribution specification, which says:

In distribution names, ...uppercase characters should be replaced with corresponding lowercase ones.

and:

Version numbers should be normalised according to PEP 440.

Before fixing this, we should determine what proportion of recently uploaded filenames would be considered invalid, determine which build backends (if any) are producing invalid filenames, and attempt to fix them to produce normalized filenames if possible.

We should probably also do a standard deprecation/warning period before blocking upload for these.

TheDutchDevil commented 1 year ago

Apologies for barging in here, but I came across the issue and this part of the question intrigued me:

Before fixing this, we should determine what proportion of recently uploaded filenames would be considered invalid

So I took a stab at this using the BigQuery dataset (I have no clue whether it is a complete record of package uploads, but according to the table's metadata it was updated on October 2nd) and built a simple GROUP BY on project name, version, and lowercased filename to select all instances of duplicate filename uploads in 2023. A CSV of the query results can be found here.

For reference this is the query:

SELECT b.name, b.version, b.filename
FROM `bigquery-public-data.pypi.distribution_metadata` AS b
RIGHT JOIN (
  SELECT name, version, COUNT(name) AS versions, MIN(filename) AS lower_filename
  FROM `bigquery-public-data.pypi.distribution_metadata`
  WHERE
    upload_time > TIMESTAMP(DATE "2023-01-01") AND
    packagetype = 'bdist_wheel'
  GROUP BY name, version, LOWER(filename)
  HAVING versions > 1
) AS a
ON LOWER(a.lower_filename) = LOWER(b.filename)

dimbleby commented 2 months ago

The results linked above include false positives, e.g. the first two rows are both "start_ocr-0.0.3-py3-none-any.whl" (with no difference between them), and indeed there is only one such file at https://pypi.org/project/start-ocr/0.0.3/#files

Nevertheless, this mistake is a thing that really happens. I'm here because I ran across https://pypi.org/project/Pymem/1.13.1/#files, which has both "pymem-1.13.1-py3-none-any.whl" and "Pymem-1.13.1-py3-none-any.whl"
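
One way to separate the two cases, sketched here on the assumption that the false positives come from identical rows appearing more than once in the query input: group on the lowercased filename as the query does, but count distinct spellings rather than rows. The function name and the sample rows are illustrative only, built from the files named in this thread.

```python
from collections import defaultdict


def case_variant_duplicates(rows):
    """Return groups whose filenames differ only by letter case.

    rows is an iterable of (name, version, filename) tuples.
    """
    spellings = defaultdict(set)
    for name, version, filename in rows:
        # Same grouping as the query: name, version, lowercased filename.
        spellings[(name, version, filename.lower())].add(filename)
    # A repeated identical row adds nothing to the set, so only groups
    # with more than one distinct spelling are kept.
    return {key: sorted(names) for key, names in spellings.items() if len(names) > 1}


# Illustrative rows (assumed shape, not real query output).
rows = [
    ("start-ocr", "0.0.3", "start_ocr-0.0.3-py3-none-any.whl"),
    ("start-ocr", "0.0.3", "start_ocr-0.0.3-py3-none-any.whl"),
    ("Pymem", "1.13.1", "pymem-1.13.1-py3-none-any.whl"),
    ("Pymem", "1.13.1", "Pymem-1.13.1-py3-none-any.whl"),
]
result = case_variant_duplicates(rows)
```

Here only the Pymem group is reported; the repeated start_ocr row is ignored because both rows carry the same spelling.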

di commented 2 months ago

Thanks for doing the analysis! The number of duplicates is quite low, but we should really be checking for the occurrence of invalid filenames, even if there isn't a duplicate. I think there will probably be many more.
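
As a rough sketch of that kind of check, the following flags any wheel filename whose name component is not already in normalized form, whether or not a duplicate exists. It assumes the binary distribution specification's escaping rule (runs of "-", "_" and "." become "_", uppercase becomes lowercase), covers only the name component and not the version, and uses hypothetical function names:

```python
import re


def expected_name(name: str) -> str:
    # Escaping rule for the name component of a wheel filename.
    return re.sub(r"[-_.]+", "_", name).lower()


def non_normalized_names(filenames):
    """Return (filename, expected name component) for each invalid filename."""
    findings = []
    for filename in filenames:
        # The name component is everything before the first "-".
        name = filename.split("-", 1)[0]
        wanted = expected_name(name)
        if name != wanted:
            findings.append((filename, wanted))
    return findings
```

Run over a list of uploaded filenames, this would report "Pymem-1.13.1-py3-none-any.whl" on its own, even if the lowercase variant had never been uploaded.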