Open wayphinder opened 1 year ago
Marking this as a bug because this seems to be in violation of the current binary distribution specification, which says:
In distribution names, ...uppercase characters should be replaced with corresponding lowercase ones.
and:
Version numbers should be normalised according to PEP 440.
Before fixing this, we should determine what proportion of recently uploaded filenames would be considered invalid, determine which build backends (if any) are producing invalid filenames, and attempt to fix them to produce normalized filenames if possible.
We should probably also do a standard deprecation/warning period before blocking upload for these.
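For illustration, a minimal sketch of what a name-normalization check on a wheel filename could look like. It assumes the elided part of the rule quoted above is the replacement of runs of `-`, `_` and `.` with a single `_`, as in the current binary distribution format spec. The function names are hypothetical and are not part of Warehouse. Version normalization per PEP 440 is left out, since it needs a real version parser.

```python
import re


def normalize_wheel_name(name: str) -> str:
    """Normalize the distribution-name component of a wheel filename.

    Assumption: runs of '-', '_' and '.' collapse to a single '_',
    and uppercase characters are lowercased.
    """
    return re.sub(r"[-_.]+", "_", name).lower()


def has_normalized_name(filename: str) -> bool:
    """Return True if the name component of the filename is already normalized."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    # In a wheel filename the distribution name is everything before the first '-'.
    name = filename.split("-", 1)[0]
    return name == normalize_wheel_name(name)


print(has_normalized_name("pymem-1.13.1-py3-none-any.whl"))
print(has_normalized_name("Pymem-1.13.1-py3-none-any.whl"))
```

A check like this only flags the name component. A filename can still be invalid because of an unnormalized version.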
Apologies for barging in here, but I came across the issue and this part of the question intrigued me:
Before fixing this, we should determine what proportion of recently uploaded filenames would be considered invalid
So I took a stab at this using the public BigQuery dataset (I have no clue whether it is a complete record of package uploads, but according to the table metadata it was last updated on October 2nd) and built a simple GROUP BY on project name, version, and lowercased filename to select all instances of duplicate filename uploads in 2023. A CSV of the query results can be found here.
For reference this is the query:
SELECT b.name, b.version, b.filename
FROM `bigquery-public-data.pypi.distribution_metadata` AS b
RIGHT JOIN (
  SELECT name, version, COUNT(name) AS versions, MIN(filename) AS lower_filename
  FROM `bigquery-public-data.pypi.distribution_metadata`
  WHERE upload_time > TIMESTAMP(DATE "2023-01-01")
    AND packagetype = 'bdist_wheel'
  GROUP BY name, version, LOWER(filename)
  HAVING versions > 1
) AS a
ON LOWER(a.lower_filename) = LOWER(b.filename)
The results linked above include false positives. For example, the first two rows are both "start_ocr-0.0.3-py3-none-any.whl" (with no difference between them), and indeed there is only one such file at https://pypi.org/project/start-ocr/0.0.3/#files
Nevertheless, this mistake really does happen. I'm here because I ran across https://pypi.org/project/Pymem/1.13.1/#files, which has both "pymem-1.13.1-py3-none-any.whl" and "Pymem-1.13.1-py3-none-any.whl".
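One way to separate the two cases is to collapse identical filenames before grouping, so that only names that differ in case are reported. This is a rough sketch, assuming the false positives come from the same filename showing up in more than one result row. The function name and the sample list are made up for illustration.

```python
from collections import defaultdict


def case_only_duplicates(filenames):
    """Group filenames that become identical once lowercased.

    Identical strings are collapsed into a set first, so a filename that
    merely appears twice in the input is not reported as a duplicate.
    """
    groups = defaultdict(set)
    for filename in filenames:
        groups[filename.lower()].add(filename)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


rows = [
    "start_ocr-0.0.3-py3-none-any.whl",
    "start_ocr-0.0.3-py3-none-any.whl",  # same row twice: not a real duplicate
    "pymem-1.13.1-py3-none-any.whl",
    "Pymem-1.13.1-py3-none-any.whl",     # differs only in case
]
print(case_only_duplicates(rows))
```

With the sample rows, only the Pymem pair is reported.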
Thanks for doing the analysis! The number of duplicates is quite low, but we should really be checking for the occurrence of invalid filenames, even if there isn't a duplicate. I think there will probably be many more.
What's the problem this feature will solve?
It’s currently possible to upload duplicate distributions like the following
When installing requests==1.0.0, Poetry would select the first distribution and pip the last.
Describe the solution you'd like
For sdists this will be fixed by https://github.com/pypi/warehouse/issues/12245, and the same type of enforcement can be applied to bdists, i.e. enforce the normalization rules for the filenames and verify that the version in the filename matches the version in the metadata.
Build numbers would still enable bdist filenames that are otherwise duplicates. I have not looked closely enough at tags to know if they could also be used to make duplicate distributions.
It could be possible to implement a check for duplicate distributions before the normalization rules are enforced for both bdists and sdists.
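To make the build-number case concrete, here is a small sketch that splits a wheel filename into its components and groups files that differ only by the optional build tag. It assumes the standard `{name}-{version}[-{build}]-{python}-{abi}-{platform}.whl` layout. The helper names and the sample filenames are hypothetical, and no name or version normalization is applied.

```python
from collections import defaultdict


def wheel_key(filename: str):
    """Return (name, version, python tag, abi tag, platform tag), dropping any build tag."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, py_tag, abi_tag, plat_tag = parts
    elif len(parts) == 6:
        # The optional build tag sits between the version and the python tag.
        name, version, _build, py_tag, abi_tag, plat_tag = parts
    else:
        raise ValueError(f"unexpected number of components: {filename!r}")
    return (name, version, py_tag, abi_tag, plat_tag)


def differ_only_by_build(filenames):
    """Return groups of distinct filenames that share everything but the build tag."""
    groups = defaultdict(set)
    for filename in filenames:
        groups[wheel_key(filename)].add(filename)
    return [sorted(names) for names in groups.values() if len(names) > 1]


files = [
    "requests-1.0.0-py3-none-any.whl",
    "requests-1.0.0-1-py3-none-any.whl",  # same wheel with build tag "1"
    "requests-1.0.0-py2-none-any.whl",    # different python tag, not a duplicate
]
print(differ_only_by_build(files))
```

Whether such files should count as duplicates is a policy question. The sketch only shows that they are distinguishable by build tag alone.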
Additional context
Related meta issue: https://github.com/pypi/warehouse/issues/12316