njsmith / posy

Ignore sdists with malformed names #9

Open njsmith opened 1 year ago

njsmith commented 1 year ago

When reading https://pypi.org/simple/cffi, we currently see cffi-1.0.2-2.tar.gz and parse it as name: cffi-1.0.2, version: 2. And then in PackageDB::available_artifacts("cffi"), we end up filing this under version 2.

I don't think we can parse this sdist name in general -- at least without breaking much more common cases like scikit-learn-1.0.2.tar.gz. But a very simple thing we could do is, when reading a simple API page, ignore all entries whose name doesn't match the simple API page we're looking at!
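A rough sketch of that filter, assuming a naive "split on the last dash" parser and PEP 503-style name normalization. The function names here (`normalize_name`, `split_sdist_filename`, `filter_page_entries`) are made up for illustration and are not posy's actual API. It only looks at sdist filenames, so a real version would need to let wheels through separately.

```rust
/// PEP 503-style normalization: lowercase, and collapse each run of
/// '-', '_' or '.' into a single '-'.
fn normalize_name(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    let mut prev_sep = false;
    for c in name.chars() {
        if matches!(c, '-' | '_' | '.') {
            if !prev_sep {
                out.push('-');
            }
            prev_sep = true;
        } else {
            out.push(c.to_ascii_lowercase());
            prev_sep = false;
        }
    }
    out
}

/// Naive legacy split (hypothetical helper): everything before the last '-'
/// is taken as the name, everything after it as the version.
fn split_sdist_filename(filename: &str) -> Option<(&str, &str)> {
    let stem = filename
        .strip_suffix(".tar.gz")
        .or_else(|| filename.strip_suffix(".zip"))?;
    stem.rsplit_once('-')
}

/// Keep only the sdist filenames whose parsed name matches the project
/// whose simple API page is being read.
fn filter_page_entries<'a>(project: &str, filenames: &[&'a str]) -> Vec<&'a str> {
    let wanted = normalize_name(project);
    filenames
        .iter()
        .copied()
        .filter(|f| match split_sdist_filename(f) {
            Some((name, _version)) => normalize_name(name) == wanted,
            None => false,
        })
        .collect()
}
```

With this, `cffi-1.0.2-2.tar.gz` parses to the name `cffi-1.0.2`, which doesn't match `cffi`, so it gets dropped from the cffi page instead of being filed under version 2.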

(I guess we could also get fancier, and try to use the simple API page to bias the sdist name parsing? But I think stuff like cffi-1.0.2-2.tar.gz is super rare and we can probably just skip it.)
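For reference, the "fancier" variant might look roughly like this: try each dash as the name/version boundary and take the first split whose name matches the page. `split_with_hint` and `normalize` are hypothetical names, the normalization is a simplified stand-in for PEP 503, and the returned version string would still have to go through normal version parsing, which may reject it.

```rust
/// Simplified PEP 503-style normalization (assumption: names never start
/// or end with a separator).
fn normalize(name: &str) -> String {
    name.split(|c: char| matches!(c, '-' | '_' | '.'))
        .filter(|s| !s.is_empty())
        .map(|s| s.to_ascii_lowercase())
        .collect::<Vec<_>>()
        .join("-")
}

/// Try every '-' as the name/version boundary, left to right, and return
/// the first split whose name part matches the project page.
fn split_with_hint<'a>(project: &str, filename: &'a str) -> Option<(&'a str, &'a str)> {
    let stem = filename.strip_suffix(".tar.gz")?;
    let wanted = normalize(project);
    stem.match_indices('-')
        .map(|(i, _)| (&stem[..i], &stem[i + 1..]))
        .find(|(name, _)| normalize(name) == wanted)
}
```

On the cffi page this would split `cffi-1.0.2-2.tar.gz` into `cffi` and `1.0.2-2`, and on the scikit-learn page it would still split `scikit-learn-1.0.2.tar.gz` into `scikit-learn` and `1.0.2`.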

encukou commented 1 year ago

See PEP 625. The sdist filename was standardized 2 years ago, so you can parse it. A compliant filename contains exactly one dash, since the name and version are both normalized (dashes in the name become underscores). There are stragglers, and historical releases won't be fixed, but a new tool should be OK with simply ignoring those -- though it does need to detect them. Apparently the overwhelming majority of legacy filenames contain multiple dashes, so detecting that could be good enough.
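A minimal sketch of that detection, assuming only the "exactly one dash" rule and the `.tar.gz` extension that PEP 625 requires. The `SdistName` type and `classify_sdist` function are hypothetical, and this does not check that the name and version are actually in normalized form.

```rust
#[derive(Debug, PartialEq)]
enum SdistName<'a> {
    /// Exactly one dash: `{name}-{version}.tar.gz`.
    Pep625 { name: &'a str, version: &'a str },
    /// Zero dashes or more than one: treat as legacy and skip it.
    NonCompliant,
}

/// Returns None if the file is not a `.tar.gz` at all.
fn classify_sdist(filename: &str) -> Option<SdistName<'_>> {
    let stem = filename.strip_suffix(".tar.gz")?;
    let mut parts = stem.split('-');
    // Pull at most three pieces; a third piece means there was a second dash.
    match (parts.next(), parts.next(), parts.next()) {
        (Some(name), Some(version), None) if !name.is_empty() && !version.is_empty() => {
            Some(SdistName::Pep625 { name, version })
        }
        _ => Some(SdistName::NonCompliant),
    }
}
```

Under this rule `cffi-1.0.2-2.tar.gz` is flagged as non-compliant, but so is the legacy spelling `scikit-learn-1.0.2.tar.gz`; only `scikit_learn-1.0.2.tar.gz` would be accepted.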

encukou commented 1 year ago

Edit: it's PEP 625 -- just in case you're reading the mail notification.

njsmith commented 1 year ago

Ah, yeah, that's another option -- skipping any sdist name with multiple dashes. I was assuming that we couldn't drop compat with old non-compliant artifacts, but maybe we could get away with it.