hugovk / top-pypi-packages

A regular dump of the most-downloaded packages from PyPI
https://hugovk.github.io/top-pypi-packages

Use non-lowercased project names #4

Closed jayvdb closed 4 years ago

jayvdb commented 5 years ago

All project names are lower case and do not match the names shown on pypi.org, e.g. pyyaml instead of PyYAML. I suspect that may be the data this project has, in which case the problem is upstream.

That lowercasing is not very helpful: the names of projects can (and do) change over time in all sorts of ways, not just in case.

Applying lowercase can be done after the fact - it is a simple transform, but it is not reversible without the post-processing of all entries as suggested in the followup comments on https://github.com/hugovk/top-pypi-packages/issues/1

My use case is that I need to match the list up with openSUSE package names, which must use the PyPI package name exactly, including casing and hyphen-vs-dash. The task is slightly more difficult and slower if I don't have the exact name to begin with.

If it can't be obtained from the source data, it is likely quicker for me to add post-processing to get the real name, rather than try to get exact results from case-insensitive openSUSE package searches.
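The comment does not say how that post-processing would work. One plausible approach, assumed here, is to ask the PyPI JSON API for each project and read back the name as PyPI displays it. The `fetch` parameter is a hypothetical hook added so the lookup can be exercised without network access.

```python
import json
from urllib.request import urlopen


def fetch_pypi_json(project):
    """Fetch project metadata from the PyPI JSON API (needs network access)."""
    with urlopen(f"https://pypi.org/pypi/{project}/json") as response:
        return json.load(response)


def display_name(project, fetch=fetch_pypi_json):
    """Return the project name as recorded in the metadata's info.name field."""
    return fetch(project)["info"]["name"]
```

This costs one HTTP request per project, so for a list of several thousand names it would be worth caching the results between runs.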

hugovk commented 5 years ago

This repo doesn't alter the names, it dumps the result from pypinfo:

/usr/local/bin/pypinfo --json --indent 0 --limit 5000 --days 30 "" project > top-pypi-packages-30-days.json

Having a quick look in pypinfo, it's not changing the name of projects received from the Google BigQuery client.


pypinfo does have this:

def normalize(name):
    """https://www.python.org/dev/peps/pep-0503/#normalized-names"""
    return re.sub(r'[-_.]+', '-', name).lower()

But that's only used for normalising the input when wanting info about a single project, and is blank in this case.

https://www.python.org/dev/peps/pep-0503/#normalized-names says:

This PEP references the concept of a "normalized" project name. As per PEP 426 the only valid characters in a name are the ASCII alphabet, ASCII numbers, ., -, and _. The name should be lowercased with all runs of the characters ., -, or _ replaced with a single - character. This can be implemented in Python with the re module:

(And then gives the same function.)
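To illustrate why the normalised form cannot be reversed, here is a small sketch that applies the PEP 503 function to a few spellings. The sample names are only illustrative.

```python
import re


def normalize(name):
    """PEP 503 normalisation: collapse runs of '-', '_', '.' and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Several distinct spellings collapse to one normalised name, so the
# display name cannot be recovered from the normalised one alone.
spellings = ["PyYAML", "pyyaml", "PyYaml"]
normalized = {normalize(s) for s in spellings}

# Runs of '.', '-' and '_' also collapse to a single '-'.
separators = normalize("zope.interface")
```

All three spellings end up under the same key, which is why the original casing has to come from another source.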


I didn't check whether Google BigQuery can also return the un-normalised name. If it can, that would need a change to pypinfo before being added here.

If that's not possible or easy, then I'd be fine adding extra data here. Rather than post-processing, I think a second JSON file would be better.


Or are the openSUSE package names identical to the PyPI names (eg. PyYAML)?

If so, can you normalise PyYAML into pyyaml and then use the data here?

jayvdb commented 5 years ago

Or are the openSUSE package names identical to the PyPI names (eg. PyYAML)?

Yes, with a python- prefix.

https://build.opensuse.org/package/show/openSUSE:Factory/python-PyYAML

I would prefer to be using this data first, and looking up against openSUSE, rather than the other way around, or building a database of both and cross referencing.
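For comparison, the cross-referencing route suggested earlier (normalise the openSUSE names, then join on the normalised key) could look roughly like this. The package lists are hypothetical samples, and `index_by_normalized` is an illustrative helper, not part of any existing tool.

```python
import re


def normalize(name):
    """PEP 503 normalisation."""
    return re.sub(r"[-_.]+", "-", name).lower()


def index_by_normalized(opensuse_packages, prefix="python-"):
    """Map normalised PyPI name -> openSUSE package name."""
    index = {}
    for package in opensuse_packages:
        if package.startswith(prefix):
            index[normalize(package[len(prefix):])] = package
    return index


# Hypothetical samples: openSUSE package names and names from the dump.
opensuse_packages = ["python-PyYAML", "python-requests"]
top_projects = ["pyyaml", "requests", "numpy"]

index = index_by_normalized(opensuse_packages)
matches = {project: index.get(normalize(project)) for project in top_projects}
```

Projects with no openSUSE counterpart map to `None`. This needs the full openSUSE package list up front, which is the database-building step the comment above would rather avoid.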

I'll see what is happening inside pypinfo

jayvdb commented 5 years ago

The schema is at https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema , and both url and file.filename contain the proper project name. I have got them working with ad hoc queries, so now I just need to propose a PR to pypinfo to use the filename. It might be slightly slower, depending on whether BigQuery supports some more advanced SQL join syntax, and possibly even using https://bigquery.cloud.google.com/table/the-psf:pypi.simple_requests instead.
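The queries themselves are not shown in this comment. As a rough Python analogue of the idea, the project name can be guessed from a distribution filename like this. `project_from_filename` is a hypothetical helper, and the sketch assumes conventional sdist and wheel naming.

```python
def project_from_filename(filename):
    """Best-effort guess of the project name embedded in a distribution filename."""
    if filename.endswith(".whl"):
        # Wheel: {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl
        # The distribution field contains no '-', so the first field is the name.
        return filename.split("-", 1)[0]
    for ext in (".tar.gz", ".tar.bz2", ".tgz", ".zip"):
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            # sdist: {name}-{version}{ext}; the name itself may contain '-'.
            return stem.rsplit("-", 1)[0]
    # Other formats (eggs, installers) are not handled in this sketch.
    return None
```

One caveat: wheel filenames replace `-` in the project name with `_`, so a wheel alone does not always give the exact display name. An sdist filename is usually the better source.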

jayvdb commented 5 years ago

Other tools using BigQuery that might be usable, especially as some post-process the results to get more info from PyPI:

https://github.com/cclauss/python3wos_asyncio
https://github.com/ubershmekel/python3wos
https://github.com/mara/bigquery-downloader
https://github.com/capicue/ncf/blob/master/packages/get-descriptions.py
https://github.com/fmenabe/pypi-stats
https://github.com/psincraian/pepy
https://github.com/ehfeng/installstats
https://github.com/datawrestler/lametric-pypi
https://github.com/OzymandiasTheGreat/pypes
https://github.com/rth/pypi-stats-viz
https://github.com/okfn/measure
https://github.com/crflynn/pypistats.org
https://github.com/RootLUG/aura
https://github.com/jantman/pypi-download-stats
https://github.com/scikit-hep/scikit-hep-orgstats
https://github.com/di/pyreadiness

hugovk commented 5 years ago

Sounds good! One concern is the amount of BigQuery quota used, to ensure two requests can be made each week with the free quota. Hopefully it won't increase the amount used too much, but it would be nice to see the difference.

pypinfo reports how big each query is; you can see it in the JSON here.
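A sketch of how the difference could be measured, assuming the dump carries a top-level "query" object with fields such as "bytes_processed" and "bytes_billed" (an assumption about the file layout; check it against a real dump). The two inline documents are placeholders, not real measurements.

```python
import json

# Hypothetical stand-ins for a dump made before and after the query change.
before = json.loads('{"query": {"bytes_processed": 1000, "bytes_billed": 1024}, "rows": []}')
after = json.loads('{"query": {"bytes_processed": 1500, "bytes_billed": 2048}, "rows": []}')


def query_growth(old, new, key="bytes_processed"):
    """Return the ratio new/old for one field of the "query" metadata."""
    return new["query"][key] / old["query"][key]


growth = query_growth(before, after)
```

A ratio well above 1 would be the signal to check whether two runs a week still fit in the free quota.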

hugovk commented 5 years ago

Good list! (I need to make a list of things using this data, too.)

Of those, https://github.com/psincraian/pepy and https://github.com/crflynn/pypistats.org are websites which essentially cache BigQuery data.

The latter is especially good and has an API, for which I've written a CLI client:

https://pypistats.org/api/ https://github.com/hugovk/pypistats

The data is limited to 6 months, and neither pepy nor pypistats.org has the specific mapping we're talking about. But maybe they could?

jayvdb commented 5 years ago

One concern is the amount of BigQuery quota used, to ensure two requests can be made each week with the free quota.

It shouldn't be extra queries, just slightly slower queries, assuming the SQL engine is halfway decent.

Based on your recommendation, I've created issues in both of those projects to see which, if any, have an interest.

You'll be interested to learn that pepy is growing an API https://github.com/psincraian/pepy/commit/b3cf4eead51d78e7594cf76757cc2aeb4c8b1e49

jayvdb commented 5 years ago

Now that I have the SQL changes needed (see queries at https://github.com/psincraian/pepy/issues/128#issuecomment-491665411), I've also created an issue at https://github.com/ofek/pypinfo/issues/73 before making the change there.