pypi / legacy

This PyPI is no more! See https://github.com/pypa/warehouse.

(Question) Available clone of PyPI's database? #471

Closed AlexandreDecan closed 8 years ago

AlexandreDecan commented 8 years ago

Hello there,

I'm a researcher at the University of Mons, in Belgium. Similar to what we did some months ago for the R ecosystem (involving CRAN), we plan to study the Python ecosystem (involving PyPI). If interested, see here for the official publications and, unofficially, here for the PDF.

To do that, we need (among other things) to download a copy of the metadata for all versions of all packages (and probably a copy of all versions of all packages themselves).

Is there a way I can get all that data without making a huge number of requests against the web server (through RPC)? I saw there is a "testable PyPI", but is it both reliable and reasonably up to date? If not, could you please tell me the suggested "rate limit" to avoid disturbing the server?

Thank you!

PS: I didn't manage to find a "better" way to contact the PyPI team. I hope you don't mind this bug request!

ewdurbin commented 8 years ago

@AlexandreDecan it looks like your links to the publications did not survive formatting.

TestPyPI is the staging area for new changes to PyPI, so its code is always up to date with, or ahead of, production.

We don't have anything to offer as far as a single downloadable asset for metadata or packages. Everything you've requested is available by stringing together a couple of components:

Obtain a current list of packages:

>>> import xmlrpclib
>>> client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')
>>> client.list_packages()
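
The session above uses Python 2's `xmlrpclib`. In Python 3 the same module lives at `xmlrpc.client`. A minimal sketch of the equivalent call follows, assuming the endpoint from this thread still answers `list_packages()`, which may no longer hold. `make_client` and `list_package_names` are illustrative names, not part of any PyPI tooling.

```python
import xmlrpc.client

# Endpoint copied from the thread; assumption: it may redirect or be retired today.
PYPI_XMLRPC_URL = "https://pypi.python.org/pypi"


def make_client(url=PYPI_XMLRPC_URL):
    """Build an XML-RPC proxy. No network traffic happens until a method is called."""
    return xmlrpc.client.ServerProxy(url)


def list_package_names(client):
    """Return the package names reported by the client, sorted alphabetically."""
    return sorted(client.list_packages())


if __name__ == "__main__":
    # This performs a real XML-RPC request when run as a script.
    for name in list_package_names(make_client()):
        print(name)
```

Passing the client in as an argument keeps the listing logic separate from the network call, so it can be exercised with a stand-in object.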

Obtain the release information for a package:

curl https://pypi.python.org/pypi/requests/json
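
The same request can be made from Python. This is only a sketch: it assumes the response is a JSON object with a `releases` mapping keyed by version string, which the thread itself does not spell out. `release_url`, `versions_from_payload`, and `fetch_release_info` are hypothetical helper names.

```python
import json
import urllib.request

# Base URL copied from the curl command above.
BASE_URL = "https://pypi.python.org/pypi"


def release_url(package, base=BASE_URL):
    """Build the JSON URL for one package, mirroring the curl command."""
    return "{}/{}/json".format(base, package)


def versions_from_payload(payload):
    """Return version strings from the assumed "releases" mapping.

    The sort is plain string order, not version order.
    """
    return sorted(payload.get("releases", {}))


def fetch_release_info(package):
    """Download and decode the JSON document for one package (network call)."""
    with urllib.request.urlopen(release_url(package)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(versions_from_payload(fetch_release_info("requests")))
```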

Putting it together, you could do something like:

$ python -c "import xmlrpclib; client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi'); print '\n'.join(client.list_packages())" | xargs -I{} bash -c "curl -s https://pypi.python.org/pypi/{}/json > {}.json"
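
The one-liner above issues its requests back to back. The thread does not state a rate limit, so the sketch below simply pauses between requests; the one-second default is a placeholder, not a documented limit. `save_metadata` is a hypothetical helper that accepts any `fetch(name)` callable returning a JSON-serializable object.

```python
import json
import time
from pathlib import Path


def save_metadata(names, fetch, out_dir, delay_seconds=1.0, sleep=time.sleep):
    """Write one <name>.json file per package, pausing between fetches.

    delay_seconds is an assumed, conservative pause, not an official limit.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, name in enumerate(names):
        if index:
            # Pause before every request except the first.
            sleep(delay_seconds)
        target = out / (name + ".json")
        target.write_text(json.dumps(fetch(name)), encoding="utf-8")
        written.append(target)
    return written
```

Here `fetch` could be, for example, a function that downloads and parses `https://pypi.python.org/pypi/<name>/json`. Injecting `fetch` and `sleep` keeps the loop easy to try out without touching the server.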

AlexandreDecan commented 8 years ago

Thank you for the answer. I'll have a look at bandersnatch. I didn't know about it and, from what I've seen, it will fit my needs ;-)

Sorry for the broken ("missing" is maybe the right word ;-) links. I copied my message from Bitbucket after seeing that the issues were migrated here.

Here is the official publication, and here is a downloadable PDF of the preprint.

ewdurbin commented 8 years ago

appreciate the links. look forward to hearing about the results/findings of your research.

AlexandreDecan commented 8 years ago

I'll try to provide feedback ASAP, but do not expect anything in the next few months ;)