Check out the xmlrpc API: https://wiki.python.org/moin/PyPIXmlRpc
I use the API here https://github.com/FRidh/make-pypi-dump/blob/1c158f3dda1848c7020f6c0fd4839e32530db5cc/make-pypi-dump#L33
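For reference, a minimal sketch of talking to that endpoint with Python's standard xmlrpc.client (the endpoint URL and the changelog method are the ones documented on the wiki page and used in the script linked above; this is only an illustration, not the script itself):

import time
import xmlrpc.client

# Legacy XML-RPC endpoint documented on the wiki page linked above.
client = xmlrpc.client.ServerProxy('https://pypi.org/pypi')

# Entries changed in the last hour, each as [name, version, timestamp, action].
for name, version, timestamp, action in client.changelog(int(time.time()) - 3600):
    print(name, version, action)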
It says it's deprecated:
"The XMLRPC interface for PyPI is considered legacy and should not be used"
and there's no way to get at least names+versions for all packages with a single request. Scraping all 100k+ packages is not an option.
Yet they did implement the methods on the new Warehouse. I wouldn't worry about that, at least not for now. They definitely won't remove that without having an alternative.
Still it's useless without a way to get all data with a single request.
Scraping all 100k+ packages is not an option.
You could scrape only what has changed, e.g. like I did with the script I referred to. The following link contains a dump of all the JSON data, updated daily. It uses the xmlrpc API to determine which packages have changed and performs requests only for those.
If you're interested, I could update the Travis job to create a file listing all packages along with their latest version.
Ok, let me experiment a bit.
This is the link I meant to include: https://github.com/FRidh/pypi-dump
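The gist of that approach, as a rough sketch (the per-project JSON endpoint https://pypi.org/pypi/<name>/json is PyPI's standard JSON API; error handling for removed projects is omitted):

import json
import time
import urllib.request
import xmlrpc.client

client = xmlrpc.client.ServerProxy('https://pypi.org/pypi')

# Ask the XML-RPC changelog which packages changed in the last 24 hours...
changed = {entry[0] for entry in client.changelog(int(time.time()) - 86400)}

# ...then fetch the per-project JSON only for those packages.
for name in sorted(changed):
    with urllib.request.urlopen('https://pypi.org/pypi/%s/json' % name) as resp:
        meta = json.load(resp)
    print(name, meta['info']['version'])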
PyPI support would be great!
Python packages typically get updated first on PyPI. If the information from PyPI isn't available on Repology, the other packages are displayed as "up-to-date" (in green) even though a newer PyPI release is already available.
I think this is a/the relevant issue: https://github.com/pypa/warehouse/issues/347
Would the API for https://libraries.io/pypi help, as suggested by https://github.com/pypa/warehouse/issues/347#issuecomment-373251938?
Thanks for the pointer, I'll investigate this. I already don't like that it requires registration though.
Any updates on this?
Unfortunately, libraries.io is not suitable - I've commented in the Warehouse issue.
There's still no way to fetch information on all PyPI packages.
And I've investigated the method suggested by @FRidh - while inconvenient (it needs a lot of time to bootstrap, plus persistent storage) and incomplete (it only supplies versions), it could work as a temporary solution: use the list_packages xmlrpc method to get all package names, scrape them with package_releases one by one on the first run, then only use changelog to get updates (see the sketch after the examples below). But the problem is that the changelog method doesn't differentiate stable and development releases. So the changelog will supply data like
>>> [i for i in client.changelog(1528310115-3600) if i[0] == 'toil-vg']
[['toil-vg', '1.4.1a1.dev1044', 1528309044, 'new release'], ['toil-vg', '1.4.1a1.dev1044', 1528309044, 'add source file toil-vg-1.4.1a1.dev1044.tar.gz'], ['toil-vg', '1.4.1a1.dev1044', 1528309045, 'add 2.7 file toil_vg-1.4.1a1.dev1044-py2.7.egg']]
while the stable version is like
>>> client.package_releases('toil-vg')
['1.2.0']
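For the record, the temporary scheme described above would look roughly like this (a sketch only, using the list_packages, package_releases and changelog methods mentioned above; the stable vs. development problem shown for toil-vg is left unsolved here):

import time
import xmlrpc.client

client = xmlrpc.client.ServerProxy('https://pypi.org/pypi')
versions = {}

def refresh(name):
    # package_releases() returned just ['1.2.0'] for toil-vg above,
    # so take the first entry as the current version.
    releases = client.package_releases(name)
    if releases:
        versions[name] = releases[0]

# First run: bootstrap with one request per package (slow, 100k+ requests).
for name in client.list_packages():
    refresh(name)

# Subsequent runs: only re-check packages that changelog() reports as changed.
since = int(time.time()) - 3600
for name, version, timestamp, action in client.changelog(since):
    refresh(name)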
@AMDmi3 There may yet be a workable solution for getting at least a list of all package names: scraping the HTML returned by https://pypi.python.org/simple/ (link suggested by this comment on https://python-forum.io), which appears to contain a list (with links to details) of all available packages. If anything, retrieving the list and spidering the links it contains is fast, which makes me wonder whether the whole thing is a static website, regenerated on a schedule or something (a rough parsing sketch follows the output below). On my machine (with a GbE internet connection):
$ time curl 'https://pypi.org/simple/' -o out.html
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9478k 100 9478k 0 0 30.4M 0 --:--:-- --:--:-- --:--:-- 30.3M
real 0m0.315s
user 0m0.048s
sys 0m0.045s
Number of lines and sample snippets of the content:
$ wc -l out.html
181017 out.html
$ head out.html
<!DOCTYPE html>
<html>
<head>
<title>Simple index</title>
</head>
<body>
<a href="/simple/0/">0</a>
<a href="/simple/0-0/">0-._.-._.-._.-._.-._.-._.-0</a>
<a href="/simple/0-0-1/">0.0.1</a>
<a href="/simple/00print-lol/">00print_lol</a>
$ tail out.html
<a href="/simple/zzr/">zzr</a>
<a href="/simple/zzyzx/">zzyzx</a>
<a href="/simple/zzz/">zzz</a>
<a href="/simple/zzzeeksphinx/">zzzeeksphinx</a>
<a href="/simple/zzzfs/">zzzfs</a>
<a href="/simple/zzzutils/">zzzutils</a>
<a href="/simple/zzz-web/">zzz-web</a>
<a href="/simple/zzzzzzzzz/">zzzZZZzzz</a>
</body>
</html>
$ time curl 'https://pypi.org/simple/00print-lol/' -o 00print-lol.html
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 672 100 672 0 0 18162 0 --:--:-- --:--:-- --:--:-- 18666
real 0m0.047s
user 0m0.018s
sys 0m0.005s
$ cat 00print-lol.html
<!DOCTYPE html>
<html>
<head>
<title>Links for 00print_lol</title>
</head>
<body>
<h1>Links for 00print_lol</h1>
<a href="https://files.pythonhosted.org/packages/28/77/b367493f392d23b5e91220a92ec87aa94ca0ef4ee82b7baacc13ca48c585/00print_lol-1.0.0.tar.gz#sha256=03a146dc09b0076f2e82d39563a5b8ba93c64536609d9806be7b5b3ea87a4162">00print_lol-1.0.0.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/c6/ab/4a317ae0d0c7c911f1c77719c553fc46a12d981899ceb5d47220fc3d535c/00print_lol-1.1.0.tar.gz#sha256=c452b0cc78f3a5edecbc6d160d2fa14c012d78403b0206558bcf1444eb5d1e2e">00print_lol-1.1.0.tar.gz</a><br/>
</body>
</html>
<!--SERIAL 4405030-->
Alas, no distinguishing of stable and development versions here either, but still I hope this can be of some use, at least for bootstrapping and filling the bulk of the database. I have no idea whether, and for how long, this feature will remain available. I do think it is a good workaround, and hope to see full PyPI support in Repology return soon :-)
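A rough sketch of turning those two pages into data with only the standard library (the HTML-parsing helper and the naive filename-to-version split are my own illustration; the split is exactly the unreliable part discussed further down):

import urllib.request
from html.parser import HTMLParser

class LinkText(HTMLParser):
    # Collects the text content of every <a> element on a simple-index page.
    def __init__(self):
        super().__init__()
        self.in_a = False
        self.links = []
    def handle_starttag(self, tag, attrs):
        self.in_a = (tag == 'a')
    def handle_endtag(self, tag):
        self.in_a = False
    def handle_data(self, data):
        if self.in_a:
            self.links.append(data.strip())

def links(url):
    parser = LinkText()
    with urllib.request.urlopen(url) as resp:
        parser.feed(resp.read().decode('utf-8'))
    return parser.links

# All project names, e.g. ['0', '0-._.-._...', '0.0.1', '00print_lol', ...]
names = links('https://pypi.org/simple/')

# Versions for one project, naively split out of sdist filenames like
# '00print_lol-1.0.0.tar.gz' -> '1.0.0' (wheels and other formats ignored).
versions = [f[:-len('.tar.gz')].rsplit('-', 1)[1]
            for f in links('https://pypi.org/simple/00print-lol/')
            if f.endswith('.tar.gz')]
print(versions)  # ['1.0.0', '1.1.0'] per the curl output above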
@ackalker unfortunately, none of this is usable.
That's too bad. Could you please explain what is missing with this approach? The obvious lack of metadata besides package version numbers is the same as with most other methods. I would guess it is at least usable for bootstrapping (getting the initial list of package names and available versions). Given its speedy transfer, the list page could even be useful for quickly spotting package additions and removals.
Thanks for the explanation. Please see pypa/warehouse#2912 for a more recent upstream discussion of a JSON API.
That issue was closed a year ago. And I doubt an API would be of any help, as APIs usually do not provide bulk data access (and, as mentioned before, doing a lot of requests is not acceptable). Repology needs a dump.
https://packaging.python.org/guides/analyzing-pypi-package-downloads/ - can that help? BigQuery most likely requires registration and there are some limits.
It needs a Google account. This is not acceptable.
The data is soon accessible under the the-psf.pypi.distribution_metadata public dataset on BigQuery.
https://github.com/pypa/warehouse/issues/7403#issuecomment-663131927
It needs a Google account. This is not acceptable.
So you need a JSON export? Would daily be OK? I fear the traffic alone would be a problem for hourly.
Since there's currently no PyPI support at all, any update frequency would be OK.
@AMDmi3 would it be acceptable for you to use Kaggle? You need an account to download datasets from there, but you can create one with an e-mail address and password. It's owned by Google.
No, Google crap which requires registration would definitely not be acceptable.
What is the size of the index?
If it is too big to be hosted on CDN, then maybe with JSON API it is possible to update once and then subscribe to updates?
Version information has to be extracted from file names; I don't think that's reliable.
Just wanted to hop in to say -- it is. pip also relies on this information and shouts at the user if things don't match.
no distinguishing of stable and development versions here either
There is -- https://www.python.org/dev/peps/pep-0440/ specifies how "development" versions are different from "stable" versions.
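So, assuming the packaging library (the reference implementation of PEP 440), filtering out development and pre-release versions could look like this; the toil-vg version strings are the ones from the changelog output earlier in this thread:

from packaging.version import Version, InvalidVersion

def latest_stable(version_strings):
    # Keep only versions that parse under PEP 440 and are neither
    # pre-releases (1.4.1a1) nor developmental releases (.dev1044).
    stable = []
    for s in version_strings:
        try:
            v = Version(s)
        except InvalidVersion:
            continue
        if not v.is_prerelease and not v.is_devrelease:
            stable.append(v)
    return max(stable) if stable else None

# Version strings seen for toil-vg earlier in this thread:
print(latest_stable(['1.4.1a1.dev1044', '1.2.0']))  # -> 1.2.0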
Closing, as there's nothing to do here on the Repology side until PyPI or someone else publishes a usable index.
PyPI is not a person and cannot do the steps above.
PyPI is maintained by people, and it's their job to make it accessible, especially after they broke it by removing the index. I'm definitely not doing it for them.
@AMDmi3 I am sure that nobody works at PyPI.
I've had to do PyPI developers' work after all, as PyPI data is too important for Repology users.
I've set up a dedicated service at https://pypicache.repology.org/ which talks to PyPI via APIs and updates metadata on packages which were recently changed. I've also populated it with PyPI modules already present on Repology so it should be fairly complete for the task.
@AMDmi3 looks awesome. )
An endpoint to get information on an individual project - not suitable, as it requires thousands of HTTP requests to fetch data on all packages.
I am afraid https://release-monitoring.org/ does just that. :D @Zlopez can tell more.
Yes, we are doing this. We are trying to check each project once an hour. And as @AMDmi3 says, there are thousands of HTTP requests each hour.
You can check how many projects were checked in the last run at the bottom of the https://release-monitoring.org/ page.
@Zlopez there is no release-monitoring API to get all data on PyPI projects for comparison, right?
There is: https://anitya.readthedocs.io/en/stable/api.html#http-api-v2 - you just need to specify the ecosystem for the /api/v2/projects/ GET request.
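For reference, a hedged sketch of paging through that endpoint (the ecosystem parameter is from the linked docs; the items response field and the items_per_page/page query parameters are my assumptions about Anitya's paginated format):

import json
import urllib.request

def anitya_projects(ecosystem='pypi', per_page=250):
    # Assumed response shape: {"items": [...], "total_items": N, ...}
    page = 1
    while True:
        url = ('https://release-monitoring.org/api/v2/projects/'
               '?ecosystem=%s&items_per_page=%d&page=%d'
               % (ecosystem, per_page, page))
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        if not data.get('items'):
            break
        yield from data['items']
        page += 1

print(sum(1 for _ in anitya_projects()))  # roughly the 91754 reported below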
API query returned 91754. https://pypicache.repology.org/ reports 26796 packages.
I am afraid https://release-monitoring.org/ does just that
It's not really suitable for Repology.
We are trying to check each project once an hour
I don't think there's a point in that, since PyPI has an xmlrpc endpoint which returns projects changed since a given timestamp and allows one to recheck only updated projects, with a much smaller lag and without skipping any updates. I still have to confirm that there haven't been any lost updates, though.
API query returned 91754. https://pypicache.repology.org/ reports 26796 packages.
As mentioned, these 26k fully cover Repology's needs. If there's demand, it can pull in all 274k PyPI packages.
Interesting coincidence that on the same day Google BigQuery went offline, PyPI XML-RPC started to experience overload. But the Google incident was reported at 12:07 UTC and the PyPI report is dated 09:41 UTC.
Interesting that the Atlassian Statuspage used by https://status.python.org/ doesn't highlight the incident in the log in any way.
Interesting coincidence that on the same day Google BigQuery went offline, PyPI XML-RPC started to experience overload. But the Google incident was reported at 12:07 UTC and the PyPI report is dated 09:41 UTC.
This seems like the wrong place to report this issue.
I was just curious to check whether the reason for the PyPI XML-RPC overload was people who had lost access to BigQuery and needed an alternative way to gather the same stats.
PyPI removed their all-packages index (https://pypi.python.org/pypi/), so it's no longer parsed. We need to find a replacement.