repology / repology-updater

Repology backend service to update repository and package data
https://repology.org
GNU General Public License v3.0

Fix PyPi support #278

Closed · AMDmi3 closed this issue 4 years ago

AMDmi3 commented 7 years ago

PyPI removed their all-packages index (https://pypi.python.org/pypi/), so it's no longer parsed. We need to find a replacement.

FRidh commented 7 years ago

Check out the xmlrpc API: https://wiki.python.org/moin/PyPIXmlRpc

I use the API here https://github.com/FRidh/make-pypi-dump/blob/1c158f3dda1848c7020f6c0fd4839e32530db5cc/make-pypi-dump#L33

AMDmi3 commented 7 years ago

It says it's deprecated:

The XMLRPC interface for PyPI is considered legacy and should not be used

and there's no way to get at least names+versions for all packages with a single request. Scraping all 100k+ packages is not an option.

FRidh commented 7 years ago

Yet they did implement the methods on the new Warehouse. I wouldn't worry about that, at least not for now. They definitely won't remove it without having an alternative.

AMDmi3 commented 7 years ago

Still, it's useless without a way to get all data with a single request.

FRidh commented 7 years ago

Scraping all 100k+ packages is not an option.

You could scrape only what has changed, e.g. like I did with the script I referred to. The following link contains a dump of all the JSON, updated daily. It uses the XMLRPC API to determine which packages were changed, and performs requests only for those.

If you're interested, I could update the Travis job to create a file listing all packages along with their latest version.

AMDmi3 commented 7 years ago

Ok, let me experiment a bit.

FRidh commented 7 years ago

This is the link I meant to include https://github.com/FRidh/pypi-dump

mgeier commented 7 years ago

PyPI support would be great!

Python packages typically get updated first on PyPI. If the information from PyPI isn't available on Repology, packages from other repositories are displayed as "up-to-date" (in green) even though a newer PyPI release is already available.

mgeier commented 6 years ago

I think this is a/the relevant issue: https://github.com/pypa/warehouse/issues/347

mgeier commented 6 years ago

Would the API for https://libraries.io/pypi help, as suggested by https://github.com/pypa/warehouse/issues/347#issuecomment-373251938?

AMDmi3 commented 6 years ago

Thanks for the pointer, I'll investigate this. I don't like that it requires registration, though.

nemani commented 6 years ago

Any updates on this?

AMDmi3 commented 6 years ago

Unfortunately, libraries.io is not suitable; I've commented in the warehouse issue.

There's still no way to fetch information on all PyPI packages.

AMDmi3 commented 6 years ago

And I've investigated the method suggested by @FRidh. While inconvenient (it needs a lot of time to bootstrap, plus persistent storage) and incomplete (it only supplies versions), it could work as a temporary solution: use the list_packages XMLRPC method to get all package names, scrape them one by one with package_releases on the first run, then use only changelog to pick up updates. The problem is that the changelog method doesn't differentiate stable and development releases. So changelog will supply data like

>>> [i for i in client.changelog(1528310115-3600) if i[0] == 'toil-vg']
[['toil-vg', '1.4.1a1.dev1044', 1528309044, 'new release'], ['toil-vg', '1.4.1a1.dev1044', 1528309044, 'add source file toil-vg-1.4.1a1.dev1044.tar.gz'], ['toil-vg', '1.4.1a1.dev1044', 1528309045, 'add 2.7 file toil_vg-1.4.1a1.dev1044-py2.7.egg']]

while the stable version is like

>>> client.package_releases('toil-vg')
['1.2.0']
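
A minimal sketch of that bootstrap-then-incremental scheme, using the same legacy XMLRPC methods shown above (the hourly loop and the in-memory dict standing in for persistent storage are illustrative assumptions):

import time
import xmlrpc.client

client = xmlrpc.client.ServerProxy('https://pypi.org/pypi')

# Bootstrap: one request per package, 100k+ requests in total.
versions = {name: client.package_releases(name)
            for name in client.list_packages()}

# Incremental updates: re-fetch only packages seen in the changelog.
since = int(time.time())
while True:
    time.sleep(3600)
    now = int(time.time())
    changed = {entry[0] for entry in client.changelog(since)}
    since = now
    for name in changed:
        # Still includes development releases -- the problem shown above.
        versions[name] = client.package_releases(name)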

ackalker commented 5 years ago

@AMDmi3 There may yet be a workable solution for getting at least a list of all package names: scraping the HTML returned by https://pypi.python.org/simple/ (link suggested by this comment on https://python-forum.io), which appears to contain a list of all available packages, with links to per-package detail pages. Retrieving the list and spidering the links it contains is fast, which makes me wonder whether the whole thing is a static website regenerated on a schedule. On my machine (with a GbE internet connection):

$ time curl 'https://pypi.org/simple/' -o out.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9478k  100 9478k    0     0  30.4M      0 --:--:-- --:--:-- --:--:-- 30.3M

real    0m0.315s
user    0m0.048s
sys 0m0.045s

Number of lines and sample snippets of the content:

$ wc -l out.html 
181017 out.html
$ head out.html 
<!DOCTYPE html>
<html>
  <head>
    <title>Simple index</title>
  </head>
  <body>
    <a href="/simple/0/">0</a>
    <a href="/simple/0-0/">0-._.-._.-._.-._.-._.-._.-0</a>
    <a href="/simple/0-0-1/">0.0.1</a>
    <a href="/simple/00print-lol/">00print_lol</a>
$ tail out.html 
    <a href="/simple/zzr/">zzr</a>
    <a href="/simple/zzyzx/">zzyzx</a>
    <a href="/simple/zzz/">zzz</a>
    <a href="/simple/zzzeeksphinx/">zzzeeksphinx</a>
    <a href="/simple/zzzfs/">zzzfs</a>
    <a href="/simple/zzzutils/">zzzutils</a>
    <a href="/simple/zzz-web/">zzz-web</a>
    <a href="/simple/zzzzzzzzz/">zzzZZZzzz</a>
    </body>
</html>
$ time curl 'https://pypi.org/simple/00print-lol/' -o 00print-lol.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   672  100   672    0     0  18162      0 --:--:-- --:--:-- --:--:-- 18666

real    0m0.047s
user    0m0.018s
sys 0m0.005s
$ cat 00print-lol.html 
<!DOCTYPE html>
<html>
  <head>
    <title>Links for 00print_lol</title>
  </head>
  <body>
    <h1>Links for 00print_lol</h1>
    <a href="https://files.pythonhosted.org/packages/28/77/b367493f392d23b5e91220a92ec87aa94ca0ef4ee82b7baacc13ca48c585/00print_lol-1.0.0.tar.gz#sha256=03a146dc09b0076f2e82d39563a5b8ba93c64536609d9806be7b5b3ea87a4162">00print_lol-1.0.0.tar.gz</a><br/>
    <a href="https://files.pythonhosted.org/packages/c6/ab/4a317ae0d0c7c911f1c77719c553fc46a12d981899ceb5d47220fc3d535c/00print_lol-1.1.0.tar.gz#sha256=c452b0cc78f3a5edecbc6d160d2fa14c012d78403b0206558bcf1444eb5d1e2e">00print_lol-1.1.0.tar.gz</a><br/>
    </body>
</html>
<!--SERIAL 4405030-->

Alas, there's no distinguishing of stable and development versions here either, but I still hope this can be of some use, at least for bootstrapping and filling the bulk of the database. I have no idea whether, or for how long, this feature will remain available. I do think it is a good workaround, and I hope to see full PyPI support in Repology return soon :-)
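
For what it's worth, here is a rough standard-library sketch of turning that index page into a package list (the class name is mine; the per-package pages would still need version-from-filename parsing):

from html.parser import HTMLParser
from urllib.request import urlopen

class SimpleIndexParser(HTMLParser):
    # Collects the text of every <a> element; on /simple/ that text is
    # the package name.
    def __init__(self):
        super().__init__()
        self.packages = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        self._in_link = tag == 'a'

    def handle_data(self, data):
        if self._in_link:
            self.packages.append(data.strip())
            self._in_link = False

parser = SimpleIndexParser()
with urlopen('https://pypi.org/simple/') as resp:
    parser.feed(resp.read().decode('utf-8'))
print(len(parser.packages), 'package names')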

AMDmi3 commented 5 years ago

@ackalker unfortunately, none of this is usable.

ackalker commented 5 years ago

That's too bad. Could you please explain what is missing with this approach? The obvious lack of metadata besides package version numbers is the same as with most other methods. I would guess it is at least usable for bootstrapping (getting the initial list of package names and available versions). Given its speedy transfer, the list page could even be useful for quickly spotting package additions/removals.

AMDmi3 commented 5 years ago

ackalker commented 5 years ago

Thanks for the explanation. Please see pypa/warehouse#2912 for more recent upstream discussion of a JSON API.

AMDmi3 commented 5 years ago

That issue was closed a year ago. And I doubt an API would be of any help, as APIs usually do not provide bulk data access (and, as mentioned before, doing a lot of requests is not acceptable). Repology needs a dump.

abitrolly commented 5 years ago

https://packaging.python.org/guides/analyzing-pypi-package-downloads/ - can that help? BigQuery most likely requires registration, and there are some limits.

AMDmi3 commented 5 years ago

It needs a Google account. This is not acceptable.

davidak commented 4 years ago

The data is soon accessible under the the-psf.pypi.distribution_metadata public dataset on BigQuery.

https://github.com/pypa/warehouse/issues/7403#issuecomment-663131927

It needs a Google account. This is not acceptable.

So you need a JSON export? Would daily be OK? I fear the traffic alone would be a problem for hourly.
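
For reference, a hedged sketch of pulling names and versions from that public dataset with the google-cloud-bigquery client (it needs Google credentials, which is exactly the objection above; string MAX is only a rough stand-in for "latest"):

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # requires Google Cloud credentials
query = '''
    SELECT name, MAX(version) AS version  -- lexicographic, approximate
    FROM `the-psf.pypi.distribution_metadata`
    GROUP BY name
'''
for row in client.query(query).result():
    print(row.name, row.version)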

AMDmi3 commented 4 years ago

Since there's currently no PyPI support at all, any update frequency would be OK.

davidak commented 4 years ago

@AMDmi3 would it be acceptable for you to use Kaggle? You need an account to download datasets from there, but you can create one with just an e-mail and password. It's owned by Google.

AMDmi3 commented 4 years ago

No, Google crap which requires registration would definitely not be acceptable.

abitrolly commented 4 years ago

What is the size of the index?

If it is too big to be hosted on a CDN, then maybe with the JSON API it is possible to fetch everything once and then subscribe to updates?

pradyunsg commented 4 years ago

Version information has to be extracted from file names, I don't think it's reliable

Just wanted to hop in to say -- it is. pip also relies on this information and shouts at the user if things don't match.

no distinguishing of stable and development versions here either

There is -- https://www.python.org/dev/peps/pep-0440/ specifies how "development" versions are different from "stable" versions.
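
A quick illustration with the third-party packaging library (pip install packaging), which implements PEP 440:

from packaging.version import Version

for v in ['1.2.0', '1.4.1a1.dev1044']:
    # is_prerelease covers both pre-release and dev segments
    print(v, '-> prerelease:', Version(v).is_prerelease)
# 1.2.0 -> prerelease: False
# 1.4.1a1.dev1044 -> prerelease: True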

AMDmi3 commented 4 years ago

Closing, as there's nothing to do here on the Repology side until PyPI or someone else publishes a usable index.

abitrolly commented 4 years ago

  1. Get the size of the index
  2. Speak with Fastly
  3. Set up a sync script

PyPI is not a person and cannot do the steps above.

AMDmi3 commented 4 years ago

PyPI is maintained by people, and it's their job to make it accessible, especially after they broke it by removing the index. I'm definitely not doing it for them.

abitrolly commented 4 years ago

@AMDmi3 I am sure that nobody works at PyPI.

AMDmi3 commented 4 years ago

I've had to do the PyPI developers' work after all, as PyPI data is too important for Repology users.

I've set up a dedicated service at https://pypicache.repology.org/ which talks to PyPI via its APIs and updates metadata for packages which were recently changed. I've also populated it with the PyPI modules already present on Repology, so it should be fairly complete for the task.
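
As an illustration of the kind of per-project API involved (not necessarily what pypicache does internally), a minimal fetch against PyPI's JSON endpoint, one request per project, which is exactly why a caching service in front of it helps:

import json
from urllib.request import urlopen

with urlopen('https://pypi.org/pypi/toil-vg/json') as resp:
    data = json.load(resp)
print(data['info']['version'])         # latest version
print(len(data['releases']), 'release versions known')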

abitrolly commented 4 years ago

@AMDmi3 looks awesome. )

An endpoint to get information on individual project - not suitable as it requires thousands of HTTP requests to fetch data on all packages.

I am afraid https://release-monitoring.org/ does just that. :D @Zlopez can tell more.

Zlopez commented 4 years ago

Yes, we are doing this. We are trying to check each project once an hour, and as @AMDmi3 says, there are thousands of HTTP requests each hour.

You can check how many projects were checked in the last run at the bottom of the https://release-monitoring.org/ page.

abitrolly commented 4 years ago

@Zlopez there is no release-monitoring API to get all data on PyPI projects for comparison, right?

Zlopez commented 4 years ago

There is: https://anitya.readthedocs.io/en/stable/api.html#http-api-v2. You just need to specify the ecosystem in the /api/v2/projects/ GET request.
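
A minimal query along those lines (the response field names are assumptions based on Anitya's paginated v2 API, documented at the link above):

import json
from urllib.request import urlopen

url = 'https://release-monitoring.org/api/v2/projects/?ecosystem=pypi'
with urlopen(url) as resp:
    page = json.load(resp)
# assumed response shape: items / items_per_page / page / total_items
print(page['total_items'], 'PyPI projects known to Anitya')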

abitrolly commented 4 years ago

The API query returned 91754 projects; https://pypicache.repology.org/ reports 26796 packages.

AMDmi3 commented 4 years ago

I am afraid https://release-monitoring.org/ does just that

It's not really suitable for Repology.

We are trying to check each project once an hour

I don't think there's a point in that, since PyPI has an XMLRPC endpoint which returns projects changed since a given timestamp, allowing one to recheck only updated projects, with a much smaller lag and without skipping any updates. I still have to confirm that there haven't been any lost updates, though.

API query returned 91754. https://pypicache.repology.org/ reports 26796 packages.

As mentioned, these 26k fully cover Repology's needs. If there's demand, it can pull in all 274k PyPI packages.

abitrolly commented 3 years ago

Interesting coincidence that on the same day Google BigQuery went offline, PyPI XML-RPC started to experience overload. But the Google incident was reported at 12:07 UTC while the PyPI report is dated 09:41 UTC.

[screenshot: PyPI XML-RPC status report]

Interesting that the Atlassian Statuspage used by https://status.python.org/ doesn't highlight the incident in the log in any way.

[screenshot: status.python.org incident log]

swills commented 3 years ago

Interesting coincidence that on the same day Google BigQuery went offline, PyPI XML-RPC started to experience overload. But the Google incident was reported at 12:07 UTC while the PyPI report is dated 09:41 UTC.

This seems like the wrong place to report this issue.

abitrolly commented 3 years ago

I was just curious to check whether the reason for the PyPI XML-RPC overload was people who'd lost access to BigQuery and needed an alternative way to gather the same stats.