pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Unable to get a consistent view of existing/changed packages from the various APIs #9536

Open pfmoore opened 3 years ago

pfmoore commented 3 years ago

Describe the bug

I am trying to maintain a local copy of package metadata. That means calling the JSON API for any new or changed packages. But there's no JSON API to find what has changed, so I have to use the XMLRPC API.

The changed_packages and updated_releases APIs documented here don't seem to exist. So I use the list_packages_with_serial API to get information on when packages changed, and select out the records that I need.
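
The "select out the records that I need" step can be sketched as a comparison between two serial snapshots. This is only an illustration under assumptions: `diff_serials` and the sample dictionaries are hypothetical, and it assumes list_packages_with_serial returns a name-to-serial mapping whose serial increases whenever a package changes.

```python
def diff_serials(previous, current):
    """Compare two name -> serial snapshots.

    Returns (changed, vanished): names that are new or whose serial
    increased, and names no longer present in the current snapshot.
    """
    changed = sorted(
        name for name, serial in current.items()
        if name not in previous or serial > previous[name]
    )
    vanished = sorted(name for name in previous if name not in current)
    return changed, vanished


# Hypothetical snapshots; real ones would come from list_packages_with_serial().
old = {"gnu": 9900000, "pip": 100, "gone": 5}
new = {"gnu": 9900977, "pip": 100, "fresh": 7}
changed, vanished = diff_serials(old, new)
print(changed)   # ['fresh', 'gnu']
print(vanished)  # ['gone']
```

Only the names in `changed` would then need a JSON API call, which keeps the request count proportional to the number of changes rather than to the size of the index.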

However, it appears that list_packages_with_serial returns packages that don't have a JSON record:

>>> import xmlrpc.client
>>> XMLRPC = "https://pypi.org/pypi"
>>> pypi = xmlrpc.client.ServerProxy(XMLRPC)
>>> serials = pypi.list_packages_with_serial()
>>> serials['gnu']
9900977
>>> import httpx
>>> resp = httpx.get("https://pypi.org/pypi/gnu/json")
>>> resp.status_code
404

Looking at the changelog, it looks like the gnu package was deleted. OK, maybe if I use list_packages I can get just the current packages. I'll still need the list_packages_with_serial data, so it's an extra call, but maybe it will help. Nope, no such luck.

>>> all = pypi.list_packages()
>>> 'gnu' in all
True

The simple API also thinks gnu exists, although there are no links for it. The serial number on the simple page matches the XMLRPC value:

>>> resp = httpx.get("https://pypi.org/simple/gnu/")
>>> resp.status_code
200
>>> resp.text
'<!DOCTYPE html>\n<html>\n  <head>\n    <meta name="pypi:repository-version" content="1.0">\n    <title>Links for gnu</title>\n  </head>\n  <body>\n    <h1>Links for gnu</h1>\n    </body>\n</html>\n<!--SERIAL 9900977-->'
>>> serials['gnu']
9900977
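
For reference, a minimal sketch of how the serial and the number of links could be pulled out of a simple index page like the one above. The helper names (`inspect_simple_page`, `LinkCounter`) are hypothetical, and the sketch assumes the serial always appears as a trailing `<!--SERIAL n-->` comment, as it does in this response.

```python
import re
from html.parser import HTMLParser

SERIAL_COMMENT = re.compile(r"<!--SERIAL (\d+)-->")


class LinkCounter(HTMLParser):
    """Counts <a> tags, i.e. file links on a simple index page."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def inspect_simple_page(html_text):
    """Return (serial, link_count); serial is None if no comment is found."""
    match = SERIAL_COMMENT.search(html_text)
    serial = int(match.group(1)) if match else None
    counter = LinkCounter()
    counter.feed(html_text)
    return serial, counter.count


# Abbreviated copy of the page returned for gnu.
page = (
    '<!DOCTYPE html>\n<html>\n  <head>\n    <title>Links for gnu</title>\n'
    '  </head>\n  <body>\n    <h1>Links for gnu</h1>\n    </body>\n</html>\n'
    '<!--SERIAL 9900977-->'
)
print(inspect_simple_page(page))  # (9900977, 0)
```

A page that returns 200 with a link count of zero is one possible signal that the project has no installable files, although that is an inference from this one example rather than documented behaviour.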

So how can I get a list of all packages which can be successfully queried using the JSON API? If the various APIs gave consistent results, my current approach would work. But given the inconsistencies, the lack of a JSON API is difficult to work around. It's actually difficult to even define what my code should be doing - is a project that exists in the simple index but not in the JSON API a valid project? Pip will recognise it and can try to install it, but there's no PyPI page for it.
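
One possible workaround, sketched under assumptions, is to treat the JSON API status code itself as the source of truth and partition names accordingly. `json_api_status` and `split_by_json_availability` are hypothetical helpers; the URL pattern is the one used in the session above, and the sketch lumps every non-200 status (including transient server errors) into "missing", which a real script would probably want to handle separately.

```python
import urllib.error
import urllib.request


def json_api_status(name):
    """Return the HTTP status of the JSON API page for one project."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx responses; the code is on the exception.
        return exc.code


def split_by_json_availability(names, status_of=json_api_status):
    """Partition names into (available, missing) by JSON API status."""
    available, missing = [], []
    for name in names:
        (available if status_of(name) == 200 else missing).append(name)
    return available, missing


# Offline demonstration with a stand-in for the network call.
fake_status = {"gnu": 404, "pip": 200}.get
print(split_by_json_availability(["gnu", "pip"], status_of=fake_status))
# (['pip'], ['gnu'])
```

This still costs one request per name, so it only makes sense for the subset of packages whose serial has changed, not for the whole index.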

There's also mention of an RSS API, but that doesn't seem to include a way to specify the date from which you want to see what's changed.

Expected behavior

The JSON, simple and XMLRPC APIs give consistent results.

To Reproduce

See above - the XMLRPC calls return the package gnu but it's not in the JSON API.

My Platform

I'm not sure what is relevant here. I'm on Windows 10, with a simple network connection to PyPI through my ISP. I'm testing the APIs using ad hoc Python code in Python 3.9 (my actual code is a more complex script, but the above snippets demonstrate the problem in isolation).

Additional context

Is there a better way that I should be using to (in effect) maintain a mirror of the PyPI metadata, without needing to make an excessive number of calls to the PyPI server? I've considered parsing the changelog data, but the undocumented and relatively free-form nature of the "action" field makes this seem even more fragile than my current approach.

I'm aware of the work on documenting and improving the JSON API happening at https://github.com/pypa/packaging-problems/issues/367 but the first stage of that seems more about reorganising the existing API, and not adding new functionality (and worryingly, there seems to be an implication that the XMLRPC API can be deprecated in favour of the JSON API, which clearly isn't the case while there's no "list all packages" and "list all changes since " APIs).

pfmoore commented 2 years ago

I'm seeing a similar issue with just the simple index - currently, package hp075 is listed in https://pypi.org/simple, but the linked page, https://pypi.org/simple/hp075/, is giving 404 Not Found.

From the changelog:

Name   Serial    Version  Timestamp            Action
hp075  14568879           2022-07-27 18:40:17  create
hp075  14568880           2022-07-27 18:40:17  add Owner ChaosSage
hp075  14568881  1.0.0    2022-07-27 18:40:17  new release
hp075  14568882  1.0.0    2022-07-27 18:40:17  add py3 file hp075-1.0.0-py3-none-any.whl
hp075  14568883  1.0.0    2022-07-27 18:40:19  add source file hp075-1.0.0.tar.gz
hp075  14568935  1.0.1    2022-07-27 18:46:33  new release
hp075  14568936  1.0.1    2022-07-27 18:46:33  add py3 file hp075-1.0.1-py3-none-any.whl
hp075  14568937  1.0.1    2022-07-27 18:46:35  add source file hp075-1.0.1.tar.gz
hp075  14904075           2022-08-26 20:52:14  remove project

it looks like the project was removed last night, and the project page on the simple index has gone, but not the entry on the root page.
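
As a rough sketch of how changelog rows like these could be used to spot removed projects: the function below keeps the highest-serial entry per name and reports names whose last action is exactly "remove project". The row layout simply mirrors the columns of the table above, `removed_projects` is a hypothetical helper, and matching on the action string is an assumption drawn from this one example, since the action field is undocumented and free-form.

```python
def removed_projects(rows):
    """Return names whose highest-serial changelog entry is a removal.

    Each row is (name, serial, version, timestamp, action), mirroring
    the columns of the changelog table; version may be None.
    """
    latest = {}
    for name, serial, _version, _timestamp, action in rows:
        if name not in latest or serial > latest[name][0]:
            latest[name] = (serial, action)
    return sorted(
        name for name, (_serial, action) in latest.items()
        if action == "remove project"
    )


# A few rows copied from the hp075 changelog.
rows = [
    ("hp075", 14568879, None, "2022-07-27 18:40:17", "create"),
    ("hp075", 14568937, "1.0.1", "2022-07-27 18:46:35",
     "add source file hp075-1.0.1.tar.gz"),
    ("hp075", 14904075, None, "2022-08-26 20:52:14", "remove project"),
]
print(removed_projects(rows))  # ['hp075']
```

Names reported this way could be dropped from a local mirror even while the root page of the simple index still lists them.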