pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Add project last serial number to simple index "all projects" JSON output #12095

Closed · pfmoore closed this 2 years ago

pfmoore commented 2 years ago

**What's the problem this feature will solve?**
At the moment, the only way to get a list of all projects with their "last updated" serial number is via the XMLRPC API.

**Describe the solution you'd like**
Add an extra field, `_last-serial`, to the PEP 691 JSON output for the individual elements of the `projects` array on the root URL.
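
For concreteness, a minimal sketch of what a consumer might look like. The per-project `_last-serial` field is the proposal here, not something PyPI serves today; I'm assuming it would be spelled the same way as the existing `meta` key:

```python
# Sketch only: fetch the PEP 691 JSON form of the root simple index and read
# the proposed per-project "_last-serial" alongside the existing meta key.
import json
import urllib.request

req = urllib.request.Request(
    "https://pypi.org/simple/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json"},
)
with urllib.request.urlopen(req) as resp:
    index = json.load(resp)

print(index["meta"]["_last-serial"])  # already present today
for project in index["projects"][:5]:
    # Proposed addition: each entry would gain its own serial,
    # e.g. {"name": "pip", "_last-serial": 12345} (value illustrative)
    print(project["name"], project.get("_last-serial"))
```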

**Additional context**
Currently, the `meta` element on both the root page and the individual project pages contains the `_last-serial` key, but there is no way to get a list of all projects and the last time they were updated in a single API call.

Having this information would be useful when mirroring data - by getting the list of all projects and their last update serial, it is possible to check what has been updated since the last mirror run.

**Alternatives Considered**
It is, of course, possible to continue using the XMLRPC API, but that is significantly slower (over 10 seconds vs 0.25 seconds). If the performance difference is the result of extracting the serial numbers, though, this idea is a non-starter and it's better to keep the simple index fast.
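
For comparison, the XMLRPC route is a single call that returns a `{project_name: last_serial}` mapping for every project:

```python
# The existing (slower) XML-RPC alternative described above.
from xmlrpc.client import ServerProxy

client = ServerProxy("https://pypi.org/pypi")
serials = client.list_packages_with_serial()  # {name: last_serial}
print(len(serials), "projects; highest serial:", max(serials.values()))
```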

Using ETag values to avoid fetching unneeded data is reasonable (and probably should be done in any case) but given that the response body is often not much larger than the header, and even getting a "Not Modified" response needs a network round trip, it's not obvious that this is a significant saving (it's certainly not as good as never calling the server at all...)
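
A rough sketch of that ETag approach, using a conditional GET that yields `304 Not Modified` (which, as noted, still costs a round trip) when nothing has changed:

```python
# Sketch of a conditional fetch: reuse the ETag saved from the previous run
# and skip re-downloading the body when the server says nothing changed.
import urllib.request
from typing import Optional, Tuple
from urllib.error import HTTPError

def fetch_if_changed(url: str, etag: Optional[str]) -> Tuple[Optional[bytes], Optional[str]]:
    headers = {"Accept": "application/vnd.pypi.simple.v1+json"}
    if etag:
        headers["If-None-Match"] = etag
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag")
    except HTTPError as e:
        if e.code == 304:  # Not Modified: keep using the cached body
            return None, etag
        raise
```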

pfmoore commented 2 years ago

I've created a basic PR, #12134, that adds this, to aid with review.

dstufft commented 2 years ago

Sorry, I meant to respond but it slipped my mind.

I don't oppose this idea in theory (in fact I think it's a good idea, and is part of what will allow bandersnatch to stop using the XMLRPC API).

My biggest concern is whether we should standardize it through a PEP or not. I know that `_last-serial` is on the project-specific pages already without a standard; I added it mostly because the existing HTML pages had it, since it forces the ETag to change when the serial changes (which is important for our CDN). The big thing that muddies this is that I'm not sure whether anyone but PyPI even has the idea of a serial, whether that concept is useful at all outside of PyPI, or whether there is value in standardizing it at all [^1].

[^1]: Although I guess the value would be it gives things like bandersnatch a standard way to work?

pfmoore commented 2 years ago

IMO, we should do this now as a Warehouse-specific change, then think about standardising later. I can see the value of standardising something to let mirrors say "is this stale?" but does anyone mirror any index other than PyPI? I'd be inclined to wait for a use case before standardising.

On the other hand, standardising allows people like me (who aren't formally mirroring but are using the mechanisms involved) to feel more confident that the serial numbers won't be going away any time soon[^1].

[^1]: On which note, what's the status of the changelog data? I'm currently mirroring it via the XMLRPC API, and it has some useful information in it, not least a means to roughly match serial numbers to times. But I'm very aware that I'm a niche use case here, and it would be very easy for the data to just disappear due to internal Warehouse changes.
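
For reference, the changelog mirroring in that footnote looks roughly like the following; `changelog_since_serial` is an existing XML-RPC method, and what you do with each entry is elided:

```python
# Rough shape of XML-RPC changelog mirroring. Each entry is
# (name, version, timestamp, action, serial), which is what makes it
# possible to roughly match serial numbers to times.
from xmlrpc.client import ServerProxy

client = ServerProxy("https://pypi.org/pypi")
last_seen_serial = 0  # in practice, the highest serial stored by the previous run
for name, version, timestamp, action, serial in client.changelog_since_serial(last_seen_serial):
    print(serial, timestamp, name, version, action)  # record it somewhere
```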

dstufft commented 2 years ago

> does anyone mirror any index other than PyPI? I'd be inclined to wait for a use case before standardising.

Does a mirror of a mirror count? AFAIK bandersnatch supports mirroring from another bandersnatch mirror, but I think mirroring in general is largely PyPI-specific.

> On which note, what's the status of the changelog data? I'm currently mirroring it via the XMLRPC API, and it's got some useful information in it, not least a means to roughly match serial numbers to time, but I'm very aware that I'm a niche use case here, and it would be very easy for the data to just disappear due to internal Warehouse changes.

See https://github.com/pypi/warehouse/issues/11918, which questions whether we can deprecate the changelog in favor of the newer and better audit logs (which I don't think we've exposed via any API yet).

pfmoore commented 2 years ago

I did think about what would be involved in standardising a mirroring API. Obviously we could simply document what Warehouse provides, but all that really does is say that everyone else has to replicate Warehouse, which may or may not be easy (or even possible for them). Or we could design a "better" API, but that just means Warehouse has to change and we still don't have any assurance that other index providers can offer the same functionality.

I want to move off the XML-RPC API, so I'd prefer to get this done now, and then look at standardising as a follow-up discussion (which I'm willing to start on Discourse, but I'd prefer to get my current backlog, including this PR and its implications for my mirroring code, resolved first[^1]).

[^1]: Even if "resolved" means "the Warehouse devs rejected it because they want to stick with XML-RPC until we have a standardised solution".

dstufft commented 2 years ago

I've been thinking about this change, and I think ultimately the `_` prefix, the lack of a standard, and the existing `_last-serial` on the project pages mean we can probably say that it's OK to add this. If we end up standardizing it and/or wanting to remove it, those same things signal the lack of a backwards-compatibility guarantee, which would let us remove or change it if needed.

One thought I do have, because there's an inherent difference between how we cache the XMLRPC response and /simple/, and I want to at least raise awareness and make sure that it doesn't void the usefulness of this for you: the /simple/ responses are cached by our CDN for a day, with a 5 minute "stale while revalidate" window[^1], and we don't invalidate that cache when projects are updated[^2].

So then that opens the question of: does a serial number that can be up to a day (or more) out of date still work for your use case?[^3]

[^1]: This means that if the cached object is < 1 day old, then the cached object will be returned by the CDN. If the cached object is > 1 day old but < 1 day + 5 minutes old, when a request is made the CDN will see that the cached object is "stale" but not "too stale", and will return the cached object immediately, but set a background task to fetch a new copy of the cached object from the origin servers and update its copy for future requests. This "stale while revalidate" behavior is used across PyPI, is why you'll often see it take two requests to fetch a new /simple/$project/ page after a release, and is used to provide an overall faster and more consistent response speed.

[^2]: This was an explicit choice, as these responses are very large and slow to render, and historically the data on /simple/ wasn't really being used very much or in ways that required particularly fresh data. But the exact times were largely chosen at random, in some cases 7 years ago, and never really formally evaluated. I think I recall someone mentioning an issue recently (I can't seem to find it; it might have been on Discourse, or maybe Twitter or something) where brand-new projects fail to show up on /simple/ for an extended period of time, which is also due to this 1 day cache without invalidation.

[^3]: I would be hesitant to invalidate /simple/ such that the serial numbers were always fully up to date in its current form, largely because while PyPI's write traffic is relatively low, it's still high enough that we'd effectively be regenerating our slowest routes regularly. Maybe, though, there are some changes (project addition/deletion?) that we should invalidate on, plus shortening how long we cache for, to help move people off of XMLRPC?

pfmoore commented 2 years ago

Great. I don't think the caching behaviour is likely to be a problem (for me, at least). I'll have to do some more thinking about the details, but my personal need is not for "up to the minute" data, but simply for not fetching stuff I don't need to[^1]. Add to that the fact that I don't have automated mirroring set up, so I simply do a manual refresh "every so often" (which is typically measured in weeks, not days or hours). As a result, I'm really not bothered about being behind a day old cache.

Having said that, I do think it might be useful to document the caching behaviour. Things like that have, as you say, important but non-obvious implications for how people can use the APIs, and people tend to only find out about them through hearsay. Of course, it's easy to say that: there's always more documentation that could be added and never time to do so. So don't treat this as yet another demand on your time (or anyone else's) 🙂

[^1]: My reasons for that are two-fold. I'd like to not put more stress on PyPI than I have to, but also the bottleneck in my process is the single-threaded updates to a sqlite database, so the fewer pages I have to update, the better.
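
A hypothetical sketch of that bookkeeping, just to show the shape of the "skip what hasn't changed" check; the database file, table, and column names are invented for illustration:

```python
# Compare each project's latest serial against the one stored in SQLite
# and return only the projects whose pages actually need re-fetching.
import sqlite3

db = sqlite3.connect("mirror.db")
db.execute("CREATE TABLE IF NOT EXISTS serials (name TEXT PRIMARY KEY, serial INTEGER)")

def projects_to_refresh(latest):
    """latest: {project_name: last_serial}, e.g. from the proposed JSON field."""
    stored = dict(db.execute("SELECT name, serial FROM serials"))
    return [name for name, serial in latest.items() if stored.get(name, -1) < serial]
```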

dstufft commented 2 years ago

Given that the current caching strategy isn't going to be a problem for you (the person with the PR, and the only one who's come forward with a solid use case for it), and that we can change the caching at any time without really affecting the public interface in a negative way, I'm happy to go ahead and accept it as is.

I did open https://github.com/pypi/warehouse/issues/12155 to track the suggestion of documenting our caching strategy.

dstufft commented 2 years ago

Completed in https://github.com/pypi/warehouse/pull/12134

It should be deployed in the next 10-15 minutes, then you'll have to wait until the responses fall out of the cache before you'll see it.