chriseldredge opened this issue 2 years ago
Okay, we have now rolled out the workaround provided by Fastly.
I'd really appreciate updates on instances of stale caches over the next week or so!
Expected PyPI serial 16729491 for request https://pypi.org/pypi/cfy-lint/json but got 16199015.
since 2023-02-07 15:53:58 (UTC)
I've tried some curl requests and found that the stale serial is always returned when --compress is specified, while clean data is sometimes returned (for the same IP) without it. Did the workaround take effect on all edges, and on both IPv4 and IPv6?
It seems we had a flurry of failed purge tasks in the first day after rolling out the workaround. I've confirmed that one of those failures was for cfy-lint, but it should have been retried. The failures have since gone away.
Two things to look at:
On second look it appears that the tasks should have been retried... but we weren't catching them in metrics. #12968 will help us track this down if purge failures occur again.
The failed purges were due to some eventual consistency that should have settled. We haven't seen any failed purge tasks in our logs in the past 48 hours. @chriseldredge @TechCiel can you provide updates on what you're seeing for stale content?
Hi @ewdurbin, glad to hear that! I had a look at the log; the last long-lasting error was for serotiny, resolved after Feb 7 23:21 UTC.
Some stale content still appears, but it quickly gets purged after a few retries without causing any problems. I wonder if that's because I polled too closely, right between the update and the purge.
I've disabled my "check uncompressed" workaround and will keep monitoring the status. I'll let you know if the mirror is broken. Thanks all!
Just hours after switching off the "check uncompressed" workaround, the package mianxiang-wangluo-zhuanjia-de-linux-jifan (16801627) went stale. Re-enabling the workaround fixed the problem, though maybe just this once.
Package maturin went stale about 3 hours ago. Checking the recent logs, I found a burst of stale content.
I've pasted the result of grep 'Giving up' pypi-*.log here: https://pastebin.com/BhzZGxd7. Please note the log is in UTC+8. Sadly, it seems we're still far from actually fixing this problem.
Thanks @TechCiel, we're continuing to follow up with Fastly support.
@chriseldredge do you have any updates?
Hi @ewdurbin, unfortunately it looks like the mitigations have not addressed the stale cache bug in a meaningful way. It may be a coincidence, but anecdotally things actually look worse on some recent days than they did in January.
Here's a breakdown of the number of times my replicator gave up after waiting 10 minutes for the serial on the package info page to agree with what the XML-RPC API said was the highest serial at the time:
1 2023-01-28
1 2023-01-29
3 2023-02-06
13 2023-01-24
13 2023-01-26
14 2023-02-01
26 2023-01-30
53 2023-02-02
81 2023-02-09
82 2023-02-08
97 2023-02-07
101 2023-02-12
101 2023-02-16
109 2023-02-11
160 2023-02-10
210 2023-02-13
318 2023-02-15
909 2023-02-14
Valentine's Day was rough!
Oof, that's not looking great at all. @chriseldredge can you help us understand what those numbers mean? Are these all from a single "replicator" and do they include repeats/retries?
We're gonna keep working with Fastly to see if there's something else we can do 😓
Those are counts of the number of times my replicator service gave up after 10 minutes waiting for the package info page to be up to date with xmlrpc. There are many duplicates since sometimes the cache purges appear to fail around the globe. If it is useful, please let me know more specifically what data and format would help you continue to investigate the issue.
I'll reiterate that the problem appears to affect packages with a high number of releases (e.g. checkov), which makes me wonder whether the origin has actually completed whatever asynchronous workflows are involved before the cache purge is requested. In other words, if the purge happens before the next call to the package page would return up-to-date information, the race condition is in the origin.
There are currently no async workflows involved in uploading a package and having the origin serve it.
There's a single http handler that makes changes to the database, commits it, then fires off a purge task.
There's another http handler that reads committed things from the database to serve the simple pages.
It shouldn't be possible for the purge task to fire until after the database has committed it, at which point the origin should already be serving the updated content.
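For illustration, the ordering described above amounts to something like this hypothetical sketch (made-up helper names, not Warehouse's actual code):

```python
# Hypothetical sketch of the ordering described above (not Warehouse's code):
# mutate, commit, and only then fire the purge, so any request that arrives
# after the purge is processed must see the committed rows.

def handle_upload(request):
    project = store_release_files(request)         # hypothetical: writes to the database
    request.db.commit()                            # changes are durable from here on
    enqueue_purge(project_surrogate_key(project))  # hypothetical: purge fires only after commit
    return ok_response()

def handle_simple_page(request):
    # Reads only committed rows, so by the time the purge task runs,
    # the origin is already serving the updated content.
    project = load_committed_project(request.matchdict["name"])  # hypothetical
    return render_simple_page(project)
```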
Just to document this all here, I've been doing some thinking on this problem to try and figure out if we can do something to fix or mitigate it.
I've yet to come up with a reasonable way for us to fix this on our end, unless there's some combination of Fastly features we're unaware of that would make it work, and so far Fastly hasn't been able to get us a resolution either. That pretty much leaves us with mitigating the issue, and I think I've maybe come up with a workable strategy, but it requires some subtle changes to how PyPI functions, so I figured I'd lay it out here and see what people think.
Essentially this problem boils down to the fact that Fastly + Warehouse form a large, complex distributed system that is always going to be eventually consistent to some degree. The hope was that by using purging we could force the CDN to become consistent "right now", which meant we could tune our parameters so that "eventually consistent" could mean quite a long time. Unfortunately, for whatever reason, purging has not been reliable enough for this use case, which means we regularly fall back to "quite a long time", which breaks mirroring until the service becomes consistent again (eventually!).
So the basic thrust of our mitigation here is to retune our TTL, dropping it from 24 hours to a more reasonable value, maybe 10 or 15 minutes, which means that we should expect the system to self-heal and become consistent on its own, even without purging, in 10-15 minutes (or maybe 20-30 minutes with shielding? I'm not sure offhand).
On its own that would mean the number of requests that actually hit the origin servers goes up by quite a lot: each cached object goes from hitting the origin servers every 24h to every 10-30 minutes, roughly a ~150x worst-case increase in requests to the origin servers for some of our heaviest views (if I recall correctly).
Fastly does have a feature to mitigate the impact of this (which we already employ): rather than blindly fetching a whole new response after the existing object expires, it will attempt to re-validate its existing object using a conditional HTTP request, effectively saying "Hey, I want X, but only if its ETag is no longer Z". Warehouse will look at that and either respond "hey, it hasn't changed", a 304 Not Modified with an empty response body telling Fastly to reuse its existing object and treat it as fresh, or "it's changed", a 200 OK with the full response.
Unfortunately, the way Warehouse works is that to process a conditional HTTP request it first runs the view code and does all of the work to generate a response; then, at the last minute, it checks whether the request and response have matching ETags, and if they do it throws away the response it just generated and returns an empty 304 Not Modified. But it still had to do all of the work to generate the response it threw away.
Fundamentally there's no reason why you have to generate the response before processing the conditional request, except that the way Warehouse generates an ETag is to take an md5 hash of the response body [^1], because if you want to generate an ETag generically, there's not really a better way to do it [^2]. That means that, as implemented, we can't know what the ETag of a response is going to be without generating the response.
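For concreteness, the generic body-hash approach is roughly this (a simplified sketch, not Warehouse's exact code):

```python
# Simplified sketch of a generic body-hash ETag: the whole (expensive)
# response body must be generated before the tag is known.
import hashlib

def etag_for(body: bytes) -> str:
    return '"' + hashlib.md5(body).hexdigest() + '"'

body = render_project_json()   # hypothetical: the expensive rendering step
etag = etag_for(body)          # only knowable after the work is already done
```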
However, if we had a lightweight way to predetermine what the ETag was going to be, we could process the conditional request up front and skip generating the response body whenever we're going to return a 304 Not Modified. As it happens, for the pages we really care about here, we could just use the last serial as our ETag (or derive it from the last serial somehow), and we do have a lightweight way to look up the last serial for a project (it's a column on the projects table).
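A rough sketch of what that short-circuit could look like in a Pyramid-style view (hypothetical helper names; request.if_none_match is WebOb's parsed If-None-Match header):

```python
# Rough sketch (not Warehouse's actual view): answer the conditional request
# from the cheap last_serial column before rendering anything expensive.
from pyramid.httpexceptions import HTTPNotModified

def project_json_view(project, request):
    current_etag = str(project.last_serial)      # cheap lookup: a column on projects
    if current_etag in request.if_none_match:    # WebOb parses If-None-Match for us
        return HTTPNotModified()                 # skip generating the body entirely
    response = render_project_json(project, request)  # hypothetical expensive render
    response.etag = current_etag                 # WebOb adds the quotes on output
    return response
```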
That would be great, except there's another wrinkle. We have middleware in Warehouse that handles compression so that we can safely and intelligently enable compression without falling victim to CRIME, and it also reduces our egress to Fastly (at the cost of the CPU overhead to compress).
Obviously, to compress the response body we have to have the response body, so that middleware doesn't function without the full response body. The middleware also has to implement its own ETag scheme, because the ETags for the compressed and uncompressed variants of a response need to differ.
The simplest thing to do would be to remove compression from Warehouse and shift it to the Fastly layer (https://github.com/pypi/infra/pull/120); unfortunately, that means our egress costs would go up. If we did that, we could have Warehouse short-circuit on conditional requests, which would hopefully mean we could reduce our TTL and make this whole problem a lot more self-healing.
I'm experimenting with options to keep the compression in Warehouse and still allow this short circuiting on conditional requests behavior.
One option is to turn our responses into streaming responses, but Pyramid can sometimes implicitly consume a streaming response and make it non-streaming if anything attempts to access the response body. Since that happens implicitly, we wouldn't short-circuit if anything caused it, and it would be non-obvious that it was happening, so it would end up being fairly fragile I think.
I'll see if I can come up with any other options though.
[^1]: Fun fact, this is why the last serial is included in those response bodies: otherwise, changes that incremented the serial but didn't affect the response body would mean that Fastly would resurrect the old cached object, with the old last-serial header, so we had to force the response body to change for every last-serial increment.
[^2]: To be really accurate, you'd also want to hash some or most of the response headers as well, but just doing the body gets you 95% of the way there.
It seems to me that your Fastly cache invalidation does work most of the time, and the cases where my replicator runs into trouble are generally identical to a case I observed earlier today.
My replicator running in us-east-1 got stuck processing changes to spanishconjugator:
2023-02-23T09:25:15Z Expected package spanishconjugator refresh serial to be greater or equal to 16990807 but was 16990805
Looking at the relevant entries in the Warehouse XML-RPC changelog_since_serial API:
Thu Feb 23 09:24:19 2023 spanishconjugator 2.3.4236 serial=16990804 action=new release
Thu Feb 23 09:24:19 2023 spanishconjugator 2.3.4236 serial=16990805 action=add py3 file spanishconjugator-2.3.4236-py3-none-any.whl
Thu Feb 23 09:24:21 2023 spanishconjugator 2.3.4236 serial=16990807 action=add source file spanishconjugator-2.3.4236.tar.gz
This is similar to all the cases I've investigated. It looks like the cache invalidation manages to nudge Fastly to refresh the package pages the first and maybe second time (I don't know offhand whether "new release" triggers invalidation). But two seconds after the py3 file is added, the source file gets added, and that is the change that appears to fail to propagate through Fastly some of the time.
Supposing that the origin Warehouse system is consistent in the absence of a CDN, e.g. that information appearing in changelog_since_serial responses is never newer than what appears in the simple/JSON APIs for a given package, then it seems we can narrow the scope of this problem down to how Fastly behaves as a distributed network when there are overlapping activities. Hypothetically, here's a sequence that could explain what we see:

1. The py3 file is added (serial 16990805); a purge is issued and an edge starts fetching a fresh copy of the package page, which still reflects serial 16990805.
2. The source file is added (serial 16990807) and a second purge is issued, which the edge processes immediately.
3. The in-flight fetch from step 1 completes and is cached at the edge carrying serial 16990805, with no further purge arriving to evict it.

The precise order may not matter; the key concept is that the final cache invalidation instruction gets processed before an in-flight request/response completes.
I'm not familiar with the timing of the cache invalidation requests sent to Fastly, but if this hypothesis is correct, then debouncing invalidation requests could resolve the issue. In the case above, the two events came within 2 seconds of each other, so a debounce threshold of 5 or 10 seconds could be adequate. This can also happen when a maintainer publishes more than one version of their package in quick succession, so the debouncing needs to be aggregated at the package level, not at the release-version level.
Thanks @chriseldredge, that aligns with what I suspect is happening as well. @ewdurbin and I are roughing out an approach to batch purge multiple unique keys with a short delay (essentially, debouncing) to see if that mitigates the issue.
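For reference, a per-key debounce along those lines might look something like the following sketch (hypothetical; send_batch_purge stands in for whatever call issues the batched surrogate-key purge to Fastly):

```python
# Hypothetical sketch of debounced, batched purges aggregated per surrogate key:
# a key is only purged once it has been quiet for DEBOUNCE_SECONDS, so an upload
# burst (wheel + sdist seconds apart) results in a single purge after the last write.
import threading
import time

DEBOUNCE_SECONDS = 5.0
_last_requested: dict[str, float] = {}   # surrogate key -> time purge was last requested
_lock = threading.Lock()

def request_purge(key: str) -> None:
    with _lock:
        _last_requested[key] = time.monotonic()

def purge_loop(send_batch_purge) -> None:
    while True:
        time.sleep(1.0)
        now = time.monotonic()
        with _lock:
            ready = [k for k, t in _last_requested.items() if now - t >= DEBOUNCE_SECONDS]
            for key in ready:
                del _last_requested[key]
        if ready:
            send_batch_purge(ready)   # hypothetical: one batched purge call to Fastly
```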
expose-localhost has been stale since 0:22 UTC.
Is anyone looking into this? The incident has remained unresolved for 8 days.
Hi, can anyone have a look at the package expose-localhost? It seems the JSON API has been giving out stale responses again for about 2 weeks.
2023-04-06 12:26:33,912 ERROR: Error syncing package: expose-localhost@17357317 (mirror.py:369)
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/bandersnatch/package.py", line 61, in update_metadata
self._metadata = await master.get_package_metadata(
File "/usr/local/lib/python3.9/site-packages/bandersnatch/master.py", line 240, in get_package_metadata
metadata_response = await metadata_generator.asend(None)
File "/usr/local/lib/python3.9/site-packages/bandersnatch/master.py", line 158, in get
await self.check_for_stale_cache(path, required_serial, got_serial)
File "/usr/local/lib/python3.9/site-packages/bandersnatch/master.py", line 137, in check_for_stale_cache
raise StalePage(
bandersnatch.master.StalePage: Expected PyPI serial 17357317 for request https://pypi.org//pypi/expose-localhost/json but got 17357314. HTTP PURGE has been issued to the request url
expose-localhost should be consistent again: https://github.com/pypi/warehouse/issues/12933#issuecomment-1499006912
Hi, many thanks for your reply. However, I still don't see the change on my side. I wonder if purging the CDN is needed.
# curl -H 'Fastly-Debug: 1' -sq -I "https://pypi.org//pypi/expose-localhost/json"
HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ https://api.github.com/search/issues https://*.google-analytics.com https://*.analytics.google.com https://*.googletagmanager.com fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://cdn.jsdelivr.net/npm/mathjax@3.2.2/es5/sre/mathmaps/ https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self' https://checkout.stripe.com; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ https://*.google-analytics.com https://*.googletagmanager.com *.fastly-insights.com *.ethicalads.io; script-src 'self' https://*.googletagmanager.com https://www.google-analytics.com https://ssl.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIaxBDEzwGJApJgPEf2mWk6cfMWghrAa6i0=' https://cdn.jsdelivr.net/npm/mathjax@3.2.2/ 'sha256-1CldwzdEg2k1wTmf7s5RWVd7NMXI/7nxxjJM2C4DqII=' 'sha256-0POaN8stWYQxhzjKS+/eOfbbJ/u4YHO5ZagJvLpMypo='; style-src 'self' fonts.googleapis.com *.ethicalads.io 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' 'sha256-JLEjeN9e5dGsz5475WyRaoA4eQOdNPxDIeUhclnJDCE=' 'sha256-mQyxHEuwZJqpxCw3SLmc4YOySNKXunyu2Oiz1r3/wAE=' 'sha256-OCf+kv5Asiwp++8PIevKBYSgnNLNUZvxAp4a7wMLuKA=' 'sha256-h5LOiLhk6wiJrGsG5ItM0KimwzWQH/yAcmoJDJL//bY='; worker-src *.fastly-insights.com
content-type: application/json
etag: "4oVMDHb/pnY9+7kAW0msjg"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key: all-legacy-json project-legacy-json project/expose-localhost
x-pypi-last-serial: 17357314
accept-ranges: bytes
date: Thu, 06 Apr 2023 12:57:34 GMT
fastly-debug-path: (D cache-bkk2310027-BKK 1680785854) (F cache-bkk2310026-BKK 1680785539) (D cache-iad-kcgs7200080-IAD 1680785539) (F cache-iad-kiad7000023-IAD 1680784292)
fastly-debug-ttl: (H cache-bkk2310027-BKK 86084.455 86400.000 316) (H cache-iad-kcgs7200080-IAD 85152.967 86400.000 1247)
fastly-debug-digest: d49e8de3eb792c7fd382a2cd2bfee2ada230a6affc998f25f70eeb1aada7d954
x-served-by: cache-iad-kcgs7200080-IAD, cache-bkk2310027-BKK
x-cache: HIT, HIT
x-cache-hits: 2, 2
x-timer: S1680785854.262743,VS0,VE0
vary: Accept-Encoding, Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 11771
Another report of a stale cache here: https://github.com/pypi/warehouse/pull/13222#issuecomment-1514732599
Hi, can anyone have a look at the package cyaxares? The JSON API is giving a stale serial again:
curl -H 'Fastly-Debug: 1' -sq -I "https://pypi.org/pypi/cyaxares/json"
HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ https://api.github.com/search/issues https://*.google-analytics.com https://*.analytics.google.com https://*.googletagmanager.com fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://cdn.jsdelivr.net/npm/mathjax@3.2.2/es5/sre/mathmaps/ https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self' https://checkout.stripe.com; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ https://*.google-analytics.com https://*.googletagmanager.com *.fastly-insights.com *.ethicalads.io; script-src 'self' https://*.googletagmanager.com https://www.google-analytics.com https://ssl.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIaxBDEzwGJApJgPEf2mWk6cfMWghrAa6i0=' https://cdn.jsdelivr.net/npm/mathjax@3.2.2/ 'sha256-1CldwzdEg2k1wTmf7s5RWVd7NMXI/7nxxjJM2C4DqII=' 'sha256-0POaN8stWYQxhzjKS+/eOfbbJ/u4YHO5ZagJvLpMypo='; style-src 'self' fonts.googleapis.com *.ethicalads.io 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' 'sha256-JLEjeN9e5dGsz5475WyRaoA4eQOdNPxDIeUhclnJDCE=' 'sha256-mQyxHEuwZJqpxCw3SLmc4YOySNKXunyu2Oiz1r3/wAE=' 'sha256-OCf+kv5Asiwp++8PIevKBYSgnNLNUZvxAp4a7wMLuKA=' 'sha256-h5LOiLhk6wiJrGsG5ItM0KimwzWQH/yAcmoJDJL//bY='; worker-src *.fastly-insights.com
content-type: application/json
etag: "I6nQY/xK6L6Fa0ceG9h+hw"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key: all-legacy-json project-legacy-json project/cyaxares
x-pypi-last-serial: 18070352
accept-ranges: bytes
date: Tue, 06 Jun 2023 06:51:22 GMT
fastly-debug-path: (D cache-bkk2310030-BKK 1686034283) (F cache-bkk2310023-BKK 1686034283) (D cache-iad-kcgs7200148-IAD 1686034283) (F cache-iad-kjyo7100086-IAD 1685537219)
fastly-debug-ttl: (M cache-bkk2310030-BKK - - 0) (H cache-iad-kcgs7200148-IAD 21488.748 86400.000 64911)
fastly-debug-digest: 46e7843fbd1ca1439208c2876738b01e4b0d6f2a454be0efbe3d43964c1d42c7
x-served-by: cache-iad-kcgs7200148-IAD, cache-bkk2310030-BKK
x-cache: HIT, MISS
x-cache-hits: 39, 0
x-timer: S1686034283.603500,VS0,VE295
vary: Accept-Encoding, Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 5098
The expected serial should be 18070374, but 18070352 was returned.
@EWDurbin 2023-06-13: Another instance of #12933
Hi. I'm going to be taking a closer look at this issue again tomorrow with Fastly.
I haven't seen reports unrelated to #12933 here since February. If you have reports of stale caches (not stuck serials), please let me know!
As far as I am currently aware, we are only seeing stuck/stale stuff due to #12933, which #13936 may resolve.
If we are no longer seeing Fastly cache issues, it may be that https://github.com/pypi/warehouse/blame/74d3a4335acc3840f3f718554c7868d704a322eb/warehouse/cache/origin/fastly.py#L131 resolved the issue?
Hi, I'm a developer and some of my workflows involve releasing a new version of my PyPI package and then subsequently using that version. Lately I've been making the following observation:

- When I run pip install mypackage==x.x.x right after a release, the first time it does not work. The second time, it does work.

I came here to report this issue and quickly found this thread. I'm not knowledgeable about some of the terms in this issue, such as "Fastly cache", but I get the feeling that my issue is likely the same as this one. Given that, I thought I should comment here before opening a new issue.
So I suspect you all are already aware and working on this. But in the meantime, is there any recommended workaround to ensure that the metadata I get is always valid? Ideally a workaround that works both for pip and for reading the raw output of the JSON API directly. I read through https://warehouse.pypa.io/api-reference/json.html but didn't find anything helpful. In this thread people are talking about things I'm curious about, such as -H 'Fastly-Debug: 1' and --compressed, but I'm not sure what they mean, and I'm wondering whether those are supposed to ensure that the metadata is valid and up to date?
In the last week, I've seen spanishconjugator run into issues multiple times, along with some other one-offs:
2023-06-14 08:23:01.905 essenceapi
2023-06-14 08:59:29.640 willow
2023-06-15 00:46:11.293 openet-sims
2023-06-15 14:33:11.496 spanishconjugator
2023-06-17 00:34:55.405 pepperize-cdk-github
2023-06-19 20:32:14.270 spanishconjugator
Likely related to #11936, I am seeing stale content served for the URL https://pypi.org/pypi/dbnd-postgres/json.
The following events show in changelog_since_serial today. However, this curl command shows the serial ID did not pick up the most recent event:
Output:
When I repeat this command without the --compressed flag, I see the expected output:
Output:
This indicates to me that cache invalidation did not work.
I see from recent GitHub issues that this has been a recurring problem lately. It is painful, since without manual intervention it won't self-resolve until the 24-hour TTL expires.
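In script form, the check described above (and what the mirroring tools earlier in the thread effectively do) is roughly the following sketch; the project name and expected serial are placeholders:

```python
# Sketch of the staleness check described above: a mirror that already knows
# the newest serial for a project (e.g. from the XML-RPC changelog) compares
# it with the X-PyPI-Last-Serial header on the JSON API response.
import urllib.request

PROJECT = "dbnd-postgres"
EXPECTED_SERIAL = 18_000_000   # placeholder: taken from changelog_since_serial in practice

request = urllib.request.Request(
    f"https://pypi.org/pypi/{PROJECT}/json",
    headers={"Accept-Encoding": "gzip"},   # the compressed variant is the one seen going stale
)
with urllib.request.urlopen(request) as response:
    got_serial = int(response.headers["X-PyPI-Last-Serial"])

if got_serial < EXPECTED_SERIAL:
    print(f"stale: expected {EXPECTED_SERIAL}, got {got_serial}")  # retry, purge, or alert
else:
    print("fresh enough")
```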