chriseldredge opened this issue 2 years ago
Okay, we have now rolled out the workaround provided by Fastly.
I'd really appreciate updates on instances of stale caches over the next week or so!
Expected PyPI serial 16729491 for request https://pypi.org/pypi/cfy-lint/json but got 16199015.
since 2023-02-07 15:53:58 (UTC)
I've tried some curl requests and found that the stale serial is always returned when --compress is specified, while clean data is sometimes returned (for the same IP) without it. Did the workaround take effect on all edges, and on both IPv4 and IPv6?
It seems we had a flurry of failed purge tasks in the first day after rolling out the workaround. I've confirmed that one of those failures was for cfy-lint, but it should have been retried. The failures have since gone away.
Two things to look at:
On second look it appears that the tasks should have been retried... but we weren't catching them in metrics. #12968 will help us track this down if purge failures occur again.
The failed purges were due to some eventual consistency that should have settled. We haven't seen any failed purge tasks in our logs in the past 48 hours. @chriseldredge @TechCiel can you provide updates on what you're seeing for stale content?
Hi @ewdurbin, glad to hear that! I had a look at the log; the last long-lasting error was for serotiny, resolved after Feb 7 23:21 UTC.
Some stale content still appears, but it quickly gets purged after a few retries without causing any problems. I wonder if that's because I polled too closely, right between the update and the purge.
I've disabled my "check uncompressed" workaround and will keep monitoring the status. I'll let you know if the mirror is broken. Thanks all!
Just hours after switching off the "check uncompressed" workaround, the package mianxiang-wangluo-zhuanjia-de-linux-jifan (16801627) went stale. Re-enabling the workaround fixed the problem, though maybe just this once.
Package maturin went stale about 3 hours ago. Checking the recent logs, I found a burst of stale content.
I've pasted the result of grep 'Giving up' pypi-*.log here: https://pastebin.com/BhzZGxd7. Please note the log is in UTC+8. Sadly, it seems we're still far from actually fixing this problem.
Thanks @TechCiel, we're continuing to follow up with Fastly support.
@chriseldredge do you have any updates?
Hi @ewdurbin, unfortunately it looks like the mitigations have not addressed the stale cache bug in a meaningful way. It may be a coincidence, but anecdotally things actually look worse on some recent days than they did in January.
Here's a breakdown of the number of times my replicator gave up after waiting 10 minutes for the serial on the package info page to agree with what the XML-RPC API said was the highest serial at the time:
1 2023-01-28
1 2023-01-29
3 2023-02-06
13 2023-01-24
13 2023-01-26
14 2023-02-01
26 2023-01-30
53 2023-02-02
81 2023-02-09
82 2023-02-08
97 2023-02-07
101 2023-02-12
101 2023-02-16
109 2023-02-11
160 2023-02-10
210 2023-02-13
318 2023-02-15
909 2023-02-14
Valentine's Day was rough!
Oof, that's not looking great at all. @chriseldredge can you help us understand what those numbers mean? Are these all from a single "replicator" and do they include repeats/retries?
We're gonna keep working with Fastly to see if there's something else we can do 😓
Those are counts of the number of times my replicator service gave up after 10 minutes waiting for the package info page to be up to date with xmlrpc. There are many duplicates since sometimes the cache purges appear to fail around the globe. If it is useful, please let me know more specifically what data and format would help you continue to investigate the issue.
I'll reiterate that the problem appears to affect packages with a high number of releases (e.g. checkov), which makes me wonder whether the origin has actually completed whatever asynchronous workflows are involved before the cache purge is requested. In other words, if the purge happens before the next call to the package page would return up-to-date information, the race condition is in the origin.
There are currently no async workflows involved in uploading a package and having the origin serve it.
There's a single http handler that makes changes to the database, commits it, then fires off a purge task.
There's another http handler that reads committed things from the database to serve the simple pages.
It shouldn't be possible for the purge task to fire until after the database has committed it, at which point the origin should already be serving the updated content.
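For illustration, the ordering described above amounts to something like this hypothetical sketch (made-up helper names, not Warehouse's actual code):

```python
# Hypothetical sketch of the ordering described above (not Warehouse's code):
# mutate, commit, and only then fire the purge, so any request that arrives
# after the purge is processed must see the committed rows.

def handle_upload(request):
    project = store_release_files(request)         # hypothetical: writes to the database
    request.db.commit()                            # changes are durable from here on
    enqueue_purge(project_surrogate_key(project))  # hypothetical: purge fires only after commit
    return ok_response()

def handle_simple_page(request):
    # Reads only committed rows, so by the time the purge task runs,
    # the origin is already serving the updated content.
    project = load_committed_project(request.matchdict["name"])  # hypothetical
    return render_simple_page(project)
```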
Just to document this all here, I've been doing some thinking on this problem to try and figure out if we can do something to fix or mitigate it.
I've yet to come up with a reasonable way for us to fix this on our end, unless there's some combination of Fastly features we're unaware of that would make it work, and so far Fastly hasn't been able to get us a resolution either. That pretty much leaves us with mitigating the issue, and I think I've maybe come up with a workable strategy, but it requires some subtle changes to how PyPI functions, so I figured I'd lay it out here and see what people think.
Essentially this problem boils down to the fact that Fastly + Warehouse form a large, complex distributed system that is always going to be eventually consistent to some degree. The hope was that by using purging we could force the CDN to become consistent "right now", which meant we could tune our parameters so that "eventually consistent" could mean quite a long time. Unfortunately, for whatever reason, purging has not been reliable enough for this use case, which means we regularly fall back to "quite a long time", which breaks mirroring until the service becomes consistent again (eventually!).
So the basic thrust of our mitigation here is to retune our TTL, dropping it from 24 hours to a more reasonable value, maybe 10 or 15 minutes, which means that we should expect the system to self-heal and become consistent on its own, even without purging, in 10-15 minutes (or maybe 20-30 minutes with shielding? I'm not sure offhand).
On its own that would mean the number of requests that actually hit the origin servers goes up by quite a lot: each cached object goes from hitting the origin servers every 24h to every 10-30 minutes, roughly a ~150x worst-case increase in requests to the origin servers for some of our heaviest views (if I recall correctly).
Fastly does have a feature to mitigate the impact of this (which we already employ): rather than blindly fetching a whole new response after the existing object expires, it will attempt to re-validate its existing object using a conditional HTTP request, effectively saying "Hey, I want X, but only if its ETag is no longer Z". Warehouse will look at that and either respond "hey, it hasn't changed", a 304 Not Modified with an empty response body telling Fastly to reuse its existing object and treat it as fresh, or "it's changed", a 200 OK with the full response.
Unfortunately, the way Warehouse works is that to process a conditional HTTP request it first runs the view code and does all of the work to generate a response; then, at the last minute, it checks whether the request and response have matching ETags, and if they do it throws away the response it just generated and returns an empty 304 Not Modified. But it still had to do all of the work to generate the response it threw away.
Fundamentally there's no reason why you have to generate the response before processing the conditional request, except that the way Warehouse generates an ETag is to take an md5 hash of the response body [^1], because if you want to generate an ETag generically, there's not really a better way to do it [^2]. That means that, as implemented, we can't know what the ETag of a response is going to be without generating the response.
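For concreteness, the generic body-hash approach is roughly this (a simplified sketch, not Warehouse's exact code):

```python
# Simplified sketch of a generic body-hash ETag: the whole (expensive)
# response body must be generated before the tag is known.
import hashlib

def etag_for(body: bytes) -> str:
    return '"' + hashlib.md5(body).hexdigest() + '"'

body = render_project_json()   # hypothetical: the expensive rendering step
etag = etag_for(body)          # only knowable after the work is already done
```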
However, if we had a lightweight way to predetermine what the ETag was going to be, we could process the conditional request up front and skip generating the response body whenever we're going to return a 304 Not Modified. As it happens, for the pages we really care about here, we could just use the last serial as our ETag (or derive it from the last serial somehow), and we do have a lightweight way to look up the last serial for a project (it's a column on the projects table).
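A rough sketch of what that short-circuit could look like in a Pyramid-style view (hypothetical helper names; request.if_none_match is WebOb's parsed If-None-Match header):

```python
# Rough sketch (not Warehouse's actual view): answer the conditional request
# from the cheap last_serial column before rendering anything expensive.
from pyramid.httpexceptions import HTTPNotModified

def project_json_view(project, request):
    current_etag = str(project.last_serial)      # cheap lookup: a column on projects
    if current_etag in request.if_none_match:    # WebOb parses If-None-Match for us
        return HTTPNotModified()                 # skip generating the body entirely
    response = render_project_json(project, request)  # hypothetical expensive render
    response.etag = current_etag                 # WebOb adds the quotes on output
    return response
```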
That would be great, except there's another wrinkle. We have middleware in Warehouse that handles compression so that we can safely and intelligently enable compression without falling victim to CRIME, and it also reduces our egress to Fastly (at the cost of the CPU overhead to compress).
Obviously, to compress the response body we have to have the response body, so that middleware doesn't function without the full response body. The middleware also has to implement its own ETag scheme, because the ETags for the compressed and uncompressed variants of a response need to differ.
The simplest thing to do would be to remove compression from Warehouse and shift it to the Fastly layer (https://github.com/pypi/infra/pull/120); unfortunately, that means our egress costs would go up. If we did that, we could have Warehouse short-circuit on conditional requests, which would hopefully mean we could reduce our TTL and make this whole problem a lot more self-healing.
I'm experimenting with options to keep the compression in Warehouse and still allow this short circuiting on conditional requests behavior.
One option is to turn our responses into streaming responses, but Pyramid can sometimes implicitly consume a streaming response and make it non-streaming if anything attempts to access the response body. Since that happens implicitly, we wouldn't short-circuit if anything caused it, and it would be non-obvious that it was happening, so it would end up being fairly fragile I think.
I'll see if I can come up with any other options though.
[^1]: Fun fact, this is why the last serial is included in those response bodies: otherwise, changes that incremented the serial but didn't affect the response body would mean that Fastly would resurrect the old cached object, with the old last-serial header, so we had to force the response body to change for every last-serial increment.
[^2]: To be really accurate, you'd also want to hash some or most of the response headers as well, but just doing the body gets you 95% of the way there.
It seems to me that your Fastly cache invalidation does work most of the time, and the cases where my replicator runs into trouble are generally identical to a case I observed earlier today.
My replicator running in us-east-1 got stuck processing changes to spanishconjugator:
2023-02-23T09:25:15Z Expected package spanishconjugator refresh serial to be greater or equal to 16990807 but was 16990805
Looking at the relevant entries in the Warehouse XML-RPC changelog_since_serial API:
Thu Feb 23 09:24:19 2023 spanishconjugator 2.3.4236 serial=16990804 action=new release
Thu Feb 23 09:24:19 2023 spanishconjugator 2.3.4236 serial=16990805 action=add py3 file spanishconjugator-2.3.4236-py3-none-any.whl
Thu Feb 23 09:24:21 2023 spanishconjugator 2.3.4236 serial=16990807 action=add source file spanishconjugator-2.3.4236.tar.gz
This is similar to all the cases I've investigated. It looks like the cache invalidation manages to nudge Fastly to refresh the package pages the first and maybe second time (I don't know offhand whether "new release" triggers invalidation). But two seconds after the py3 file is added, the source file gets added, and that is the change that appears to fail to propagate through Fastly some of the time.
Supposing that the origin Warehouse system is consistent in the absence of a CDN, e.g. that information appearing in changelog_since_serial responses is never newer than what appears in the simple/JSON APIs for a given package, then it seems we can narrow the scope of this problem down to how Fastly behaves as a distributed network when there are overlapping activities. Hypothetically, here's a sequence that could explain what we see:

1. The py3 file is added (serial 16990805); a purge is issued and an edge starts fetching a fresh copy of the package page, which still reflects serial 16990805.
2. The source file is added (serial 16990807) and a second purge is issued, which the edge processes immediately.
3. The in-flight fetch from step 1 completes and is cached at the edge carrying serial 16990805, with no further purge arriving to evict it.

The precise order may not matter; the key concept is that the final cache invalidation instruction gets processed before an in-flight request/response completes.
I'm not familiar with the timing of the cache invalidation requests sent to Fastly, but if this hypothesis is correct, then debouncing invalidation requests could resolve the issue. In the case above, the two events came within 2 seconds of each other, so a debounce threshold of 5 or 10 seconds could be adequate. This can also happen when a maintainer publishes more than one version of their package in quick succession, so the debouncing needs to be aggregated at the package level, not at the release-version level.
Thanks @chriseldredge, that aligns with what I suspect is happening as well. @ewdurbin and I are roughing out an approach to batch purge multiple unique keys with a short delay (essentially, debouncing) to see if that mitigates the issue.
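For reference, a per-key debounce along those lines might look something like the following sketch (hypothetical; send_batch_purge stands in for whatever call issues the batched surrogate-key purge to Fastly):

```python
# Hypothetical sketch of debounced, batched purges aggregated per surrogate key:
# a key is only purged once it has been quiet for DEBOUNCE_SECONDS, so an upload
# burst (wheel + sdist seconds apart) results in a single purge after the last write.
import threading
import time

DEBOUNCE_SECONDS = 5.0
_last_requested: dict[str, float] = {}   # surrogate key -> time purge was last requested
_lock = threading.Lock()

def request_purge(key: str) -> None:
    with _lock:
        _last_requested[key] = time.monotonic()

def purge_loop(send_batch_purge) -> None:
    while True:
        time.sleep(1.0)
        now = time.monotonic()
        with _lock:
            ready = [k for k, t in _last_requested.items() if now - t >= DEBOUNCE_SECONDS]
            for key in ready:
                del _last_requested[key]
        if ready:
            send_batch_purge(ready)   # hypothetical: one batched purge call to Fastly
```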
expose-localhost has been stale since 0:22 UTC.
Is anyone looking into this? The incident has remained unresolved for 8 days.
Hi, can anyone have a look at the package expose-localhost? It seems the JSON API has been giving out stale responses again for about 2 weeks.
2023-04-06 12:26:33,912 ERROR: Error syncing package: expose-localhost@17357317 (mirror.py:369)
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/bandersnatch/package.py", line 61, in update_metadata
self._metadata = await master.get_package_metadata(
File "/usr/local/lib/python3.9/site-packages/bandersnatch/master.py", line 240, in get_package_metadata
metadata_response = await metadata_generator.asend(None)
File "/usr/local/lib/python3.9/site-packages/bandersnatch/master.py", line 158, in get
await self.check_for_stale_cache(path, required_serial, got_serial)
File "/usr/local/lib/python3.9/site-packages/bandersnatch/master.py", line 137, in check_for_stale_cache
raise StalePage(
bandersnatch.master.StalePage: Expected PyPI serial 17357317 for request https://pypi.org//pypi/expose-localhost/json but got 17357314. HTTP PURGE has been issued to the request url
expose-localhost should be consistent again: https://github.com/pypi/warehouse/issues/12933#issuecomment-1499006912
Hi, many thanks for your reply. However, I still don't see the change on my side. I wonder if purging the CDN is needed.
# curl -H 'Fastly-Debug: 1' -sq -I "https://pypi.org//pypi/expose-localhost/json"
HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ https://api.github.com/search/issues https://*.google-analytics.com https://*.analytics.google.com https://*.googletagmanager.com fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://cdn.jsdelivr.net/npm/mathjax@3.2.2/es5/sre/mathmaps/ https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self' https://checkout.stripe.com; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ https://*.google-analytics.com https://*.googletagmanager.com *.fastly-insights.com *.ethicalads.io; script-src 'self' https://*.googletagmanager.com https://www.google-analytics.com https://ssl.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIaxBDEzwGJApJgPEf2mWk6cfMWghrAa6i0=' https://cdn.jsdelivr.net/npm/mathjax@3.2.2/ 'sha256-1CldwzdEg2k1wTmf7s5RWVd7NMXI/7nxxjJM2C4DqII=' 'sha256-0POaN8stWYQxhzjKS+/eOfbbJ/u4YHO5ZagJvLpMypo='; style-src 'self' fonts.googleapis.com *.ethicalads.io 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' 'sha256-JLEjeN9e5dGsz5475WyRaoA4eQOdNPxDIeUhclnJDCE=' 'sha256-mQyxHEuwZJqpxCw3SLmc4YOySNKXunyu2Oiz1r3/wAE=' 'sha256-OCf+kv5Asiwp++8PIevKBYSgnNLNUZvxAp4a7wMLuKA=' 'sha256-h5LOiLhk6wiJrGsG5ItM0KimwzWQH/yAcmoJDJL//bY='; worker-src *.fastly-insights.com
content-type: application/json
etag: "4oVMDHb/pnY9+7kAW0msjg"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key: all-legacy-json project-legacy-json project/expose-localhost
x-pypi-last-serial: 17357314
accept-ranges: bytes
date: Thu, 06 Apr 2023 12:57:34 GMT
fastly-debug-path: (D cache-bkk2310027-BKK 1680785854) (F cache-bkk2310026-BKK 1680785539) (D cache-iad-kcgs7200080-IAD 1680785539) (F cache-iad-kiad7000023-IAD 1680784292)
fastly-debug-ttl: (H cache-bkk2310027-BKK 86084.455 86400.000 316) (H cache-iad-kcgs7200080-IAD 85152.967 86400.000 1247)
fastly-debug-digest: d49e8de3eb792c7fd382a2cd2bfee2ada230a6affc998f25f70eeb1aada7d954
x-served-by: cache-iad-kcgs7200080-IAD, cache-bkk2310027-BKK
x-cache: HIT, HIT
x-cache-hits: 2, 2
x-timer: S1680785854.262743,VS0,VE0
vary: Accept-Encoding, Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 11771
Another report of a stale cache here: https://github.com/pypi/warehouse/pull/13222#issuecomment-1514732599
Hi, can anyone have a look at the package cyaxares? The JSON API is giving a stale serial again:
curl -H 'Fastly-Debug: 1' -sq -I "https://pypi.org/pypi/cyaxares/json"
HTTP/2 200
access-control-allow-headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: X-PyPI-Last-Serial
access-control-max-age: 86400
cache-control: max-age=900, public
content-security-policy: base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ https://api.github.com/search/issues https://*.google-analytics.com https://*.analytics.google.com https://*.googletagmanager.com fastly-insights.com *.fastly-insights.com *.ethicalads.io https://api.pwnedpasswords.com https://cdn.jsdelivr.net/npm/mathjax@3.2.2/es5/sre/mathmaps/ https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self' https://checkout.stripe.com; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ https://*.google-analytics.com https://*.googletagmanager.com *.fastly-insights.com *.ethicalads.io; script-src 'self' https://*.googletagmanager.com https://www.google-analytics.com https://ssl.google-analytics.com *.fastly-insights.com *.ethicalads.io 'sha256-U3hKDidudIaxBDEzwGJApJgPEf2mWk6cfMWghrAa6i0=' https://cdn.jsdelivr.net/npm/mathjax@3.2.2/ 'sha256-1CldwzdEg2k1wTmf7s5RWVd7NMXI/7nxxjJM2C4DqII=' 'sha256-0POaN8stWYQxhzjKS+/eOfbbJ/u4YHO5ZagJvLpMypo='; style-src 'self' fonts.googleapis.com *.ethicalads.io 'sha256-2YHqZokjiizkHi1Zt+6ar0XJ0OeEy/egBnlm+MDMtrM=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' 'sha256-JLEjeN9e5dGsz5475WyRaoA4eQOdNPxDIeUhclnJDCE=' 'sha256-mQyxHEuwZJqpxCw3SLmc4YOySNKXunyu2Oiz1r3/wAE=' 'sha256-OCf+kv5Asiwp++8PIevKBYSgnNLNUZvxAp4a7wMLuKA=' 'sha256-h5LOiLhk6wiJrGsG5ItM0KimwzWQH/yAcmoJDJL//bY='; worker-src *.fastly-insights.com
content-type: application/json
etag: "I6nQY/xK6L6Fa0ceG9h+hw"
referrer-policy: origin-when-cross-origin
server: nginx/1.13.9
surrogate-control: max-age=86400, stale-while-revalidate=300, stale-if-error=86400
surrogate-key: all-legacy-json project-legacy-json project/cyaxares
x-pypi-last-serial: 18070352
accept-ranges: bytes
date: Tue, 06 Jun 2023 06:51:22 GMT
fastly-debug-path: (D cache-bkk2310030-BKK 1686034283) (F cache-bkk2310023-BKK 1686034283) (D cache-iad-kcgs7200148-IAD 1686034283) (F cache-iad-kjyo7100086-IAD 1685537219)
fastly-debug-ttl: (M cache-bkk2310030-BKK - - 0) (H cache-iad-kcgs7200148-IAD 21488.748 86400.000 64911)
fastly-debug-digest: 46e7843fbd1ca1439208c2876738b01e4b0d6f2a454be0efbe3d43964c1d42c7
x-served-by: cache-iad-kcgs7200148-IAD, cache-bkk2310030-BKK
x-cache: HIT, MISS
x-cache-hits: 39, 0
x-timer: S1686034283.603500,VS0,VE295
vary: Accept-Encoding, Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: deny
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
x-permitted-cross-domain-policies: none
content-length: 5098
The expected serial should be 18070374, but 18070352 was returned.
@EWDurbin 2023-06-13: Another instance of #12933
Hi. I'm going to be taking a closer look at this issue again tomorrow with Fastly.
I haven't seen reports unrelated to #12933 here since February. If you have reports of stale caches (not stuck serials), please let me know!
As far as I am currently aware, we are only seeing stuck/stale stuff due to #12933, which #13936 may resolve.
If we are no longer seeing Fastly cache issues, it may be that https://github.com/pypi/warehouse/blame/74d3a4335acc3840f3f718554c7868d704a322eb/warehouse/cache/origin/fastly.py#L131 resolved the issue?
Hi, I'm a developer and some of my workflows involve releasing a new version of my PyPI package and then subsequently using that version. Lately I've been making the following observation:

- When I run pip install mypackage==x.x.x right after a release, the first time it does not work. The second time, it does work.

I came here to report this issue and quickly found this thread. I'm not knowledgeable about some of the terms in this issue, such as "Fastly cache", but I get the feeling that my issue is likely the same as this one. Given that, I thought I should comment here before opening a new issue.
So I suspect you all are already aware and working on this. But in the meantime, is there any recommended workaround to ensure that the metadata I get is always valid? Ideally a workaround that works both for pip and for reading the raw output of the JSON API directly. I read through https://warehouse.pypa.io/api-reference/json.html but didn't find anything helpful. In this thread people are talking about things I'm curious about, such as -H 'Fastly-Debug: 1' and --compressed, but I'm not sure what they mean, and I'm wondering whether those are supposed to ensure that the metadata is valid and up to date?
In the last week, I've seen spanishconjugator run into issues multiple times, along with some other one-offs:
2023-06-14 08:23:01.905 essenceapi
2023-06-14 08:59:29.640 willow
2023-06-15 00:46:11.293 openet-sims
2023-06-15 14:33:11.496 spanishconjugator
2023-06-17 00:34:55.405 pepperize-cdk-github
2023-06-19 20:32:14.270 spanishconjugator
Likely related to #11936, I am seeing stale content served for the URL https://pypi.org/pypi/dbnd-postgres/json.
The following events show in changelog_since_serial today. However, this curl command shows the serial ID did not pick up the most recent event:
Output:
When I repeat this command without the --compressed flag, I see the expected output:
Output:
This indicates to me that cache invalidation did not work.
I see from recent GitHub issues that this has been a recurring problem lately. It is painful, since without manual intervention it won't self-resolve until the 24-hour TTL expires.
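In script form, the check described above (and what the mirroring tools earlier in the thread effectively do) is roughly the following sketch; the project name and expected serial are placeholders:

```python
# Sketch of the staleness check described above: a mirror that already knows
# the newest serial for a project (e.g. from the XML-RPC changelog) compares
# it with the X-PyPI-Last-Serial header on the JSON API response.
import urllib.request

PROJECT = "dbnd-postgres"
EXPECTED_SERIAL = 18_000_000   # placeholder: taken from changelog_since_serial in practice

request = urllib.request.Request(
    f"https://pypi.org/pypi/{PROJECT}/json",
    headers={"Accept-Encoding": "gzip"},   # the compressed variant is the one seen going stale
)
with urllib.request.urlopen(request) as response:
    got_serial = int(response.headers["X-PyPI-Last-Serial"])

if got_serial < EXPECTED_SERIAL:
    print(f"stale: expected {EXPECTED_SERIAL}, got {got_serial}")  # retry, purge, or alert
else:
    print("fresh enough")
```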