pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/

PyPI stale caches #56

Open Xcelled opened 5 years ago

Xcelled commented 5 years ago

At work, we're using a slightly modified older version of Bandersnatch to mirror PyPI. About 10 days ago the mirroring job started failing, a lot, because the PyPI cache we were hitting was stale and refusing to update. After 10 days of failure (during which we fell behind by about 4,000 updates), I finally forced Bandersnatch to hit a different cache by changing some request headers.

All was well for about 24 hours; then that cache started serving stale data too. I've since updated our Bandersnatch code to append a query parameter containing the current timestamp in order to unconditionally bust the caches.

That's obviously a sub-optimal solution, but it got us back up and running. It also got me thinking: instead of just bailing after a certain number of stale page errors, what if Bandersnatch fell back to the cache-bust hack? It could retry 3 times to see if the caches get updated, and if they don't, force a cache bust to retrieve the proper version.
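Roughly what I'm picturing (just a sketch, not actual Bandersnatch code; the function name, the retry count, and the cache-bust parameter name are placeholders I made up):

```python
from datetime import datetime

import requests


class StalePage(Exception):
    """Raised when the CDN keeps returning an older serial than expected."""


def get_with_cache_bust(session: requests.Session, url: str,
                        expected_serial: int, retries: int = 3) -> requests.Response:
    """Retry a few times; on the final attempt, force a cache bust."""
    got = 0
    for attempt in range(retries + 1):
        params = {}
        if attempt == retries:
            # Last resort: a unique query string forces the CDN to miss.
            params["cache-bust"] = datetime.utcnow().isoformat()
        r = session.get(url, params=params, timeout=10)
        r.raise_for_status()
        got = int(r.headers.get("X-PyPI-Last-Serial", 0))
        if got >= expected_serial:
            return r
    raise StalePage(
        f"Expected PyPI serial {expected_serial} for {url} but got {got}"
    )
```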

I'm willing to make a PR for this (I'm trying to upgrade our Bandersnatch at work), but I didn't want to do so without a discussion first. Maybe there's a better way to do this?

cooperlees commented 5 years ago

Hi,

Thanks for reporting this; I have also seen it before. Let me get some other PyPA Warehouse people to read this before we rush to implement your suggestion, as I also want to think about it a little more.

Can I ask:

- What other modifications does your version of Bandersnatch carry?
- Could you share the cache logs / response headers you're seeing?

@ewdurbin @di + @dstufft - Do you have any better ideas on how Bandersnatch can handle stale caches these days with Warehouse? The world has changed since this was last thought about.

Cooper

Xcelled commented 5 years ago

I can get the cache logs for you in a bit. The other modifications we made were just to upload the synced packages to Amazon AWS. Can't imagine that's terribly useful to others.

di commented 5 years ago

Thanks for the report! The XML-RPC API is a special snowflake and handles caching a bit differently than everything else on PyPI, so it's possible this is occasionally failing and we are unaware.

I'm a bit surprised that adding a cache breaker URL param helps here, because it shouldn't have any effect on the cached response AFAIK.

@cooperlees I'm not too familiar with Bandersnatch, so can we be specific about which method is failing here? Is it updated_releases?

Xcelled commented 5 years ago

I can give a better report when I get back to my desk, but it's the call to fetch the package JSON that was failing. I believe the X-PyPI-Last-Serial header was being cached improperly, since the fetched JSON had the new serial; unfortunately, Bandersnatch checks the header.
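For context, the failing check essentially boils down to this (a paraphrase from memory, not the exact Bandersnatch source):

```python
import requests

# Paraphrased sketch of the check, not the exact Bandersnatch source.
expected = 4378749  # serial the changelog says this project should be at
r = requests.get("https://pypi.python.org/pypi/alchemize/json", timeout=10)
got = int(r.headers["X-PyPI-Last-Serial"])
if got < expected:
    # The JSON body may already carry the new serial, but the check trusts
    # the header, so a stale cached header alone fails the sync.
    raise RuntimeError(
        f"Expected PyPI serial {expected} for request {r.url} but got {got}"
    )
```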

Xcelled commented 5 years ago

Okay, here's the full report:

About 12 days ago, our mirror cron job started dying on every run. The logs all showed this exact message as causing the failure: StalePage: Expected PyPI serial 4378749 for request https://pypi.python.org/pypi/alchemize/json but got 4378699. After 10 days of the caches not updating, I started investigating. That's where things got weird.

When I hit https://pypi.python.org/pypi/alchemize/json from my machine, it returned the proper serial. Thinking it was perhaps a problem with the server running the cron job, I SSH'ed into the machine and ran $ curl -svL https://pypi.python.org/pypi/alchemize/json. That also returned the proper serial, though the script running at the same time, on the same machine, was still getting the stale serial. I uttered a certain 3-letter acronym and retried this several times, with the same result.

At a loss, I opened the Python console on the server, imported requests, issued the GET call, and checked the headers. Lo and behold, there was the stale serial. I compared the response headers between requests and curl and noticed they hit different caches: curl got X-Served-By: cache-iad2130-IAD, cache-dca17748-DCA, while requests got X-Served-By: cache-iad2143-IAD, cache-dca17732-DCA. Further investigation showed that it was the IAD cache returning stale data, and, more interestingly, that curl always hit iad2130 and requests always hit iad2143:

>>> for i in range (10): print requests.get('https://pypi.python.org/pypi/alchemize/json').headers['X-Served-By']
... 
cache-iad2143-IAD, cache-dca17748-DCA
cache-iad2143-IAD, cache-dca17739-DCA
cache-iad2143-IAD, cache-dca17722-DCA
cache-iad2143-IAD, cache-dca17730-DCA
cache-iad2143-IAD, cache-dca17751-DCA
cache-iad2143-IAD, cache-dca17722-DCA
cache-iad2143-IAD, cache-dca17739-DCA
cache-iad2143-IAD, cache-dca17731-DCA
cache-iad2143-IAD, cache-dca17738-DCA
cache-iad2143-IAD, cache-dca17742-DCA

I tried some other programs. curl: 2130. requests: 2143. wget: 2130. httpie: 2131. Only requests returned a bad serial, so the problem seemed to lie with the 2143 cache.

I figured there must be something in the HTTP request that was determining which cache to hit. After comparing the GET that requests sent with the GET that curl sent, I noticed there were two differences: the User-Agent and the Accept-Encoding headers. I suspected User-Agent, naturally, but telling curl to use the same User-Agent as requests still caused it to hit 2130. I then tried the Accept-Encoding:

$ curl -sL --head https://pypi.org/pypi/alchemize/json -H 'Accept-Encoding: gzip, deflate' | grep X-Served-By
X-Served-By: cache-iad2143-IAD, cache-dca17723-DCA
$ curl -sL --head https://pypi.org/pypi/alchemize/json | grep X-Served-By
X-Served-By: cache-iad2130-IAD, cache-dca17732-DCA

Boom. And furthermore, on the Python side:

requests.get('https://pypi.python.org/pypi/alchemize/json', headers={'Accept-Encoding': ''}).headers['X-Served-By']
'cache-iad2130-IAD, cache-dca17738-DCA'

I patched our Bandersnatch code to use the empty Accept-Encoding header and reran the script. Success! It no longer complained about stale serials.

But then 24 hours later, the script started dying again with more stale page errors. Instead of trying to change the cache again, I just tweaked the line to this: r = self.session.get(path, timeout=self.timeout, params={'cache-bust': datetime.utcnow()}, headers={'Accept-Encoding': ''}, **kw). That fixed the problem and it hasn't failed since. I did confirm that adding the timestamp causes a cache miss to be reported.
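For completeness, this is roughly how one can confirm the miss (assuming the CDN reports cache status in an X-Cache header, the same way it reports the cache nodes in X-Served-By above):

```python
from datetime import datetime

import requests

# Quick check that the timestamp parameter really forces a cache miss.
url = "https://pypi.python.org/pypi/alchemize/json"
r = requests.get(url,
                 params={"cache-bust": datetime.utcnow().isoformat()},
                 headers={"Accept-Encoding": ""},
                 timeout=10)
print(r.headers.get("X-Cache"))  # expect a MISS here rather than a HIT
```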

cooperlees commented 5 years ago

Thanks for the excellent problem description and root-causing. The people I tagged above can speak to the caching better, and hopefully give us a better solution.

@di - Bandersnatch only uses the following two XML-RPC calls:

Ideally I’d love to get off XML-RPC, as I stated at PyCon. Can we resume the XML-RPC replacement work? I’m happy to help, but I’ve lost momentum with the others.

Is there a cleaner workaround Bandersnatch could use here, or could we root-cause the cache issues in the short term?

di commented 5 years ago

We occasionally see this (namely, the difference between a plain and gzip response), though not quite this bad or frequently. Unfortunately when this happens, it seems to be mostly out of our control and instead an issue with our CDN.

One thing Bandersnatch could do is purge the cache itself if it suspects the cache is stale, by issuing an XPURGE to the stale endpoint. This would clear both the plain and the gzipped cache.
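Something along these lines should do it (an untested sketch; requests happens to accept arbitrary method names, and the URL is just the example from above):

```python
import requests

def purge(url: str) -> None:
    """Ask the CDN to drop its cached copy of `url` (untested sketch)."""
    # requests.request() passes arbitrary method names straight through,
    # so we can issue the non-standard XPURGE verb. A successful purge
    # clears both the plain and the gzipped cached variants.
    resp = requests.request("XPURGE", url, timeout=10)
    resp.raise_for_status()

purge("https://pypi.org/pypi/alchemize/json")
```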

Xcelled commented 5 years ago

@di can anyone purge the cache? If so, I can add that to our codebase so we stop busting caches unnecessarily.

di commented 5 years ago

@Xcelled Yup, anyone can issue an XPURGE to any PyPI page or endpoint (except for XML-RPC).

Xcelled commented 5 years ago

I can make a PR for this as well.

cooperlees commented 5 years ago

Awesome. This sounds much cleaner. Once we merge this I’ll stop being lazy and release 3.0 to PyPI too. Thanks all!

dstufft commented 5 years ago

I would prefer it if clients did not purge PyPI itself. This should generally be treated as a bug in PyPI not something for bandersnatch to take care of.

Xcelled commented 5 years ago

@dstufft is client purging something we can do in the short term until PyPI is fixed? I'm just offering the perspective of a user here: we need bandersnatch to work and the cache problems are significant.

dstufft commented 5 years ago

Eh, I mean technically yes? My main concern with working around it in Bandersnatch is that we otherwise likely won’t even know it’s broken if Bandersnatch is just “fixing” it for us by issuing the purge. We’ll just have degraded performance on requests that we won’t otherwise notice.

cooperlees commented 5 years ago

I feel ya @dstufft - I’ve opened https://github.com/pypa/warehouse/issues/4892 so that we can follow up on debugging and removing this.

Can we somehow get more logging about how often Bandersnatch is hitting this, and then use that to debug why requests keeps hitting it? Let's discuss on the Warehouse issue; I’ll keep this open until we remove the #57 hack.