Digging into memcached graphs, we can see:
IMHO, this is expected: we weren't using memcached intensively before, and now we're using it for an endpoint that receives a lot of traffic. It's the entire point. That shouldn't be a problem.
We never had a very good cache hit ratio, again because we didn't cache a lot of things. But now, we'd expect the cache to be hit quite a lot, as we're caching the collection API for 1 hour, for a given set of query parameters + some specific headers. This is not what we're seeing on the graph:
It's a little hard to see on those screenshots, but the graphs show something weird: the number of items and the used/free memory on the 3 memcached nodes are wildly different. Our nodes are heavily over-provisioned for our usage, but we should still see a relatively uniform distribution. What we see instead is that 0003 has a lot more items and a lot more used (and almost no free) memory:
The eviction graph confirms something weird is going on with node 0003. It keeps evicting (turquoise line) and reclaiming (blue spiky line) keys, especially after the push, way more than the other nodes:
Obviously, if the node is almost at full capacity, it makes sense that it's trying to evict stuff to make some space, but why is the distribution so skewed, and why is it not eventually stabilizing?
Some things it'd be interesting to look at:
Related to that, slab stats are likely going to be relevant. Maybe we're just storing too many things in a single slab somehow and that's causing our evictions.
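If we want to dig into that, python-memcached can pull the raw slab/item stats from each node. A minimal sketch, assuming python-memcached is available and using placeholder hostnames for our three nodes:

```python
# Hedged sketch: dump per-slab-class stats from each memcached node.
# The hostnames below are placeholders, not the real cluster endpoints.
import memcache  # python-memcached

mc = memcache.Client([
    "memcached-0001.example.internal:11211",
    "memcached-0002.example.internal:11211",
    "memcached-0003.example.internal:11211",
])

# "stats slabs" shows how memory pages are assigned to each chunk-size class.
for server, stats in mc.get_stats("slabs"):
    print(server, stats)

# "stats items" shows item counts, ages and eviction counters per class.
for server, stats in mc.get_stats("items"):
    print(server, stats)
```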
While we still don't know why the distribution is so skewed, I suspect the most likely scenario is that the slab(s) that we'd want to use for collections API responses caching on node 0003 are completely full and memcached is struggling to adapt to the new data that's all coming in to that/those slab(s). If that's the case, just restarting the node (pulling it out of the cluster first, which requires a redeploy to do cleanly I think) would be enough. If we go that route we might want to consider the `lru_crawler` and `lru_maintainer` parameters, which can help with this kind of situation.
Edit: 2 additional parameters that might help are `slab_reassign=1` and `slab_automove=1`. From the memcached docs:
Slab Reassign
Long running instances of memcached may run into an issue where all available memory has been assigned to a specific slab class (say items of roughly size 100 bytes). Later the application starts storing more of its data into a different slab class (items around 200 bytes). Memcached could not use the 100 byte chunks to satisfy the 200 byte requests, and thus you would be able to store very few 200 byte items.
The parameters `slab_reassign` and `slab_automove` allow manual and automatic reassignment of memory to help with that. See the docs for more info about the drawbacks.
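To check whether those options are already enabled on a node, "stats settings" can be queried through python-memcached. A rough sketch (the hostname is a placeholder, and the exact key names can vary a little between memcached versions):

```python
# Hedged sketch: inspect the runtime settings of node 0003.
import memcache  # python-memcached

mc = memcache.Client(["memcached-0003.example.internal:11211"])  # placeholder hostname

for server, settings in mc.get_stats("settings"):
    # Key names may differ slightly depending on the memcached version.
    for key in ("slab_reassign", "slab_automove", "lru_crawler", "lru_maintainer_thread"):
        print(server, key, settings.get(key))
```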
We rebooted node 0003 at around 11 am PST and will give it some time and see how it goes.
So 5 hours after we rebooted 0003, the eviction situation seems to have improved.
But the hit vs miss ratio hasn't improved: the number of misses is still much higher than the number of hits. I think that's probably why we still see high CPU usage on the web instances and in the database.
Almost all hits are GET hits, and the misses are split roughly half and half between GET misses and DELETE misses.
I've been investigating this and found something interesting.
We're using django's `@cache_page` decorator, applied to the `list()` method. So when we're calling `list()` because of a `GET` request on the API, we're supposed to check the cache first, return the cached response if any, or generate the response, store it in the cache, and return it as normal.
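For reference, the pattern looks roughly like this. A simplified sketch, not our actual viewset: the class name is hypothetical and the queryset/serializer are omitted, but the 1 hour timeout matches what we deploy:

```python
# Hedged sketch of caching a DRF list() endpoint with django's cache_page.
from django.utils.decorators import method_decorator
from django.views.decorators.cache import cache_page
from rest_framework import viewsets


class CollectionAddonViewSet(viewsets.ReadOnlyModelViewSet):  # hypothetical name
    # queryset / serializer_class omitted for brevity.

    @method_decorator(cache_page(60 * 60))  # cache the rendered response for 1 hour
    def list(self, request, *args, **kwargs):
        return super().list(request, *args, **kwargs)
```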
The "thing" we store in the cache is a full HttpResponse
object, pickled. For regular django's HttpResponse
s most of the time it wouldn't matter that much - the response needs to be rendered first (which generates the _container
attribute that is used by the content
property), there would be some overhead but not that much. In our case though, it's a rest_framework's Response
, not a django's HttpResponse
. It initially contains a data
attribute with a dictionary containing all the data to be returned, but once rendered it also contains the bytes content in the _container
attribute. So the size of the pickled response is roughly twice as big as the returned content.
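To illustrate the doubling, here's a toy stand-in (not the real rest_framework class) that keeps the same two attributes a rendered `Response` carries:

```python
# Toy illustration only: FakeResponse mimics a rendered DRF Response, which keeps
# both the original `data` dict and the rendered bytes in `_container`.
import json
import pickle


class FakeResponse:
    def __init__(self, data):
        self.data = data                               # original Python data, kept after rendering
        self._container = [json.dumps(data).encode()]  # rendered JSON bytes, as HttpResponse stores them


data = {
    "results": [
        {"name": {"en-US": f"Add-on {i}", "fr": f"Extension {i}", "de": f"Erweiterung {i}"}}
        for i in range(5000)
    ]
}
response = FakeResponse(data)

content_size = len(response._container[0])
pickled_size = len(pickle.dumps(response))
print(content_size, pickled_size)  # the pickled object is roughly twice the rendered content
```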
It matters here because the uncompressed size of the JSON is around 673K - because Fenix requests the collection without passing a language, we return all known translations for those add-ons. So the pickled response is about 1.3 MB, which is too big to store in memcached by default - the default item size limit is 1 MB. `python-memcached` silently fails to store the data in memcached, and we never get a cache hit the next time around - the data has never been stored in the cache.
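A minimal sketch of that silent failure, assuming a memcached instance is reachable on the default local port:

```python
# Hedged sketch: python-memcached won't store values above its 1 MB default limit,
# and it doesn't raise - set() just returns a falsy value.
import memcache  # python-memcached

mc = memcache.Client(["127.0.0.1:11211"])  # placeholder address

big_value = "x" * (1300 * 1024)  # ~1.3 MB, over the default 1 MB item size limit
ok = mc.set("collection-api-response", big_value)

print(ok)                                  # falsy: no exception, the value simply isn't stored
print(mc.get("collection-api-response"))   # None: every subsequent read is a cache miss
```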
This would explain the cache hit/miss ratio: while we do have a long tail of add-ons and collections, the ratio was weird because it was so consistent. You'd expect that eventually, even with a super long tail, we'd end up successfully caching something and improving the ratio. But we likely aren't, because the vast majority of cache requests are for that specific response we aren't managing to store in the cache.
The one thing this theory does not explain though, is that we've seen cached responses from prod a few times, so there is still something we are missing.
I think that's because when we tested, we looked at collections in general, not the specific one Fenix is using, so it worked (sometimes, when the data didn't get evicted from the cache) because the collection used wasn't as large.
QA: try https://github.com/mozilla/addons/issues/8086 again, especially with collections containing lots of add-ons (ideally add-ons which themselves contain lots of data - we want to test with larger responses, and large translations in multiple languages are the best way to bump up the size).
@diox I've created two large collections [1] [2] on -dev and tried to include add-ons with multiple translations as much as possible.
For the caching part, I have the same conclusions as in https://github.com/mozilla/addons/issues/8086
I've also monitored the response times and these are the numbers that I obtained:
If the data used is still not large enough, please let me know and I will bump the translation sizes some more.
That looks ok. The first collection is right around the size we want to check, and the numbers do indicate caching is working so that's good. Maybe bump the size slightly (if you download the json it's currently at 507K, having it at 600K would be closer to production and give us slightly more confidence) and retest.
I've tried to increase the collection size a bit and repeated the API request while also monitoring timings. The response times were still in the range mentioned in the comment above.
We deployed 2020.12.03 yesterday as normal, expecting the cache added for the Collections Add-ons API endpoint to help with the scaling issues we noticed recently.
Unfortunately, that did not work as expected. We can see on this graph the endpoint remains slow:
When testing manually via `curl`, we can see we're being served a cached response when the `Expires` header points to a date that's less than one hour in the future. Here is what it can look like on stage, where it does work as expected:

We did see that happening in prod, but not as often as we'd like. Most of the time we appear to be served fresh content. We asked Wei for some sample URLs from nginx and they all looked normal.
There are some minor variations in the `User-Agent`, which would cause the cache key to be different (we `Vary` on it, which we shouldn't do, but that's a topic for another day), but not that many.

Because local envs, dev & stage appear to be working normally, we suspect there is something wrong with prod.