Digging into memcached graphs, we can see:
IMHO, this is expected: we weren't using memcached intensively before, and now we're using it for an endpoint that receives a lot of traffic. It's the entire point. That shouldn't be a problem.
We never had a very good cache hit ratio, again because we didn't cache a lot of things. But now, we'd expect the cache to be hit quite a lot, as we're caching the collection API for 1 hour, for a given set of query parameters + some specific headers. This is not what we're seeing on the graph:
It's a little hard to see on those screenshots, but the graphs show something weird: the number of items and the used/free memory on the 3 memcached nodes are wildly different. Our nodes are heavily over-provisioned for our usage, but we should still see a relatively uniform distribution. What we see instead is that 0003 has a lot more items and a lot more used (and almost no free) memory:
The eviction graph confirms something weird is going on with node 0003. It keeps evicting (turquoise line) and reclaiming (blue spiky line) keys, especially after the push, way more than the other nodes:
Obviously, if the node is almost at full capacity, it makes sense that it's trying to evict stuff to make some space, but why is the distribution so skewed, and why is it not eventually stabilizing?
Some things it'd be interesting to look at:
Related to that, slab stats are likely going to be relevant. Maybe we're just storing too many things in a single slab somehow and that's causing our evictions.
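If we want to dig into that, python-memcached can pull the raw slab/item stats from each node. A minimal sketch, assuming python-memcached is available and using placeholder hostnames for our three nodes:

```python
# Hedged sketch: dump per-slab-class stats from each memcached node.
# The hostnames below are placeholders, not the real cluster endpoints.
import memcache  # python-memcached

mc = memcache.Client([
    "memcached-0001.example.internal:11211",
    "memcached-0002.example.internal:11211",
    "memcached-0003.example.internal:11211",
])

# "stats slabs" shows how memory pages are assigned to each chunk-size class.
for server, stats in mc.get_stats("slabs"):
    print(server, stats)

# "stats items" shows item counts, ages and eviction counters per class.
for server, stats in mc.get_stats("items"):
    print(server, stats)
```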
While we still don't know why the distribution is so skewed, I suspect the most likely scenario is that the slab(s) that we'd want to use for collections API responses caching on node 0003 are completely full and memcached is struggling to adapt to the new data that's all coming in to that/those slab(s). If that's the case, just restarting the node (pulling it out of the cluster first, which requires a redeploy to do cleanly I think) would be enough. If we go that route we might want to consider the `lru_crawler` and `lru_maintainer` parameters, which can help with this kind of situation.
Edit: 2 additional parameters that might help are `slab_reassign=1` and `slab_automove=1`. From the memcached docs:
Slab Reassign
Long running instances of memcached may run into an issue where all available memory has been assigned to a specific slab class (say items of roughly size 100 bytes). Later the application starts storing more of its data into a different slab class (items around 200 bytes). Memcached could not use the 100 byte chunks to satisfy the 200 byte requests, and thus you would be able to store very few 200 byte items.
The parameters `slab_reassign` and `slab_automove` allow manual and automatic reassignment of memory to help with that. See the docs for more info about the drawbacks.
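To check whether those options are already enabled on a node, "stats settings" can be queried through python-memcached. A rough sketch (the hostname is a placeholder, and the exact key names can vary a little between memcached versions):

```python
# Hedged sketch: inspect the runtime settings of node 0003.
import memcache  # python-memcached

mc = memcache.Client(["memcached-0003.example.internal:11211"])  # placeholder hostname

for server, settings in mc.get_stats("settings"):
    # Key names may differ slightly depending on the memcached version.
    for key in ("slab_reassign", "slab_automove", "lru_crawler", "lru_maintainer_thread"):
        print(server, key, settings.get(key))
```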
We rebooted node 0003 at around 11 am PST and will give it some time and see how it goes.
So 5 hours after we rebooted 0003, the eviction situation seems to have improved.
But the hit vs miss ratio hasn't improved: the number of misses is still much higher than the number of hits. I think that's probably why we still see high CPU usage on the web instances and in the database.
Almost all hits are GET hits, and the misses are split roughly half and half between GET misses and DELETE misses.
I've been investigating this and found something interesting.
We're using django's `@cache_page` decorator, applied to the `list()` method. So when we're calling `list()` because of a `GET` request on the API, we're supposed to check the cache first, return the cached response if any, or generate the response, store it in the cache, and return it as normal.
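For reference, the pattern looks roughly like this. A simplified sketch, not our actual viewset: the class name is hypothetical and the queryset/serializer are omitted, but the 1 hour timeout matches what we deploy:

```python
# Hedged sketch of caching a DRF list() endpoint with django's cache_page.
from django.utils.decorators import method_decorator
from django.views.decorators.cache import cache_page
from rest_framework import viewsets


class CollectionAddonViewSet(viewsets.ReadOnlyModelViewSet):  # hypothetical name
    # queryset / serializer_class omitted for brevity.

    @method_decorator(cache_page(60 * 60))  # cache the rendered response for 1 hour
    def list(self, request, *args, **kwargs):
        return super().list(request, *args, **kwargs)
```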
The "thing" we store in the cache is a full HttpResponse
object, pickled. For regular django's HttpResponse
s most of the time it wouldn't matter that much - the response needs to be rendered first (which generates the _container
attribute that is used by the content
property), there would be some overhead but not that much. In our case though, it's a rest_framework's Response
, not a django's HttpResponse
. It initially contains a data
attribute with a dictionary containing all the data to be returned, but once rendered it also contains the bytes content in the _container
attribute. So the size of the pickled response is roughly twice as big as the returned content.
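To illustrate the doubling, here's a toy stand-in (not the real rest_framework class) that keeps the same two attributes a rendered `Response` carries:

```python
# Toy illustration only: FakeResponse mimics a rendered DRF Response, which keeps
# both the original `data` dict and the rendered bytes in `_container`.
import json
import pickle


class FakeResponse:
    def __init__(self, data):
        self.data = data                               # original Python data, kept after rendering
        self._container = [json.dumps(data).encode()]  # rendered JSON bytes, as HttpResponse stores them


data = {
    "results": [
        {"name": {"en-US": f"Add-on {i}", "fr": f"Extension {i}", "de": f"Erweiterung {i}"}}
        for i in range(5000)
    ]
}
response = FakeResponse(data)

content_size = len(response._container[0])
pickled_size = len(pickle.dumps(response))
print(content_size, pickled_size)  # the pickled object is roughly twice the rendered content
```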
It matters here because the uncompressed size of the JSON is around 673K - because Fenix requests the collection without passing a language, we return all known translations for those add-ons. So the pickled response is about 1.3 MB, which is too big to store in memcached by default - the default item size limit is 1 MB. `python-memcached` silently fails to store the data in memcached, and we never get a cache hit the next time around - the data has never been stored in the cache.
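A minimal sketch of that silent failure, assuming a memcached instance is reachable on the default local port:

```python
# Hedged sketch: python-memcached won't store values above its 1 MB default limit,
# and it doesn't raise - set() just returns a falsy value.
import memcache  # python-memcached

mc = memcache.Client(["127.0.0.1:11211"])  # placeholder address

big_value = "x" * (1300 * 1024)  # ~1.3 MB, over the default 1 MB item size limit
ok = mc.set("collection-api-response", big_value)

print(ok)                                  # falsy: no exception, the value simply isn't stored
print(mc.get("collection-api-response"))   # None: every subsequent read is a cache miss
```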
This would explain the cache hit/miss ratio: while we do have a long tail of add-ons and collections, the ratio was weird because it was so consistent. You'd expect that eventually, even with a super long tail, we'd end up successfully caching something and improving the ratio. But we likely aren't, because the vast majority of cache requests are for that specific response we aren't managing to store in the cache.
The one thing this theory does not explain though, is that we've seen cached responses from prod a few times, so there is still something we are missing.
I think that's because when we tested, we looked at collections in general, not the specific one Fenix is using, so it worked (sometimes, when the data didn't get evicted from the cache) because the collection used wasn't as large.
QA: try https://github.com/mozilla/addons/issues/8086 again, especially with collections containing lots of add-ons (ideally add-ons which themselves contain lots of data - we want to test with larger responses, and large translations in multiple languages are the best way to bump up the size).
@diox I've created two large collections [1] [2] on -dev and tried to include add-ons with multiple translations as much as possible.
For the caching part, I have the same conclusions as in https://github.com/mozilla/addons/issues/8086
I've also monitored the response times and these are the numbers that I obtained:
If the data used is still not large enough, please let me know and I will bump the translation sizes some more.
That looks ok. The first collection is right around the size we want to check, and the numbers do indicate caching is working so that's good. Maybe bump the size slightly (if you download the json it's currently at 507K, having it at 600K would be closer to production and give us slightly more confidence) and retest.
I've tried to increase the collection size a bit and repeated the API request while also monitoring timings. The response times were still in the range mentioned in the comment above.
We deployed 2020.12.03 yesterday as normal, expecting the cache added for the Collections Add-ons API endpoint to help with the scaling issues we noticed recently.
Unfortunately, that did not work as expected. We can see on this graph the endpoint remains slow:
When testing manually via `curl`, we can see we're being served a cached response when the `Expires` header points to a date that's less than one hour in the future. Here is what it can look like on stage, where it does work as expected:

We did see that happening in prod, but not as often as we'd like. Most of the time we appear to be served fresh content. We asked Wei for some sample URLs from nginx and they all looked normal.
There are some minor variations in the `User-Agent`, which would cause the cache key to be different (we `Vary` on it, which we shouldn't do, but that's a topic for another day), but not that many.

Because local envs, dev & stage appear to be working normally, we suspect there is something wrong with prod.