userstyles-world / userstyles.world

⭐ Website to browse and share UserCSS userstyles. A modern replacement for UserStyles.org made by the userstyles community.
https://userstyles.world
GNU Affero General Public License v3.0

Add caching to code endpoint #234

Closed: vednoc closed this 1 year ago

vednoc commented 1 year ago

Describe the change you'd like:

This endpoint accounts for roughly 2/3 of total memory allocations^1, and it would be great to optimize it.

Additional context:

We had an LRU cache in front of this endpoint until we removed it in https://github.com/userstyles-world/userstyles.world/commit/7b8d78d6f7ebebb233890ec7a29104f4100ba556 due to a memory leak. A bit over two weeks ago, we switched to serving files instead of running database queries in https://github.com/userstyles-world/userstyles.world/commit/91041a77965d78edcc734ff189ebce64b884e75a. While that's a lot simpler and should lessen the load on our database, it caused GC to go from running every 70–75 seconds to running every 40–45 seconds. It looks like some of our database queries were being cached, which resulted in faster responses and fewer memory allocations.
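For illustration only, here's a minimal standard-library sketch of the serve-a-file-instead-of-querying approach; the handler name, URL shape, and directory layout are made up for this sketch and are not the project's actual code:

```go
package main

import (
	"net/http"
	"path/filepath"
	"strings"
)

// codeHandler serves a pre-rendered UserCSS file straight from disk instead of
// building the response from a database query on every request. The URL shape
// and the data/styles directory are assumptions for this sketch.
func codeHandler(w http.ResponseWriter, r *http.Request) {
	name := strings.TrimPrefix(r.URL.Path, "/api/style/")
	if name == "" || strings.ContainsAny(name, `/\`) || strings.Contains(name, "..") {
		http.NotFound(w, r)
		return
	}
	w.Header().Set("Content-Type", "text/css; charset=utf-8")
	http.ServeFile(w, r, filepath.Join("data/styles", name))
}

func main() {
	http.HandleFunc("/api/style/", codeHandler)
	http.ListenAndServe(":3000", nil)
}
```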

Unfortunately, I can't seem to find a flamegraph from a few months ago, but I do remember that GC used to run roughly every 100 seconds. The goal is to push that number back up, as close as possible to the 120-second maximum (Go's runtime forces a GC cycle at least once every two minutes). I've been working on improving overall performance in between these changes, which makes it hard to say with pinpoint accuracy how much of an effect each one had. Most of that work was also aimed at resolving the regular CPU spikes^2, which are no longer an issue^3.

Back in February, as well as before that, our memory usage was pretty high. The earliest records in Grafana show 270MB, back before the LRU cache was removed on the 5th of March^4. There were times it was even higher, but I went ahead and set GOMEMLIMIT (a feature of Go 1.19) to help keep it under control. In the last month, our RAM usage has dropped significantly^5. We also received a lot of deliberate spam, which continues to this day, but I've taken care of it in a way that doesn't even reach USw's process.
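For reference, the soft limit can be set either via the GOMEMLIMIT environment variable or at runtime through runtime/debug; a minimal sketch, where the 256MiB value is purely illustrative and not our production setting:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOMEMLIMIT=256MiB.
	// The value here is illustrative only; pick a limit based on the
	// memory actually available to the process.
	prev := debug.SetMemoryLimit(256 << 20) // limit is given in bytes
	fmt.Printf("previous soft memory limit: %d bytes\n", prev)
}
```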

The idea is for GC to run less often and collect less garbage^6. An LRU cache doesn't seem optimal for this role, but it's easy to implement and a pretty decent start nonetheless. An LFU cache looks like a better fit, though there are other caches we could utilize as well. I've started writing my own LRU cache, and I'll write an LFU cache too if the LRU cache isn't sufficient. I've also tried a couple of third-party LRU- and LFU-like caches, which resulted in undesired behavior that made me question my sanity.
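For context, here's a minimal sketch of the kind of LRU cache described above, built on container/list and a map; the type names, byte-slice values, and lack of locking are simplifications for the sketch, not the actual implementation:

```go
package cache

import "container/list"

type entry struct {
	key   string
	value []byte
}

// LRU is a tiny fixed-capacity least-recently-used cache.
// It is not safe for concurrent use; a real handler would wrap it in a mutex.
type LRU struct {
	cap   int
	ll    *list.List               // front = most recently used
	items map[string]*list.Element // key -> list element holding *entry
}

func NewLRU(capacity int) *LRU {
	return &LRU{
		cap:   capacity,
		ll:    list.New(),
		items: make(map[string]*list.Element, capacity),
	}
}

func (c *LRU) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.ll.MoveToFront(el) // mark as most recently used
	return el.Value.(*entry).value, true
}

func (c *LRU) Set(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		el.Value.(*entry).value = value
		return
	}
	c.items[key] = c.ll.PushFront(&entry{key: key, value: value})
	if c.ll.Len() > c.cap {
		// Evict the least recently used entry from the back of the list.
		last := c.ll.Back()
		c.ll.Remove(last)
		delete(c.items, last.Value.(*entry).key)
	}
}
```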

vednoc commented 1 year ago

18 hours after deploying this commit, GC runs every 98 seconds on average, and the initial flamegraph^1 shows a pretty decent drop in total memory allocations for the code endpoint^2, going from 65% down to 50%. We started with 100 items in the cache, which I plan to double or even triple over the next week or so. Memory usage has increased slightly as well^3, and will continue to increase with more cached items, but that's a good trade-off as long as total memory allocations continue to go down. It also looks like the GC is running less often and collecting more garbage per cycle, which is why the memory usage chart looks spikier than before^4. I'm still learning and getting a feel for how everything works, so my interpretation could be missing the mark.

vednoc commented 1 year ago

19 hours after increasing the cache size to 200 items, GC runs every 108 seconds on average, and total memory allocations for the code endpoint dropped once again^1, going from 50% down to 37.5%. It was closer to 35% early on. Next up, I'll increase the cache to 300 items. Memory usage remained roughly the same^2. However, we can now monitor the size of this cache^3, as sketched below. We didn't have that during the first run with 100 items, but it should provide helpful insights as we continue increasing its size.
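For illustration, one way to expose the cache size is a Prometheus GaugeFunc, assuming client_golang is what feeds the Grafana dashboards and building on the hypothetical LRU type from the earlier sketch; the metric name and Len method are made up here:

```go
package cache

import "github.com/prometheus/client_golang/prometheus"

// Len reports the number of entries currently held in the cache.
func (c *LRU) Len() int { return c.ll.Len() }

// RegisterMetrics exposes the current cache size as a gauge so it can be
// graphed alongside GC frequency and memory usage. The metric name is an
// assumption for this sketch.
func RegisterMetrics(c *LRU) {
	prometheus.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "usw_code_cache_items",
			Help: "Number of items currently held in the code endpoint cache.",
		},
		func() float64 { return float64(c.Len()) },
	))
}
```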

vednoc commented 1 year ago

The result after 19 hours with 300 items in the cache looks decent^1. Total allocations went down from 37.5% to 31.8%, and the average GC interval went up by another 3 seconds since last time, to 111 seconds. Memory usage is more or less the same, and the cache size has increased^2. Of course, metrics from production can vary a lot, but the pattern is pretty clear. It also looks like diminishing returns have set in, so I'll stop here for now and shift my focus to other areas that deserve some attention.