sourcegraph / sourcegraph-public-snapshot


searcher: not respecting on disk cache limit #34828

Closed. keegancsmith closed this issue 2 years ago.

keegancsmith commented 2 years ago

There are instances at customers of searcher being evicted by Kubernetes. We believe the root cause is searcher not respecting its cache limits. The purpose of this issue is to track the root-causing of this, provide workarounds, and ultimately resolve it.

Some related issues:

sourcegraph-bot-2 commented 2 years ago

Heads up @jjeffwarner - the "team/search-core" label was applied to this issue.

keegancsmith commented 2 years ago

I feel that, given the impact on prospects and customers, my familiarity with searcher, and the fact that I am on support on-call, I should take a look at this issue this week.

keegancsmith commented 2 years ago

Alright, sorry about the delay, but I finally looked into this. From what I can tell, searcher/symbols is behaving mostly correctly. I've fixed some minor things and added some extra metrics. Once the metrics are out I'll overhaul the searcher dashboards.

I think this is a problem around documentation. Searcher monitors the usage of the cache dir every 10s and evicts entries if we are over the limit. The problem here is that Kubernetes is also configured to evict the pod if we go over that same limit. This means searcher pretty much never gets a chance to evict anything itself, since k8s will evict us first, given how we are configured:

https://github.com/sourcegraph/deploy-sourcegraph/blob/35a3f8b3d19f7770f4e7a00cc57369c6946c6445/base/searcher/searcher.Deployment.yaml#L33-L83
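
To make the timing issue concrete, here is a minimal sketch of the check-then-evict pattern described above; the package, function, and variable names are illustrative and are not the actual diskcache API:

```go
package diskcachesketch // illustrative sketch; not the real internal/diskcache package

import (
	"context"
	"io/fs"
	"path/filepath"
	"time"
)

// dirSize returns the total size in bytes of regular files under dir.
func dirSize(dir string) int64 {
	var total int64
	_ = filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return nil
		}
		if info, infoErr := d.Info(); infoErr == nil {
			total += info.Size()
		}
		return nil
	})
	return total
}

// evictLoop checks cache usage every 10s and only evicts after the limit
// has already been exceeded. Between ticks the cache can grow past
// maxBytes, which is exactly the window in which Kubernetes may decide to
// evict the whole pod first.
func evictLoop(ctx context.Context, cacheDir string, maxBytes int64, evictLRU func(bytes int64)) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if used := dirSize(cacheDir); used > maxBytes {
				evictLRU(used - maxBytes) // free at least this many bytes, oldest entries first
			}
		}
	}
}
```

The gap between ticks is what matters: if enough archives are written within those 10s, the pod crosses the ephemeral-storage limit before the loop ever gets a chance to run.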

So we have a few directions to go in:

I'm unsure which direction is more sensible. The k8s definition is quite nice since it doesn't require configuration. I'm not sure you can do more complicated math in YAML, nor whether you'd want to. So the better solution is likely to adjust how diskcache works.

Making sure diskcache never goes over the limit is quite a bit of work and will likely fail in weird ways. Making it target 90% of the limit (or something similar) is likely the easiest option.
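
A hedged sketch of that "target 90%" option, reusing the hypothetical names from the sketch above: instead of evicting back to exactly the limit, evict down to a low-water mark so that growth between checks has somewhere to go (the 0.9 fraction is just a placeholder):

```go
// targetFraction is a hypothetical low-water mark: after an eviction pass
// the cache should sit at roughly 90% of the configured limit.
const targetFraction = 0.9

// maybeEvict brings usage down to the low-water mark whenever it has been
// crossed, leaving headroom below the hard Kubernetes limit for growth
// between the 10s checks.
func maybeEvict(used, maxBytes int64, evictLRU func(bytes int64)) {
	lowWater := int64(float64(maxBytes) * targetFraction)
	if used > lowWater {
		evictLRU(used - lowWater)
	}
}
```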

Additionally, we need to document somewhere that this is how it works. I'm not sure of the best place for it such that CEs/admins learn about this.

Also cc @daxmc99 and @caugustus-sourcegraph for their opinions. They have both touched the lines above, which switched us to using the full space available rather than a hardcoded number.

caugustus-sourcegraph commented 2 years ago

I'm in favor of adjusting the Kubernetes implementation/recommendations to aim for a percentage (we should be able to calculate this, but if not that's fine). My only concern here is whether that is sufficient: according to @ggilmore here, the gap needed varies based on the customer's largest repository size. Is that still accurate? Should we just increase the default gap even further to make this work out of the box for more customers?

daxmc99 commented 2 years ago

> Adjust the implementation of diskcache to aim for x% full? (90%?)

I think this is the best approach. The default of 25G (with a limit of 26G, where we get evicted) seems like it should be enough. If the service is continually being evicted we could ask the customer to horizontally scale.

> I think this is a problem around documentation. Searcher monitors the usage of the cache dir every 10s and evicts entries if we are over the limit.

Ephemeral storage also includes logs, writable layers, etc. K8s uses a scanning technique, so it's not the exact second we cross 26G.

I'm surprised we are still being evicted. Do we grow by more than 1G in 10 seconds? Would more breathing room between request and limit help?

Request is what we need to be scheduled. Limit is how much we promise not to exceed.

This is why we set the cache to match the request and not the limit.
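
A quick back-of-the-envelope with the figures quoted in this thread (25G cache, 26G limit), just to make the remaining headroom explicit:

```go
package main

import "fmt"

func main() {
	const (
		ephemeralLimitGB = 26 // Kubernetes ephemeral-storage limit; the pod is evicted above this
		cacheSizeGB      = 25 // searcher cache size, matched to the ephemeral-storage request
	)
	// Logs, the writable container layer, and /tmp all count against the
	// same ephemeral-storage limit, so they must fit in whatever is left.
	fmt.Printf("headroom outside the cache: %dGB\n", ephemeralLimitGB-cacheSizeGB) // prints: 1GB
}
```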

keegancsmith commented 2 years ago

After reading the above, I'm leaning towards making diskcache smarter so we never go over the threshold.

Do we have a cluster right now that I can inspect which has searcher getting evicted? I'd like to root-cause this properly; so far my investigation has been playing around with a live system and seeing what happens (but not reproducing the issue), plus reading code.

> according to @ggilmore here, the gap needed varies based on the customer's largest repository size. Is that still accurate? Should we just increase the default gap even further to make this work out of the box for more customers?

Yes, that is correct, and a good observation. When I initially set this stuff up a long time ago, I left quite a lot of breathing room (multiple GB). Given that the large monorepo working copies we see in practice are in the single-digit gigabytes, maybe the approach we should take is to just leave 20GB of breathing room?

> Adjust the implementation of diskcache to aim for x% full? (90%?)

> I think this is the best approach. The default of 25G (with a limit of 26G, where we get evicted) seems like it should be enough. If the service is continually being evicted we could ask the customer to horizontally scale.

Is that how much our default is? BTW, this could be an issue if a customer has a large monorepo; you could reach that 25GB quite quickly. Our default value in code (and the one we use on cloud/dogfood) is 100GB.

> I think this is a problem around documentation. Searcher monitors the usage of the cache dir every 10s and evicts entries if we are over the limit.

> Ephemeral storage also includes logs, writable layers, etc. K8s uses a scanning technique, so it's not the exact second we cross 26G.

> I'm surprised we are still being evicted. Do we grow by more than 1G in 10 seconds? Would more breathing room between request and limit help?

Possible. I can imagine that with Code Insights running a lot of queries we can grow quite quickly. Maybe we do in fact need a smarter diskcache module, so that it doesn't react to exceeding the threshold but instead ensures we never go over it?
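
For illustration, one possible shape for such a smarter diskcache is to reserve space before admitting a new entry rather than reacting after the fact. The type and method names below are hypothetical, and in practice the final size of an entry often isn't known up front, which is part of what makes this "quite a bit of work":

```go
package diskcachesketch // hypothetical; not the real internal/diskcache package

import (
	"fmt"
	"sync"
)

// boundedCache is an admission-controlled cache sketch: space is reserved
// before an entry is written, so on-disk usage cannot exceed maxBytes even
// between background checks.
type boundedCache struct {
	mu       sync.Mutex
	maxBytes int64
	used     int64
	// evictLRU frees at least the requested number of bytes (oldest
	// entries first) and returns how many bytes were actually freed.
	evictLRU func(bytes int64) int64
}

// reserve accounts for an entry of size n before it is written, evicting
// older entries if needed. It fails if the entry can never fit.
func (c *boundedCache) reserve(n int64) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if n > c.maxBytes {
		return fmt.Errorf("entry of %d bytes exceeds cache limit of %d bytes", n, c.maxBytes)
	}
	if over := c.used + n - c.maxBytes; over > 0 {
		c.used -= c.evictLRU(over)
	}
	if c.used+n > c.maxBytes {
		return fmt.Errorf("unable to free enough space for an entry of %d bytes", n)
	}
	c.used += n
	return nil
}
```

The trade-off is that every write path has to know (or conservatively estimate) its size up front, and failed reservations need a sensible fallback, which is where the "fail in weird ways" concern comes from.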

keegancsmith commented 2 years ago

Alright, change of direction: we identified the almost certain root cause. We were writing to /tmp. We assumed the eviction was driven by CACHE_DIR usage, but it is likely /tmp. We now point TMPDIR at CACHE_DIR, which I think should drastically reduce how often this occurs. https://github.com/sourcegraph/sourcegraph/pull/36160 closes this, but if we still see it happening in the next release, re-open this issue and we can adjust our eviction strategy.
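
For context, the general technique in that fix is roughly the following; this is a sketch of the idea, not the code from the linked PR:

```go
package tmpdirsketch // sketch of the technique, not the code in the PR

import (
	"os"
	"path/filepath"
)

// useCacheDirForTmp points TMPDIR at a directory inside the cache dir, so
// temporary files count against the disk usage searcher already monitors
// and cleans up, instead of silently filling /tmp.
func useCacheDirForTmp(cacheDir string) (string, error) {
	tmp := filepath.Join(cacheDir, ".tmp") // hypothetical subdirectory name
	if err := os.MkdirAll(tmp, 0o700); err != nil {
		return "", err
	}
	if err := os.Setenv("TMPDIR", tmp); err != nil {
		return "", err
	}
	return tmp, nil
}
```

On Unix, Go's os.TempDir and os.CreateTemp honour TMPDIR, as do most external tools, so anything written "to temp" then lands under the cache dir where the existing eviction accounting can see it.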

daxmc99 commented 2 years ago

@keegancsmith The dogfood cluster has several searcher containers that can be used for testing. Happy to help set this up so it's easier to use for debugging.

keegancsmith commented 2 years ago

@daxmc99 great, it would be great to try to reproduce this and, if we do, adjust our implementation. Tell me more please :)