sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Sourcegraph.com zoekt needs more resources #12455

Closed slimsag closed 3 years ago

slimsag commented 4 years ago

These alerts are firing frequently but may have already been addressed by Keegan:

"warning_zoekt_indexserver_provisioning_container_cpu_usage_5m", "warning_zoekt_indexserver_provisioning_container_cpu_usage_7d_high", "warning_zoekt_webserver_provisioning_container_cpu_usage_7d_high", "warning_zoekt_webserver_provisioning_container_memory_usage_7d_high",

Regardless, I have silenced them as they are still firing/noisy. Please fix, confirm the alerts are no longer firing, and then unsilence them: https://github.com/sourcegraph/deploy-sourcegraph-dot-com/blob/e33d7cd48e9407aac88124eec89644dd4d51699c/base/frontend/sourcegraph-frontend.ConfigMap.yaml#L5271-L5275

bobheadxi commented 4 years ago

Following some of the linked PRs, it seems like:

cc @sourcegraph/search

keegancsmith commented 4 years ago

Here is the graph over the last 14d:

[image: graph of alerts over the last 14d]

I'm not sure how this graph interacts with silences, but it seems that on the day this issue was filed (the 24th) the warning alerts went up a lot. On the 29th I shipped some fixes, with the big one on the 30th (and one or two more over the next day or two). You can see the graph go mostly silent again. Then on the weekend we scaled up indexed search to cover 200k instead of 100k repos, and you can see that on Monday (when the site has more activity) all the alerts start up again. I suspect that is the root cause. cc @beyang

keegancsmith commented 4 years ago

zoekt-indexserver still uses significant CPU, without seeming to affect the only other service metric available (average revision resolve duration)

Average revision resolve duration is measuring an RPC call, so it is putting load on gitserver (via frontend). So lots of CPU use indicates it is likely indexing a lot. I would look at the recently added queue metrics.

zoekt-webserver still uses all of its memory, with frequently firing "50s+ indexed search request errors every 5m by code" alerts - this might be an issue of the alert being on a hard threshold rather than a ratio

The webserver uses a lot of memory even if it is not serving any requests, i.e. the memory use is dominated by the working set of indexes, not by the results or the number of them generated. At this scale we would need a lot of traffic for it to contribute to memory use beyond just holding the indexes in memory.
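
For reference, the hard-threshold versus ratio distinction raised in the quoted comment can be sketched roughly as below. This is only an illustration, not the actual alert definition: the function names, the 50-error threshold, the 1% ratio, and the request counts are all assumptions made up for the example.

```python
# Hypothetical sketch: a fixed error-count alert vs. an error-ratio alert.
# None of these names or numbers come from the real Sourcegraph alert config.

def fires_on_threshold(error_count: int, threshold: int = 50) -> bool:
    """Fire when the raw number of errors in the window reaches a fixed threshold."""
    return error_count >= threshold


def fires_on_ratio(error_count: int, total_requests: int, max_ratio: float = 0.01) -> bool:
    """Fire when errors exceed a fraction of all requests in the window."""
    if total_requests == 0:
        # No traffic in the window, so there is nothing to alert on.
        return False
    return error_count / total_requests > max_ratio


if __name__ == "__main__":
    # Assumed numbers: 60 errors in a 5m window with 100,000 requests.
    errors, requests = 60, 100_000
    print("threshold alert fires:", fires_on_threshold(errors))
    print("ratio alert fires:", fires_on_ratio(errors, requests))
```

With the assumed numbers, the fixed threshold fires while the ratio does not, which is the kind of noise a hard threshold can produce once traffic grows.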