sourcegraph / sourcegraph-public-snapshot


Proposal: reduce heap usage of Zoekt #46276

Closed keegancsmith closed 1 year ago

keegancsmith commented 1 year ago

Summary: interesting but straightforward work moving data structures that Zoekt currently keeps in memory onto disk can result in massive memory use reductions. Memory use is dominated by three data structures, each accounting for a progressively smaller share of the heap. This means we can adjust scope and still land big wins.

The tradeoff is likely slower search, and having extra memory available for caching becomes even more important. However, in practice Zoekt is very fast compared to every other subsystem in a search request (tens of milliseconds), so we have a lot of room to make it slower.

Risk: In practice this makes Zoekt too slow.

Win: Reductions in hosting costs due to smaller clusters / lower memory requirements.

Scope: I would target readNgrams first (~40% of heap usage). Use a b-tree (do some research on the best b-tree). Measure impact on a real workload. After 2 weeks I'd expect results or learnings. The code itself is the quick part, since the code where we generate and read ngrams is self-contained. The slow part is rolling it out correctly and measuring.
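To sketch the direction (this is not zoekt's actual shard layout; the record shape and names are assumptions for illustration), the trigram map can be stored as sorted fixed-size records and searched without being resident in memory. An on-disk b-tree generalizes the same idea to page-sized reads along a root-to-leaf path:

```go
package main

import (
	"fmt"
	"sort"
)

// ngramEntry is a hypothetical fixed-size record: a trigram packed into a
// uint64, plus the offset and length of its postings list in the shard file.
// The in-memory equivalent is essentially map[int64]int64; sorting the
// records lets them live on disk and be searched without loading the map.
type ngramEntry struct {
	ngram  uint64
	offset int64
	length int64
}

// lookup binary-searches a sorted slice of entries. A real on-disk b-tree
// would read only the pages along the root-to-leaf path instead of holding
// the whole slice in memory, but the access pattern is the same.
func lookup(entries []ngramEntry, ngram uint64) (ngramEntry, bool) {
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].ngram >= ngram
	})
	if i < len(entries) && entries[i].ngram == ngram {
		return entries[i], true
	}
	return ngramEntry{}, false
}

func main() {
	entries := []ngramEntry{
		{ngram: 100, offset: 0, length: 64},
		{ngram: 205, offset: 64, length: 128},
		{ngram: 999, offset: 192, length: 32},
	}
	if e, ok := lookup(entries, 205); ok {
		fmt.Printf("postings at offset=%d len=%d\n", e.offset, e.length)
	}
	_, ok := lookup(entries, 300)
	fmt.Println("found 300:", ok) // false: absent trigrams need a full search
}
```

The cost shows up on misses: a map answers "not present" in O(1), while the tree walk takes O(log n) disk-page reads, which is the slowdown the bloom-filter idea below the fold is meant to offset.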

[2023-01-10 Tue 12:41] Looking at memory usage of zoekt-webserver via profiler. Top 1% profiles over 7 days has:

| Function | % of heap | Bytes |
|---|---|---|
| readNgrams | 39.8% | 5.572GB |
| unmarshalDocSections | 26.8% | 3.748GB |
| readFileNameNgrams | 12.3% | 1.727GB |

All of the above can be moved to a more classic DB structure on disk rather than being stored in memory. The tradeoff is speed, but more than likely this tradeoff will be minimal since a lot of reads on those structures then go on to read stuff on disk.

readNgrams is the most important data structure. It is a map from trigram to an offsets segment: where on disk we store the list of positions at which the trigram occurs, and how large that list is. This is encoded essentially as map[int64]int64. The place we will lose speed is that it will now take longer to determine that a trigram does not appear in a repo. This might not be a problem, and techniques like bloom filters can help offset the slowdown.
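A minimal sketch of the bloom-filter idea (not zoekt's implementation; sizes and hashing here are illustrative assumptions): a small in-memory bitset answers "definitely absent" or "possibly present", so most lookups for trigrams a repo does not contain never touch the on-disk ngram index at all.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a toy Bloom filter: k hash probes into a bitset. It can return
// false positives but never false negatives, which is exactly the property
// needed for a cheap "does this trigram exist at all?" prefilter.
type bloom struct {
	bits []uint64
	k    int // number of hash probes per element
}

func newBloom(nbits, k int) *bloom {
	return &bloom{bits: make([]uint64, (nbits+63)/64), k: k}
}

// positions derives k bit positions for a trigram via an FNV hash.
func (b *bloom) positions(ngram uint64) []uint {
	h := fnv.New64a()
	pos := make([]uint, b.k)
	for i := 0; i < b.k; i++ {
		fmt.Fprintf(h, "%d-%d", ngram, i)
		pos[i] = uint(h.Sum64()) % uint(len(b.bits)*64)
	}
	return pos
}

func (b *bloom) add(ngram uint64) {
	for _, p := range b.positions(ngram) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// mayContain returns false only when the trigram is certainly not indexed;
// only a true result requires falling through to the on-disk index.
func (b *bloom) mayContain(ngram uint64) bool {
	for _, p := range b.positions(ngram) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1<<16, 3)
	f.add(12345)
	fmt.Println(f.mayContain(12345)) // true: added elements always hit
	fmt.Println(f.mayContain(99999)) // almost certainly false at this load
}
```

The filter trades a small, fixed amount of memory for skipping most negative disk lookups, which is where the b-tree is slowest relative to the in-memory map.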

unmarshalDocSections is a list of all locations of symbols. This really does not need to be in memory and would benefit from a more classic approach of paging the data in when we want to know the locations of symbols in a specific document.

readFileNameNgrams is a lot like readNgrams, except that instead of trigrams over file contents, it is trigrams over file names. The same ideas apply, but there is also low-hanging fruit here: we currently keep the file names themselves in memory rather than loading them off disk.

/cc @sourcegraph/search-core

keegancsmith commented 1 year ago

Closing this out since @stefanhengl has shipped these improvements. The only thing left is unmarshalDocSections, which is now our biggest use of heap.