sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Duplicated nodes in symbol name trie #60704

Open kritzcreek opened 7 months ago

kritzcreek commented 7 months ago

We store symbol names for SCIP documents in the codeintel_scip_symbol_names table and, to save space, encode them as a prefix trie inside the database. When storing larger SCIP indexes, we break the work up into chunks, compute the optimal trie for each chunk individually, and flush it to that table. Because we don't analyze overlaps across chunks, any part of the trie that appears in multiple chunks gets duplicated.
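To make the duplication concrete, here is a minimal sketch that compares the row count of per-chunk tries with that of a single merged trie. It rests on several assumptions that are not from the actual implementation: one row per trie node, symbol names split on "/" into segments (the real trie picks its own optimal split points), and made-up names and chunk boundaries.

```python
from typing import Iterable, List, Set, Tuple

# A trie node is identified by its full path of segments from the root.
Prefix = Tuple[str, ...]


def trie_nodes(names: Iterable[str], sep: str = "/") -> Set[Prefix]:
    """Return every node of a segment-level trie built from `names`.

    Assumption: splitting on `sep` stands in for the real trie's
    optimal split points; this is only for illustration.
    """
    nodes: Set[Prefix] = set()
    for name in names:
        segments = tuple(name.split(sep))
        # Each non-empty prefix of the segment list is one trie node.
        for end in range(1, len(segments) + 1):
            nodes.add(segments[:end])
    return nodes


def rows_per_chunk(chunks: List[List[str]]) -> int:
    """Rows written when each chunk's trie is flushed independently."""
    return sum(len(trie_nodes(chunk)) for chunk in chunks)


def rows_merged(chunks: List[List[str]]) -> int:
    """Rows written if a single trie were built across all chunks."""
    return len(trie_nodes(name for chunk in chunks for name in chunk))


# Hypothetical symbol names, split into two chunks.
chunks = [
    ["pkg/db/Open", "pkg/db/Close"],
    ["pkg/db/Query", "pkg/http/Get"],
]

print("per-chunk rows:", rows_per_chunk(chunks))
print("merged rows:   ", rows_merged(chunks))
```

In this toy input the shared prefixes "pkg" and "pkg/db" are stored once per chunk, so the per-chunk layout needs two more rows than the merged one.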

There's a script to check the potential savings in #60703. For example, the scip-go-generated index for sourcegraph/sourcegraph ends up using 20% extra rows in the database.

varungandhi-src commented 7 months ago

Numbers for LLVM:

❯ cat ../../llvm.csv | cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/merge-tries`
from 1345913 rows
to   813979 rows
reduction by 39.52%
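For reference, the reduction figure above can be reproduced from the two row counts. The sketch below also shows the same difference measured relative to the merged row count ("extra rows"), since the two percentages use different baselines. The function names are made up for illustration, and the thread does not say which baseline the 20% figure for scip-go uses.

```python
def reduction_percent(rows_before: int, rows_after: int) -> float:
    """Rows saved by merging, relative to the current (duplicated) row count."""
    return 100.0 * (rows_before - rows_after) / rows_before


def extra_rows_percent(rows_before: int, rows_after: int) -> float:
    """Duplicated rows, relative to the merged (deduplicated) row count."""
    return 100.0 * (rows_before - rows_after) / rows_after


# Row counts taken from the LLVM output above.
before, after = 1345913, 813979

print(f"reduction by {reduction_percent(before, after):.2f}%")
print(f"extra rows   {extra_rows_percent(before, after):.2f}%")
```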