We store symbol names for SCIP documents in a table called codeintel_scip_symbol_names. To save space, we encode them as a prefix trie inside the database.
When storing larger SCIP indexes, we break the work into chunks. We then compute the optimal trie for each chunk individually and flush it to that table. Because we don't analyze overlaps across chunks, any part of the trie that appears in multiple chunks gets duplicated.
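The duplication described above can be illustrated with a small sketch. This is not the real encoder: it assumes one database row per trie node, splits symbols on `/` purely for illustration, and uses made-up symbol names. `Node`, `rows_for`, and `rows_chunked` are hypothetical helpers, not names from the codebase.

```rust
use std::collections::BTreeMap;

/// One trie node. In this sketch every node except the root stands for one row.
#[derive(Default)]
struct Node {
    children: BTreeMap<String, Node>,
}

impl Node {
    /// Inserts a symbol, split into segments on '/' (an illustrative choice,
    /// not how real SCIP symbols are segmented).
    fn insert(&mut self, symbol: &str) {
        let mut node = self;
        for segment in symbol.split('/') {
            node = node.children.entry(segment.to_string()).or_default();
        }
    }

    /// Counts the nodes below this one, i.e. the rows needed to store the trie.
    fn row_count(&self) -> usize {
        self.children.values().map(|c| 1 + c.row_count()).sum()
    }
}

/// Rows needed when all given symbols share a single trie.
fn rows_for(symbols: &[&str]) -> usize {
    let mut root = Node::default();
    for s in symbols {
        root.insert(s);
    }
    root.row_count()
}

/// Rows needed when each chunk gets its own trie, with no sharing across chunks.
fn rows_chunked(chunks: &[Vec<&str>]) -> usize {
    chunks.iter().map(|c| rows_for(c)).sum()
}

fn main() {
    // Hypothetical symbols; both chunks repeat the "pkg/graph" prefix.
    let chunks = vec![
        vec!["pkg/graph/Node", "pkg/graph/Edge"],
        vec!["pkg/graph/Walk", "pkg/store/Open"],
    ];
    let chunked = rows_chunked(&chunks);
    let all: Vec<&str> = chunks.iter().flatten().copied().collect();
    let merged = rows_for(&all);
    let reduction = 100.0 * (chunked - merged) as f64 / chunked as f64;
    println!("from {} rows", chunked);
    println!("to {} rows", merged);
    println!("reduction by {:.2}%", reduction);
}
```

With these four symbols the per-chunk tries need 9 rows in total, while a single merged trie needs 7, because the `pkg` and `graph` nodes are stored once instead of twice.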
There's a script to check the potential savings in #60703. For example, the scip-go generated index on sourcegraph/sourcegraph ends up using 20% extra rows in the database.
❯ cat ../../llvm.csv | cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target/release/merge-tries`
from 1345913 rows
to 813979 rows
reduction by 39.52%