SCIP-aware symbol storage

varungandhi-src commented 9 months ago

(NOTE: This is currently just an idea, not an actively worked on project.)

In our database today, we essentially have a Radix tree of SCIP symbol names stored in the codeintel_scip_symbol_names table.

The reason we use a radix tree instead of storing the SCIP symbol strings (or not storing them at all) directly is because:

We want to reduce disk usage/table size (this table is one of the biggest in the code intel database)
We need to be able to materialize the symbol strings during querying (see this CTE)

However, the radix tree does not take into account the structure of SCIP at all, making it very hard to implement different forms of matching on symbol names (such as that requested in https://github.com/sourcegraph/sourcegraph/issues/59957) other than exact matching.

Instead, we could utilize the structure of SCIP symbols:

<scheme> ' ' <manager> ' ' <package-name> ' ' <version> ' ' (<descriptor>)+

Out of these, there are probably not that many <scheme> ' ' <manager> ' ' <package-name> ' ' <version> prefixes in total anyways. In terms of the Radix tree, the total number of nodes near the root up to a depth of 3-4 are probably not that high. We could leverage this to split out the prefixes into a more structured form:

codeintel_scip_symbol_names_v2
id | upload_id | scheme_manager | package_name | package_version | symbol_ids : (array[int])

codeintel_scip_symbol_descriptors
id | reference_count | name_segment | prefix_id

So essentially splitting the symbol name into two, with the prefix stored in an easily accessible fashion.

Like the old schema, this new schema would allow:

Searching for symbols based on exact package name and version
Reasonably fast deletion for uploads - Identify rows in codeintel_scip_symbol_names_v2, iterate over symbol_ids arrays, apply decrements to reference_count, and delete rows with reference_count == 0.

Unlike the old schema, this new schema would:

Allow searching for packages and versions in a fuzzy way, as requested in https://github.com/sourcegraph/sourcegraph/issues/59957
Will likely actually reduce the storage significantly because we no longer have different rows when the descriptors are repeated across different uploads.
- (Optional) This space savings might allow us to be a bit loose with the Radix tree and break the name_segment values based on descriptors rather than at random intervals, in case that helps with better insertion logic.
  - Sketch of insertion logic: construct DescriptorPrefixTree in Go for a batch by splitting symbol strings at descriptor boundaries, do a breadth first traversal, at each tree depth retrieving the IDs for the prefixes and then perform an insert-or-bump-refcount with the (name_segment, prefix_id) pair.

Added downsides of new schema:

We'd need to storage two integers in codeintel_scip_symbols instead of just one that we're using now (symbol_id), because the symbol itself would be broken up into two chunks (one "flat", one tree-ified). That seems OK given the storage we'd recover from de-duplicating the trees across uploads.
(Risk) Because of the ref count bumping, there's the potential that the write path (= inserting uploads) interferes with reads (i.e. code nav).
(Risk) Adding an index for (name_segment, prefix_id) for insertions add a bunch of overhead/disk utilization.

varungandhi-src commented 8 months ago

The proposed table codeintel_scip_symbol_names_v2 is quite similar to lsif_references that we have today.

 Column  |  Type   | Collation | Nullable |                   Default                   
---------+---------+-----------+----------+---------------------------------------------
 id      | integer |           | not null | nextval('lsif_references_id_seq'::regclass)
 scheme  | text    |           | not null | 
 name    | text    |           | not null | 
 version | text    |           |          | 
 dump_id | integer |           | not null | 
 manager | text    |           | not null | ''::text

varungandhi-src commented 7 months ago

Somewhat related @olafurpg previously wrote a blog post about how bloom filters can be used for fuzzy symbol search. https://olafurpg.github.io/metals/blog/2019/01/22/bloom-filters.html

It might be possible to use the same technique in a hierarchical way (e.g. one filter per repo, one filter per document) to do the first level of approximate matching.

sourcegraph / sourcegraph-public-snapshot

SCIP-aware symbol storage #60800