sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.12k stars 1.29k forks source link

SCIP-aware symbol storage #60800

Open varungandhi-src opened 9 months ago

varungandhi-src commented 9 months ago

(NOTE: This is currently just an idea, not an actively worked on project.)

In our database today, we essentially have a Radix tree of SCIP symbol names stored in the codeintel_scip_symbol_names table.

The reason we use a radix tree instead of storing the SCIP symbol strings (or not storing them at all) directly is because:

  1. We want to reduce disk usage/table size (this table is one of the biggest in the code intel database)
  2. We need to be able to materialize the symbol strings during querying (see this CTE)

However, the radix tree does not take into account the structure of SCIP at all, making it very hard to implement different forms of matching on symbol names (such as that requested in https://github.com/sourcegraph/sourcegraph/issues/59957) other than exact matching.

Instead, we could utilize the structure of SCIP symbols:

<scheme> ' ' <manager> ' ' <package-name> ' ' <version> ' ' (<descriptor>)+

Out of these, there are probably not that many <scheme> ' ' <manager> ' ' <package-name> ' ' <version> prefixes in total anyways. In terms of the Radix tree, the total number of nodes near the root up to a depth of 3-4 are probably not that high. We could leverage this to split out the prefixes into a more structured form:

codeintel_scip_symbol_names_v2
id | upload_id | scheme_manager | package_name | package_version | symbol_ids : (array[int])

codeintel_scip_symbol_descriptors
id | reference_count | name_segment | prefix_id

So essentially splitting the symbol name into two, with the prefix stored in an easily accessible fashion.

Like the old schema, this new schema would allow:

Unlike the old schema, this new schema would:

Added downsides of new schema:

varungandhi-src commented 8 months ago

The proposed table codeintel_scip_symbol_names_v2 is quite similar to lsif_references that we have today.

 Column  |  Type   | Collation | Nullable |                   Default                   
---------+---------+-----------+----------+---------------------------------------------
 id      | integer |           | not null | nextval('lsif_references_id_seq'::regclass)
 scheme  | text    |           | not null | 
 name    | text    |           | not null | 
 version | text    |           |          | 
 dump_id | integer |           | not null | 
 manager | text    |           | not null | ''::text
varungandhi-src commented 7 months ago

Somewhat related @olafurpg previously wrote a blog post about how bloom filters can be used for fuzzy symbol search. https://olafurpg.github.io/metals/blog/2019/01/22/bloom-filters.html

It might be possible to use the same technique in a hierarchical way (e.g. one filter per repo, one filter per document) to do the first level of approximate matching.