Open varungandhi-src opened 9 months ago
The proposed table `codeintel_scip_symbol_names_v2` is quite similar to `lsif_references` that we have today.
Column | Type | Collation | Nullable | Default
---------+---------+-----------+----------+---------------------------------------------
id | integer | | not null | nextval('lsif_references_id_seq'::regclass)
scheme | text | | not null |
name | text | | not null |
version | text | | |
dump_id | integer | | not null |
manager | text | | not null | ''::text
Somewhat related: @olafurpg previously wrote a blog post about how bloom filters can be used for fuzzy symbol search: https://olafurpg.github.io/metals/blog/2019/01/22/bloom-filters.html
It might be possible to use the same technique in a hierarchical way (e.g. one filter per repo, one filter per document) to do the first level of approximate matching.
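To make the hierarchical idea a bit more concrete, here is a minimal sketch of a two-level filter: one Bloom filter per repo that is checked first, then one per document. Everything in it is an assumption for illustration: the `bloom` and `repoIndex` types, the filter sizes, and especially the choice of lowercase byte trigrams as keys (the blog post may well use a different keying scheme). It only shows the shape of the lookup, not a tuned implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// bloom is a tiny fixed-size Bloom filter using double hashing (hypothetical type).
type bloom struct {
	bits []uint64
	k    uint32
}

func newBloom(nbits int, k uint32) *bloom {
	return &bloom{bits: make([]uint64, (nbits+63)/64), k: k}
}

// positions derives k bit positions from one 64-bit FNV-1a hash.
func (b *bloom) positions(s string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)|1
	n := uint32(len(b.bits) * 64)
	out := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		out[i] = (h1 + i*h2) % n
	}
	return out
}

func (b *bloom) add(s string) {
	for _, p := range b.positions(s) {
		b.bits[p/64] |= uint64(1) << (p % 64)
	}
}

func (b *bloom) mayContain(s string) bool {
	for _, p := range b.positions(s) {
		if b.bits[p/64]&(uint64(1)<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

// trigrams returns lowercase byte trigrams (assumes ASCII-ish symbol names).
func trigrams(s string) []string {
	s = strings.ToLower(s)
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

func mayContainAll(b *bloom, keys []string) bool {
	for _, k := range keys {
		if !b.mayContain(k) {
			return false
		}
	}
	return true
}

// repoIndex holds one filter for the whole repo and one per document.
type repoIndex struct {
	repo *bloom
	docs map[string]*bloom
}

func newRepoIndex() *repoIndex {
	return &repoIndex{repo: newBloom(4096, 3), docs: map[string]*bloom{}}
}

func (r *repoIndex) add(doc, symbol string) {
	d, ok := r.docs[doc]
	if !ok {
		d = newBloom(1024, 3)
		r.docs[doc] = d
	}
	for _, g := range trigrams(symbol) {
		r.repo.add(g)
		d.add(g)
	}
}

// candidates returns documents that may contain the query; false positives
// are possible, false negatives are not.
func (r *repoIndex) candidates(query string) []string {
	keys := trigrams(query)
	if !mayContainAll(r.repo, keys) {
		return nil // the repo-level filter rules out the whole repo
	}
	var out []string
	for doc, d := range r.docs {
		if mayContainAll(d, keys) {
			out = append(out, doc)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	idx := newRepoIndex()
	idx.add("a.go", "parseSymbol")
	idx.add("b.go", "formatRange")
	fmt.Println(idx.candidates("symbol"))
}
```

The candidates would still need an exact check against the real symbol data afterwards, since a Bloom filter can only say "definitely not here" or "maybe here".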
(NOTE: This is currently just an idea, not an actively worked on project.)
In our database today, we essentially have a radix tree of SCIP symbol names stored in the `codeintel_scip_symbol_names` table. We use a radix tree instead of storing the SCIP symbol strings directly (or not storing them at all) because:
However, the radix tree does not take into account the structure of SCIP at all, making it very hard to implement different forms of matching on symbol names (such as that requested in https://github.com/sourcegraph/sourcegraph/issues/59957) other than exact matching.
Instead, we could utilize the structure of SCIP symbols:
Out of these, there are probably not that many `<scheme> ' ' <manager> ' ' <package-name> ' ' <version>` prefixes in total anyway. In terms of the radix tree, the total number of nodes near the root, up to a depth of 3-4, is probably not that high. We could leverage this to split out the prefixes into a more structured form:

So essentially, we would split the symbol name in two, with the prefix stored in an easily accessible fashion.
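As a rough illustration of that split, the sketch below separates a global SCIP symbol into its `<scheme> <manager> <package-name> <version>` prefix and the remaining descriptor string. The `symbolPrefix` type and both function names are hypothetical. It assumes, as I understand the SCIP grammar, that a double space inside a field is an escaped literal space, and it simply rejects `local` symbols since they have no package prefix.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// symbolPrefix holds the four space-separated leading fields of a SCIP symbol
// (hypothetical type, not part of any existing package).
type symbolPrefix struct {
	Scheme, Manager, PackageName, Version string
}

// nextField reads up to the next single space. A double space is treated as
// an escaped literal space inside the field (assumption about the grammar).
func nextField(s string) (field, tail string, err error) {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] != ' ' {
			b.WriteByte(s[i])
			continue
		}
		if i+1 < len(s) && s[i+1] == ' ' {
			b.WriteByte(' ')
			i++ // skip the second space of the escape
			continue
		}
		return b.String(), s[i+1:], nil
	}
	return "", "", errors.New("missing field separator")
}

// splitSymbol returns the structured prefix and the untouched descriptor string.
func splitSymbol(symbol string) (symbolPrefix, string, error) {
	if strings.HasPrefix(symbol, "local ") {
		return symbolPrefix{}, "", errors.New("local symbols have no package prefix")
	}
	var fields [4]string
	rest := symbol
	for i := range fields {
		field, tail, err := nextField(rest)
		if err != nil {
			return symbolPrefix{}, "", err
		}
		fields[i], rest = field, tail
	}
	p := symbolPrefix{
		Scheme:      fields[0],
		Manager:     fields[1],
		PackageName: fields[2],
		Version:     fields[3],
	}
	return p, rest, nil
}

func main() {
	// The symbol below is a made-up example.
	p, descriptors, err := splitSymbol("scip-go gomod example.com/mod v1.2.3 pkg/Type#Method().")
	fmt.Println(p, descriptors, err)
}
```

The prefix struct is what would go into the "flat" part of the schema, and the descriptor string is what would still be stored as a tree.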
Like the old schema, this new schema would allow:
- […] `codeintel_scip_symbol_names_v2`, iterate over `symbol_ids` arrays, apply decrements to `reference_count`, and delete rows with `reference_count == 0`.

Unlike the old schema, this new schema would:
- […] `name_segment` values based on descriptors rather than at random intervals, in case that helps with better insertion logic.
- […] `DescriptorPrefixTree` in Go for a batch by splitting symbol strings at descriptor boundaries, do a breadth-first traversal, at each tree depth retrieving the IDs for the prefixes, and then perform an insert-or-bump-refcount with the `(name_segment, prefix_id)` pair.

Added downsides of new schema:
- […] `codeintel_scip_symbols` instead of just one that we're using now (`symbol_id`), because the symbol itself would be broken up into two chunks (one "flat", one tree-ified). That seems OK given the storage we'd recover from de-duplicating the trees across uploads.
- […] `(name_segment, prefix_id)` for insertions add a bunch of overhead/disk utilization.