Moniker Size and Storage Optimization

We've been working with SCIP indices for quite a while now, and one thing has become clear: they take up a lot of space, especially at the scale of our codebase.

One thought is that monikers take up a large chunk of that space, and much of it is redundant. For example, every symbol within some/package/path has scip-go gomod some/package/path v0.0.4 as a preamble, followed by the actual descriptors, such as some/package/path/Struct#, which duplicate the package path already given in the preamble. In our use case we only need the descriptors, and maybe the package version, so one idea was to simply strip the prefix from the indices. Before doing that, I wanted to ask whether any other considerations were made around the size of the indices.
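To make the prefix-stripping idea concrete, here is a rough sketch of what we have in mind. It assumes a symbol is a space-separated string of scheme, package manager, package name, version and descriptors, and that none of the first four fields contains an escaped space; the function name split_symbol is hypothetical and not part of any SCIP library.

```python
from typing import Optional, Tuple


def split_symbol(symbol: str) -> Tuple[Optional[str], str]:
    """Split a symbol string into (version, descriptors).

    Assumption: the layout is
    "<scheme> <manager> <package> <version> <descriptors>"
    and the first four fields contain no escaped spaces.
    Local symbols ("local <id>") carry no package preamble,
    so they are returned unchanged with no version.
    """
    if symbol.startswith("local "):
        return None, symbol

    # Split on the first four spaces only; the descriptors keep any later spaces.
    parts = symbol.split(" ", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected symbol layout: {symbol!r}")

    _scheme, _manager, _package, version, descriptors = parts
    return version, descriptors


version, descriptors = split_symbol(
    "scip-go gomod some/package/path v0.0.4 some/package/path/Struct#"
)
```

With the example above, only the version and the descriptors would need to be stored, since the descriptors already repeat the package path.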

We considered compression, since these strings would be very compressible, but our index reader would still need to scan through the uncompressed index regardless.
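As a quick illustration of why compression looks attractive on disk but does not help the reader, the following sketch compresses a synthetic list of symbols that share one preamble. The symbol names (Struct0, Struct1, ...) are made up for the example, and zlib is just one possible general-purpose compressor, not something the index format prescribes.

```python
import zlib
from typing import List


def compression_ratio(symbols: List[str]) -> float:
    """Return compressed size divided by raw size for a list of symbols."""
    raw = "\n".join(symbols).encode("utf-8")
    packed = zlib.compress(raw)
    # A reader has to undo this step in full before it can scan the symbols.
    assert zlib.decompress(packed) == raw
    return len(packed) / len(raw)


# Synthetic symbols that all repeat the same preamble and package path.
symbols = [
    f"scip-go gomod some/package/path v0.0.4 some/package/path/Struct{i}#"
    for i in range(1000)
]
ratio = compression_ratio(symbols)
```

The ratio is small because of the repeated prefix, but the decompress call shows the cost we were worried about: the full uncompressed bytes still have to be produced and scanned.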

Another option is to define symbol/moniker mappings for the index, which would map each moniker to a unique id so that it can be reused, similar to how LSIF handles it. This could be an optional feature in the index definition, at either the document or the index level. It would likely also give a good indication of all the symbols the index references without having to read through the documents and occurrences.
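A minimal sketch of the mapping idea, assuming occurrences can be reduced to a plain list of symbol strings. The SymbolTable class and intern_occurrences function are hypothetical names used only for illustration; nothing like them exists in the index definition today.

```python
from typing import Dict, List, Tuple


class SymbolTable:
    """Maps each distinct symbol string to a small integer id."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self.symbols: List[str] = []  # id -> symbol, in first-seen order

    def intern(self, symbol: str) -> int:
        """Return the id for symbol, assigning the next free id if it is new."""
        sid = self._ids.get(symbol)
        if sid is None:
            sid = len(self.symbols)
            self._ids[symbol] = sid
            self.symbols.append(symbol)
        return sid

    def lookup(self, sid: int) -> str:
        """Return the symbol string stored under the given id."""
        return self.symbols[sid]


def intern_occurrences(occurrences: List[str]) -> Tuple[SymbolTable, List[int]]:
    """Replace every symbol string in occurrences with its table id."""
    table = SymbolTable()
    ids = [table.intern(symbol) for symbol in occurrences]
    return table, ids


table, ids = intern_occurrences([
    "scip-go gomod some/package/path v0.0.4 some/package/path/Struct#",
    "scip-go gomod some/package/path v0.0.4 some/package/path/Other#",
    "scip-go gomod some/package/path v0.0.4 some/package/path/Struct#",
])
```

Each full moniker is stored once in table.symbols, the occurrences carry only integers, and the table by itself lists every symbol the index references.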

sourcegraph / scip

Moniker Size and Storage Optimization #276