sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

consider how to support more flexible `Collection` in `RevIndex` for external storage #3321

Open ctb opened 2 months ago

ctb commented 2 months ago

Currently, RevIndex only supports a single Collection for use as external storage. This limits it to things like Zip files and .sig.gz files, and maybe manifests and pathlists of .sig.gz files.

In https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/430, we are adding MultiCollection to the branchwater plugin, so that we can support a variety of nice features, such as standalone manifests and pathlists pointing at zip files. MultiCollection recursively loads itself as needed.

However, MultiCollection can't be used as a Collection for RevIndex. This is unfortunate and leads to some contortions, the most notable of which is that the sourmash scripts index command can only use supported Collection types for external storage.

It would be nice to enable a larger subset of MultiCollection loading functionality for RevIndex.

Note that Storage is a trait, so perhaps one of the simplest ways forward is to implement a MultiStorage that supports the needed flexibility, and then instantiate a Collection with that MultiStorage.
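A rough sketch of the MultiStorage idea, to make the shape concrete. Everything here is an assumption for illustration: the `Storage` trait below is a simplified one-method stand-in and not sourmash's real trait, and `MemStorage`, `StorageError`, and the `MultiStorage` methods are hypothetical names. The only point is that a composite can implement the same trait as its backends, so anything that accepts one storage could accept many.

```rust
use std::collections::HashMap;
use std::fmt;

/// Simplified stand-in for a storage trait (assumption: NOT sourmash's real `Storage` API).
pub trait Storage {
    /// Return the raw bytes stored under `path`, or an error if this backend lacks it.
    fn load(&self, path: &str) -> Result<Vec<u8>, StorageError>;
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
}

impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StorageError::NotFound(path) => write!(f, "not found: {}", path),
        }
    }
}

impl std::error::Error for StorageError {}

/// Toy in-memory backend, standing in for a zip-file or filesystem storage.
#[derive(Default)]
pub struct MemStorage {
    blobs: HashMap<String, Vec<u8>>,
}

impl MemStorage {
    pub fn insert(&mut self, path: &str, data: &[u8]) {
        self.blobs.insert(path.to_string(), data.to_vec());
    }
}

impl Storage for MemStorage {
    fn load(&self, path: &str) -> Result<Vec<u8>, StorageError> {
        self.blobs
            .get(path)
            .cloned()
            .ok_or_else(|| StorageError::NotFound(path.to_string()))
    }
}

/// Hypothetical composite: holds several backends and implements `Storage` itself.
#[derive(Default)]
pub struct MultiStorage {
    backends: Vec<Box<dyn Storage>>,
}

impl MultiStorage {
    pub fn push(&mut self, backend: Box<dyn Storage>) {
        self.backends.push(backend);
    }
}

impl Storage for MultiStorage {
    fn load(&self, path: &str) -> Result<Vec<u8>, StorageError> {
        // Try each backend in insertion order; the first one holding `path` wins.
        for backend in &self.backends {
            if let Ok(data) = backend.load(path) {
                return Ok(data);
            }
        }
        Err(StorageError::NotFound(path.to_string()))
    }
}
```

Because `MultiStorage` is itself a `Storage`, one `MultiStorage` can be pushed into another, which loosely mirrors how MultiCollection recursively loads itself. First-match-wins lookup is a design choice made for this sketch; a real implementation would probably need a way to map each manifest record to the specific backend that holds it, rather than probing every backend.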

ctb commented 2 months ago

File types that can be loaded as a single collection, per https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/fa7ab221baa9a8437353f2e894892fbd7545a479/src/utils.rs#L660 (which, yes, is potentially not inclusive of everything sourmash can actually do, but seems like a good starting point ;)

(checkbox indicates tested for use in external storage in the branchwater plugin)

In an interesting nod to Luiz's point about Storage being the key, the manifest and pathlist ones work because they can be supported by a single Storage class, FSStorage. The zipfile collection is supported by ZipStorage. The single JSON file is supported by creating a new storage (?), InnerStorage, that presumably copies the sketches - gotta look into that.

I think now I need to understand what InnerStorage does vs Storage... 🤔

I'm also curious: can we use one RocksDB as an external storage for another RocksDB? Then maybe we could efficiently index part of one RocksDB in another RocksDB index...