Open ctb opened 2 months ago
File types that can be loaded as a single collection, per https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/fa7ab221baa9a8437353f2e894892fbd7545a479/src/utils.rs#L660 (which, yes, is potentially not inclusive of everything sourmash can actually do, but seems like a good starting point ;)
(checkbox indicates tested for use in external storage in the branchwater plugin)
test_index::test_index_protein
test_index::test_index_sig
test_index::test_index_manifest
test_index::test_index
In an interesting nod to Luiz's point about `Storage` being the key, the manifest and pathlist ones work because they can be supported by a single `Storage` class, `FSStorage`. The zipfile collection is supported by `ZipStorage`. The single JSON file is supported by creating a new storage (?), `InnerStorage`, that presumably copies the sketches - gotta look into that.
I think now I need to understand what `InnerStorage` does vs `Storage`... 🤔
I'm also curious: can we use one RocksDB as an external storage for another RocksDB? Then maybe we could efficiently index part of one RocksDB in another RocksDB index...
Currently, `RevIndex` only supports a single `Collection` for use as external storage. This limits it to things like Zip files and .sig.gz files, and maybe manifests and pathlists of .sig.gz files.
In https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/430, we are adding `MultiCollection` to the branchwater plugin, so that we can support a variety of nice features, such as standalone manifests and pathlists pointing at zip files. `MultiCollection` recursively loads itself as needed.
However, `MultiCollection` can't be used as a `Collection` for `RevIndex`. This is unfortunate and leads to some contortions, the most notable of which is that the `sourmash scripts index` command can only use supported `Collection` types for external storage.
It would be nice to enable a larger subset of `MultiCollection` loading functionality for `RevIndex`.
Note that `Storage` is a trait, so perhaps one of the simplest ways forward is to implement a `MultiStorage` that supports the needed flexibility, and then instantiate a `Collection` with that `MultiStorage`.
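To make the `MultiStorage` idea a bit more concrete, here is a rough sketch of the delegation pattern. Everything in it is an assumption for illustration: the `Storage` trait below is a deliberately simplified stand-in with a single `load` method (not the real sourmash trait, which has more methods and different signatures), `MemStorage` is a toy backend standing in for something like `FSStorage` or `ZipStorage`, and the try-each-backend-in-order lookup is just one possible policy.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
}

/// Simplified stand-in for a storage trait (NOT the real sourmash `Storage`).
pub trait Storage {
    /// Return the raw bytes stored under `path`.
    fn load(&self, path: &str) -> Result<Vec<u8>, StorageError>;
}

/// Toy in-memory backend, standing in for a filesystem or zip backend.
pub struct MemStorage {
    entries: HashMap<String, Vec<u8>>,
}

impl MemStorage {
    pub fn new(items: &[(&str, &str)]) -> Self {
        let entries = items
            .iter()
            .map(|(k, v)| (k.to_string(), v.as_bytes().to_vec()))
            .collect();
        MemStorage { entries }
    }
}

impl Storage for MemStorage {
    fn load(&self, path: &str) -> Result<Vec<u8>, StorageError> {
        self.entries
            .get(path)
            .cloned()
            .ok_or_else(|| StorageError::NotFound(path.to_string()))
    }
}

/// Hypothetical `MultiStorage`: wraps several backends behind one `Storage`.
pub struct MultiStorage {
    backends: Vec<Box<dyn Storage>>,
}

impl MultiStorage {
    pub fn new() -> Self {
        MultiStorage { backends: Vec::new() }
    }

    /// Register another backend; backends are consulted in insertion order.
    pub fn add(&mut self, backend: Box<dyn Storage>) {
        self.backends.push(backend);
    }
}

impl Storage for MultiStorage {
    fn load(&self, path: &str) -> Result<Vec<u8>, StorageError> {
        // Return the first successful load; report NotFound if none succeed.
        for backend in &self.backends {
            if let Ok(data) = backend.load(path) {
                return Ok(data);
            }
        }
        Err(StorageError::NotFound(path.to_string()))
    }
}
```

Because `MultiStorage` itself implements the trait, anything that accepts a single storage could in principle take one without knowing how many backends sit behind it. Whether first-match lookup is the right policy (as opposed to routing each path to a known backend via the manifest) is an open design question.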