sourmash-bio / sourmash_plugin_branchwater

fast, multithreaded sourmash operations: search, compare, and gather.
GNU Affero General Public License v3.0
15 stars 2 forks source link

Improve support for external storage of sketches in RocksDB indexes #415

Open ctb opened 1 month ago

ctb commented 1 month ago

In https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/381, we found a problem when using RocksDB inverted indexes with zip files.

Starting from https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/390#issuecomment-2225776967,

I found the problem in the storage::InnerStorage struct, which was opening the zip file using a relative path:

InnerStorage::from_spec: opening zip://list.sig.zip

Conveniently 😭 this is not a new bug or problem, this is the same problem that I discussed ad nauseum over in https://github.com/sourmash-bio/sourmash/issues/3008, where we found it to be a problem for manifests - triggered initially by https://github.com/sourmash-bio/sourmash/issues/3053.

In brief, the problem is this: when we refer to an external storage, how do we interpret non-absolute paths?

As I wrote in #3008, "I am slowly coming around to the idea that loading things relative to the manifest path is correct."

So I think the Right Fix would be to rejigger the path to the zip file to be interpreted relative to the RocksDB location.

However, there is another fun component to this, which is that I'm not sure it's documented anywhere that RocksDB indices created by this plugin store the sketches externally, which is needed for gather (but not for manysearch)... Per @luizirber on slack,

me:

do RocksDB disk rev indexes, once created, require access to the original signatures to function?

@luizirber:

Yes for gather, no for search You can also build the index with internal sigs, but I don’t think it’s the default


In https://github.com/sourmash-bio/sourmash/pull/3250, @luizirber implemented internal storage for RocksDB, and over in #390 that is now being set as the default for new RocksDB indexes; documentation is being updated as well, in a new PR.

However, the original issue of relative path interpretation still remains, so this issue is here to remind us of all that went before.

ctb commented 2 weeks ago

NOTE: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/416 adds documentation regarding internal vs external storage.