Open ctb opened 2 years ago
Some things going on with the SqliteIndex
PR https://github.com/sourmash-bio/sourmash/pull/1808 makes me think that we should enable individual retrieval via manifest row. That gives storages the ability to figure out what collection of information works best for them, including off-label manifest row columns like primary keys in sqlite databases...
Over in calc-full-gather (https://github.com/ctb/2024-calc-full-gather, ref https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/187), I wrote a generic function that uses manifest rows to load specific sketches from a zip file:
    import sourmash

    def zipfile_load_ss_from_row(db, row):
        # Read the raw signature data stored at the row's internal location.
        data = db.storage.load(row['internal_location'])
        sigs = sourmash.signature.load_signatures(data)

        # A location can hold several sketches; pick the one matching the row's md5.
        return_sig = None
        for ss in sigs:
            if ss.md5sum() == row['md5']:
                assert return_sig is None  # there can only be one!
                return_sig = ss

        if return_sig is None:
            raise ValueError("no match to requested row in db")
        return return_sig
Curious how this approach would generalize to all `Index` classes, and also how it would interact with the Rust Collection layer.
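One way to picture the generalization is a base class whose default row getter scans every sketch and matches on md5, with backends overriding it when they carry a faster key in the row. This is only a rough sketch under stated assumptions: `ToySketch`, `RowLoadableIndex`, `load_by_row`, `ListIndex`, `KeyedIndex`, and the `_id` row column are all hypothetical names made up for illustration, not sourmash's API, and nothing here touches the Rust layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySketch:
    """Hypothetical stand-in for a signature; carries only an md5 and a name."""
    md5: str
    name: str


class RowLoadableIndex:
    """Hypothetical base class: the default getter scans all sketches."""

    def sketches(self):
        raise NotImplementedError

    def load_by_row(self, row):
        # Slow but generic: works for any backend that can iterate its sketches.
        matches = [ss for ss in self.sketches() if ss.md5 == row["md5"]]
        if not matches:
            raise ValueError("no match to requested row in db")
        if len(matches) > 1:
            raise ValueError("more than one match to requested row in db")
        return matches[0]


class ListIndex(RowLoadableIndex):
    """Backend with no special key; relies on the default scan."""

    def __init__(self, sketches):
        self._sketches = list(sketches)

    def sketches(self):
        return iter(self._sketches)


class KeyedIndex(RowLoadableIndex):
    """Backend with its own primary key, exposed as an off-label '_id' column."""

    def __init__(self, sketches):
        self._by_id = dict(enumerate(sketches))

    def sketches(self):
        return iter(self._by_id.values())

    def manifest_rows(self):
        return [
            {"md5": ss.md5, "name": ss.name, "_id": key}
            for key, ss in self._by_id.items()
        ]

    def load_by_row(self, row):
        key = row.get("_id")
        if key is None:
            # Row came from elsewhere and lacks our key: fall back to the scan.
            return super().load_by_row(row)
        try:
            return self._by_id[key]
        except KeyError:
            raise ValueError("no match to requested row in db") from None
```

The point of the design is that callers only ever hand over a row; whether the backend uses the md5, an internal location, or a private key is its own business.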
In https://github.com/sourmash-bio/sourmash/pull/1837, we change `sourmash sig extract` to identify signatures to extract using manifest rows, and then have to convert the manifest rows into a manifest, and from there into a picklist, in order to actually extract the sketches. This seems circuitous.

It also means that `sourmash sig extract --picklist` does not work on certain database types that do not support multiple picklists: LCA DBs, SBTs, and zipfiles w/o a manifest, for example.

Two ideas, not mutually exclusive:

One, we could have `Index` classes provide a signature getter that works on internal locations in manifests.

Two, we could directly provide a method for retrieving many signatures, given a manifest (or, really, just a list of internal locations).
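The second idea could look roughly like the following: group the requested rows by internal location so that each location is read only once, then pick out the matching md5s. This is a minimal sketch with made-up names: `load_many_from_rows`, `load_location`, `md5_of`, and the toy storage are all hypothetical, and the sketches here are plain dicts rather than sourmash signatures.

```python
from collections import defaultdict


def load_many_from_rows(rows, load_location, md5_of):
    """Return the sketches named by `rows`, reading each location once.

    `load_location` (hypothetical) maps an internal location to an iterable
    of sketches; `md5_of` (hypothetical) returns a sketch's md5 string.
    """
    # Collect the md5s wanted from each location.
    wanted = defaultdict(set)
    for row in rows:
        wanted[row["internal_location"]].add(row["md5"])

    found = []
    for location, md5s in wanted.items():
        seen = set()
        for sketch in load_location(location):
            md5 = md5_of(sketch)
            if md5 not in md5s:
                continue
            if md5 in seen:
                # Mirrors the "there can only be one!" check for single rows.
                raise ValueError(f"duplicate md5 {md5!r} in {location!r}")
            seen.add(md5)
            found.append(sketch)
        missing = md5s - seen
        if missing:
            raise ValueError(
                f"no match in {location!r} for md5(s): {sorted(missing)}"
            )
    return found


# Toy data standing in for a zip file with two internal locations.
toy_storage = {
    "a.sig": [{"md5": "aaa"}, {"md5": "bbb"}],
    "b.sig": [{"md5": "ccc"}],
}
toy_rows = [
    {"internal_location": "a.sig", "md5": "bbb"},
    {"internal_location": "b.sig", "md5": "ccc"},
    {"internal_location": "a.sig", "md5": "aaa"},
]
reads = []  # records which locations were actually opened


def toy_load(location):
    reads.append(location)
    return toy_storage[location]


picked = load_many_from_rows(toy_rows, toy_load, lambda s: s["md5"])
```

Note that in this sketch the results come back in storage order within each location, not in the order of the requested rows; a real method would need to decide which ordering callers can rely on.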
What I don't remember offhand is whether all `Index` classes support internal locations. If not, that would be a problem.
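A quick way to find out empirically would be to audit the manifest rows of each database type and list the ones with no usable location. The helper below is a hypothetical sketch: `rows_without_location` is a made-up name, and it assumes rows are dict-like with an `internal_location` key, as in the function earlier in this issue.

```python
def rows_without_location(rows):
    """Return the rows whose 'internal_location' is missing, None, or empty."""
    return [row for row in rows if not row.get("internal_location")]


# Toy rows: one well-formed, one with an empty location, one with no key at all.
example_rows = [
    {"md5": "aaa", "internal_location": "signatures/aaa.sig.gz"},
    {"md5": "bbb", "internal_location": ""},
    {"md5": "ccc"},
]
problem_rows = rows_without_location(example_rows)
```

Any database type that yields a non-empty result here would need either a location assigned at manifest-build time or a different lookup key.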