sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

should we add an `Index.get` method (or something similar) to retrieve signatures? #1848

Open ctb opened 2 years ago

ctb commented 2 years ago

In https://github.com/sourmash-bio/sourmash/pull/1837, we change `sourmash sig extract` to identify the signatures to extract using manifest rows. We then have to convert those rows into a manifest, and from there into a picklist, in order to actually extract the sketches. This seems circuitous.

It also means that `sourmash sig extract --picklist` does not work on certain database types that do not support multiple picklists: LCA DBs, SBTs, and zipfiles w/o a manifest, for example.

Two ideas, not mutually exclusive -

one, we could have Index classes provide a signature getter that works on internal locations in manifests.

two, we could directly provide a method for retrieving many signatures, given a manifest (or, really, just a list of internal locations).
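
To make the two ideas concrete, here is a minimal sketch on a toy in-memory index. Everything in it (`ToySketch`, `ToyIndex`, `get`, `get_many`) is hypothetical and invented for illustration; it is not the sourmash `Index` API, and it assumes every sketch has an internal location and an md5.

```python
from dataclasses import dataclass, field


@dataclass
class ToySketch:
    """Stand-in for a signature; only the fields used below (hypothetical)."""
    md5: str
    name: str


@dataclass
class ToyIndex:
    """Hypothetical index mapping internal locations to stored sketches."""
    # internal_location -> list of sketches stored at that location
    _by_location: dict = field(default_factory=dict)

    def insert(self, internal_location, sketch):
        self._by_location.setdefault(internal_location, []).append(sketch)

    def get(self, internal_location):
        """Idea one: return all sketches stored at one internal location."""
        try:
            return list(self._by_location[internal_location])
        except KeyError:
            raise ValueError(f"no such internal location: {internal_location!r}")

    def get_many(self, rows):
        """Idea two: yield the single sketch matching each manifest-like row."""
        for row in rows:
            matches = [ss for ss in self.get(row["internal_location"])
                       if ss.md5 == row["md5"]]
            if len(matches) != 1:
                raise ValueError(f"expected exactly one match for {row['md5']}")
            yield matches[0]


idx = ToyIndex()
idx.insert("signatures/a.sig.gz", ToySketch("aaa", "genome A"))
idx.insert("signatures/a.sig.gz", ToySketch("bbb", "genome A, k=31"))
idx.insert("signatures/c.sig.gz", ToySketch("ccc", "genome C"))

# Manifest-like rows: just the two columns the lookup needs.
rows = [{"internal_location": "signatures/a.sig.gz", "md5": "bbb"},
        {"internal_location": "signatures/c.sig.gz", "md5": "ccc"}]
picked = list(idx.get_many(rows))
```

In this sketch `get_many` is built on `get`, which is one way the two ideas could coexist; a real storage backend might prefer to implement the bulk method directly.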

What I don't remember offhand is whether all Index classes support internal locations. If not, that would be a problem.

ctb commented 2 years ago

Some things going on in the SqliteIndex PR https://github.com/sourmash-bio/sourmash/pull/1808 make me think that we should enable individual retrieval via manifest row. That gives storages the ability to figure out what collection of information is best, including off-label manifest row columns like primary keys in sqlite databases...
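
As a rough illustration of retrieval via manifest row with an off-label column, here is a sketch using only the standard-library `sqlite3` module. The table layout, the `_id` column name, and the `manifest_rows` / `get_by_row` helpers are assumptions made up for this example, not what SqliteIndex does.

```python
import sqlite3

# Toy database: one row per sketch, with an integer primary key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sketches (id INTEGER PRIMARY KEY, md5 TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO sketches (md5, payload) VALUES (?, ?)",
    [("aaa", "sketch-A"), ("bbb", "sketch-B")],
)


def manifest_rows(conn):
    """Build manifest-like rows that carry the extra (hypothetical) '_id' column."""
    cur = conn.execute("SELECT id, md5 FROM sketches ORDER BY id")
    return [{"md5": md5, "internal_location": ":memory:", "_id": pk}
            for pk, md5 in cur]


def get_by_row(conn, row):
    """Fetch one payload, preferring the primary key when the row has one."""
    if "_id" in row:
        cur = conn.execute(
            "SELECT payload FROM sketches WHERE id = ?", (row["_id"],)
        )
    else:
        # Fall back to the md5 for rows that came from a generic manifest.
        cur = conn.execute(
            "SELECT payload FROM sketches WHERE md5 = ?", (row["md5"],)
        )
    found = cur.fetchall()
    if len(found) != 1:
        raise ValueError("no unique match to requested row in db")
    return found[0][0]


rows = manifest_rows(conn)
```

The point of the sketch is that the caller only passes rows back to the storage that produced them, so the storage is free to use whichever column gives the cheapest lookup.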

ctb commented 8 months ago

over in calc-full-gather https://github.com/ctb/2024-calc-full-gather ref https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/187, I wrote a generic function that used manifest rows to load specific sketches from a zip file:

import sourmash

def zipfile_load_ss_from_row(db, row):
    # load the raw signature data stored at this row's internal location
    data = db.storage.load(row['internal_location'])
    sigs = sourmash.signature.load_signatures(data)

    # a location may hold several sketches; keep the one matching the row's md5
    return_sig = None
    for ss in sigs:
        if ss.md5sum() == row['md5']:
            assert return_sig is None # there can only be one!
            return_sig = ss

    if return_sig is None:
        raise ValueError("no match to requested row in db")
    return return_sig

Curious how this approach would generalize to all Index classes, and also how it would interact with the Rust Collection layer.
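
One possible direction for generalizing, sketched under assumptions: keep the row-matching logic backend-agnostic and let each backend supply only a "load everything at this location" callable. The sketch also groups rows by `internal_location` so that each location is read once. The names `load_rows_grouped`, `load_location`, and `md5_of` are hypothetical; for a zip file, `load_location` could presumably wrap the `db.storage.load` plus `load_signatures` calls from the function above. It assumes md5s are unique across the requested rows.

```python
from collections import defaultdict


def load_rows_grouped(rows, load_location, md5_of):
    """Return {md5: sketch} for `rows`, reading each internal location once.

    `load_location(location)` returns an iterable of sketches, and
    `md5_of(sketch)` returns a sketch's md5; both are supplied by the caller.
    """
    # Group the requested md5s by the location that holds them.
    wanted = defaultdict(set)
    for row in rows:
        wanted[row["internal_location"]].add(row["md5"])

    found = {}
    for location, md5s in wanted.items():
        for ss in load_location(location):
            md5 = md5_of(ss)
            if md5 in md5s:
                assert md5 not in found  # there can only be one!
                found[md5] = ss

    missing = {m for md5s in wanted.values() for m in md5s} - set(found)
    if missing:
        raise ValueError(f"no match to requested rows in db: {sorted(missing)}")
    return found


# Toy backend: each location holds (md5, name) tuples instead of real sketches.
store = {"a.sig": [("aaa", "A"), ("bbb", "B")], "c.sig": [("ccc", "C")]}
calls = []  # records which locations were actually read


def toy_load(location):
    calls.append(location)
    return store[location]


rows = [{"internal_location": "a.sig", "md5": "aaa"},
        {"internal_location": "c.sig", "md5": "ccc"},
        {"internal_location": "a.sig", "md5": "bbb"}]
result = load_rows_grouped(rows, toy_load, md5_of=lambda ss: ss[0])
```

Here `a.sig` is read once even though two rows point at it, which matters when a single location holds many sketches.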