sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

how might we distribute "diff" or patch databases? #985

Open ctb opened 4 years ago

ctb commented 4 years ago

re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?

random thoughts -

this could tie into some of the work that @luizirber is doing with IPFS, I suspect.

luizirber commented 4 years ago

re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?

I think we should use the multi-DB capabilities in search/gather to:

Taxonomy-wise, include the taxinfo in full builds, but each week provide updated taxinfo covering the full build plus that week. I think this avoids having to dig into every single DB for taxonomy information. Taxonomy is also updated more frequently: the signature/original dataset will never change for a specific version, but the taxonomic assignment DOES change. So this allows more accurate results, and older gather CSVs can be updated with newer tax assignments without having to re-run gather, for example.

updating taxonomy is no problem if we can override taxonomy per #969 (comment) - just provide an updated lineage db. they're small.

I think we should continue reporting the dataset ID (be it GCA for genbank, or GCF for refseq, or similarly for GTDB) in the sourmash outputs, and then provide the taxinfo with the mapping to the lineage (connected to point above, about updating old results without having to re-run sourmash)
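A rough sketch of what "updating old results without re-running sourmash" could look like, assuming the gather CSV keeps the dataset ID at the start of its `name` column and the taxinfo is a simple two-column CSV. The `ident` and `lineage` column names, the accessions, and both helper functions are made up for illustration; this is not the sourmash API.

```python
import csv
import io

def load_lineages(lineage_csv_text):
    """Map dataset ID -> lineage string (hypothetical 'ident'/'lineage' columns)."""
    reader = csv.DictReader(io.StringIO(lineage_csv_text))
    return {row["ident"]: row["lineage"] for row in reader}

def annotate_gather(gather_csv_text, lineages):
    """Attach the current lineage to each old gather row, keyed on the dataset ID."""
    annotated = []
    for row in csv.DictReader(io.StringIO(gather_csv_text)):
        # Assumption: the match name begins with the accession, e.g. "GCA_000001.1 ...".
        ident = row["name"].split(" ")[0]
        row["lineage"] = lineages.get(ident, "unassigned")
        annotated.append(row)
    return annotated

# Made-up example data: an old gather result and this week's taxinfo.
old_gather = (
    "name,f_unique_to_query\n"
    "GCA_000001.1 Escherichia coli,0.42\n"
    "GCF_000002.1 Unknown sp.,0.10\n"
)
new_lineages = "ident,lineage\nGCA_000001.1,d__Bacteria;g__Escherichia\n"

rows = annotate_gather(old_gather, load_lineages(new_lineages))
for row in rows:
    print(row["name"], "->", row["lineage"])
```

The point is only that the join happens on the reported ID, so the lineage file can be swapped out at any time without touching the search results.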

updating LCA and SBTs is more of a problem. even if we could remove signatures from them, I would prefer not to ask people to re-download massive dbs! right now we don't have the ability to "screen" signatures from searches in those databases. I suspect this wouldn't be too hard to add, perhaps via a selector API or something else. basically, add a way to specify that "this signature, as identified by md5sum, should never be reported."

These two go together, I suspect: we can remove files from full builds, and maybe provide the 'screen' in weekly builds (as an additional file in the DB?) to indicate what matches to skip?
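To make the 'screen' idea concrete, here is a minimal sketch that assumes the screen is a plain text file of md5sums, one per line, and that matches are plain dicts carrying an `md5` key. The file format, the function names, and the example values are all hypothetical; nothing like this exists in sourmash as far as this thread says.

```python
def load_screen(screen_text):
    """Parse a hypothetical screen file: one md5sum per line, '#' starts a comment."""
    md5s = set()
    for line in screen_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            md5s.add(line)
    return md5s

def screen_matches(matches, screened_md5s):
    """Drop any match whose signature md5sum is listed in the screen."""
    return [m for m in matches if m["md5"] not in screened_md5s]

# Made-up screen shipped with a weekly build, and made-up search results.
screen_text = "# signatures that should never be reported\naaa111\n\nccc333  # retracted\n"
matches = [
    {"name": "genome A (old version)", "md5": "aaa111", "similarity": 0.91},
    {"name": "genome B", "md5": "bbb222", "similarity": 0.88},
    {"name": "genome C (retracted)", "md5": "ccc333", "similarity": 0.75},
]

kept = screen_matches(matches, load_screen(screen_text))
print([m["name"] for m in kept])
```

Filtering at report time like this would leave the full build untouched, which is the appeal: only the small screen file changes from week to week.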

Note: Why would we want to remove signatures? We always provide the latest version of a genome, and so we need to remove the old one? Can genbank/refseq submission be retracted?

this could tie into some of the work that @luizirber is doing with IPFS, I suspect.

Incoming brain dump!

#456 is a fairly old PR, and I don't even know how to properly rebase it for today's codebase, but most of it migrated to other PRs:

One thing still left (and connected to this issue) is the prepare command. The idea is to take an index description (a .sbt.json file) and prepare a local copy for usage. There is a test showing how to use a .sbt.json with IPFS as storage, and load it into a FSStorage (hidden dir) locally. The _fill_up/repair comes into play because the IPFS .sbt.json can be leaf-only, and during prepare the steps would be:

There are a bunch of optimizations that can be done to avoid consuming too much memory:

So: I think this connects with IPFS and this issue because, instead of providing full ZIP files, we could provide only the description and change the instructions to run prepare before using a DB. This is less convenient than wget/curl-ing a DB, but if we are providing frequent updates it is simply unsustainable to keep all that (redundant) data available permanently. Unless we find some sort of funding/sponsorship for it...

ctb commented 4 years ago

wow, that went in a direction all right. Not sure how to respond to the IPFS stuff, have to re-read that or maybe brainstorm in person :).

re

Note: Why would we want to remove signatures? We always provide the latest version of a genome, and so we need to remove the old one? Can genbank/refseq submission be retracted?

yes, some genomes are just broken and get removed or deprecated, and I don't think they should be available for search.

Note, for the genomeRxiv work, we will face similar questions of how to provide regular database updates. Since we should have actual funding for that, maybe that's a place to dig in!

luizirber commented 4 years ago

Feedback from personal comm:

Anything dead simple to retrieve and use. FTP is blocked at XXX and other institutions, which I never would've believed when I was previously in academia. Even fetching rust libraries was blocked here.

ctb commented 3 years ago

#1477 could add support for "masking" arbitrary signatures from search and gather.

ctb commented 3 years ago

see also https://github.com/dib-lab/sourmash/issues/433

ctb commented 3 years ago

a few quick thoughts -

ctb commented 2 years ago

this is a fascinating situation where we could actually use manifests. just thinking out loud:

my first (bad) idea is that we could simply edit manifests, since (as noted in https://github.com/sourmash-bio/sourmash/issues/1849) there are situations where they don't necessarily contain all signatures, anyway.

a second (better?) idea is to add a 'deprecated' field that marks the signature as something to ignore.

a third (maybe actually good?) idea is to add a 'deprecated by' column that points at another signature (maybe an md5?).

a fourth (also maybe actually good) idea is to add a 'deprecates' column in database manifests that would support ignoring signatures in older databases. not sure how to best indicate which signature to ignore - md5 + identifier, maybe?

the first three ideas all involve modifying old databases. boo. the fourth only involves modifying new databases.
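A minimal sketch of the fourth idea, assuming manifests are CSV files with `md5` and `name` columns and that the new database's manifest gains a `deprecates` column holding the md5 of the signature it replaces. The `deprecates` column does not exist today, and the function name and example rows are made up for illustration.

```python
import csv
import io

def active_rows(old_manifest_text, new_manifest_text):
    """Combine two manifests, ignoring old rows that the new manifest deprecates."""
    old_rows = list(csv.DictReader(io.StringIO(old_manifest_text)))
    new_rows = list(csv.DictReader(io.StringIO(new_manifest_text)))
    # Hypothetical 'deprecates' column: md5 of the older signature to ignore.
    deprecated = {r["deprecates"] for r in new_rows if r.get("deprecates")}
    kept_old = [r for r in old_rows if r["md5"] not in deprecated]
    return kept_old + new_rows

# Made-up manifests: the old database is left unmodified.
old_manifest = "md5,name\nabc,GCA_1.1 old assembly\ndef,GCA_2.1 unchanged\n"
new_manifest = "md5,name,deprecates\nghi,GCA_1.2 revised assembly,abc\n"

rows = active_rows(old_manifest, new_manifest)
print([r["name"] for r in rows])
```

Only the new manifest carries the extra column here, so older databases never need to be re-released. Matching on md5 alone is the simplest choice; md5 + identifier would need a second column or a combined key.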

ctb commented 2 years ago

keyword search bait: database updates, update databases, incremental database updates