ctb commented 4 years ago

re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?

random thoughts -

updating taxonomy is no problem if we can override taxonomy per https://github.com/dib-lab/sourmash/issues/969#issuecomment-621253694 - just provide an updated lineage db. they're small.
updating LCA and SBTs is more of a problem. even if we could remove signatures from them, I would prefer not to ask people to re-download massive dbs!
right now we don't have the ability to "screen" signatures from searches in those databases. I suspect this wouldn't be too hard to add, perhaps via a selector API or something else. basically, add a way to specify that "this signature, as identified by md5sum, should never be reported."

this could tie into some of the work that @luizirber is doing with IPFS, I suspect.
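
A rough sketch of the "screen" idea above, in case it helps the discussion. This is not sourmash code: `Match` and `screen_matches` are made-up names, and the md5 values are toy strings. It only shows that filtering hits against a set of blocked md5sums at report time is cheap and does not require touching the database itself.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Match(NamedTuple):
    """Hypothetical search hit (not a sourmash type)."""
    similarity: float
    md5: str
    name: str


def screen_matches(matches: Iterable[Match], blocked_md5s: set[str]) -> list[Match]:
    """Return the matches whose md5sum is not on the block list."""
    return [m for m in matches if m.md5 not in blocked_md5s]


# Toy data: the md5 strings are placeholders, not real signature md5sums.
matches = [
    Match(0.91, "aaa111", "genome A (current)"),
    Match(0.88, "bbb222", "genome B (withdrawn)"),
]
kept = screen_matches(matches, blocked_md5s={"bbb222"})
print([m.name for m in kept])
```

The block list itself could be a small file shipped next to the database, so nobody has to re-download the big files.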

luizirber commented 4 years ago

re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?

I think we should use the multi-DB capabilities in search/gather to:

release a full build every 2-3 months
release diffs each week, with only new genomes added

Not sure how to name them, but <year>.<week-of-the-year>.sbt.zip might work for the latter.
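
One possible way to generate names following the <year>.<week-of-the-year>.sbt.zip pattern, as a sketch only. The function name `weekly_diff_name`, the use of ISO week numbering, and the zero-padding of the week are my assumptions, not something decided in this thread.

```python
import datetime


def weekly_diff_name(day: datetime.date) -> str:
    """Build '<year>.<week>.sbt.zip' from the ISO year and ISO week of `day`."""
    iso = day.isocalendar()
    iso_year, iso_week = iso[0], iso[1]
    # Zero-pad the week so that names sort correctly as plain strings.
    return f"{iso_year}.{iso_week:02d}.sbt.zip"


print(weekly_diff_name(datetime.date(2020, 5, 4)))  # 2020.19.sbt.zip
```

Using the ISO year rather than the calendar year keeps the last days of December and the first days of January in the same weekly file when they share an ISO week.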

Taxonomy-wise, include the taxinfo in full builds, but for each week provide the updated taxinfo for full build + this week. I think this avoids issues with having to dig into every single DB for taxonomy information, and since taxonomy is also updated 'more frequently' (the signature/original dataset will never change for a specific version, but the taxonomic assignment DOES change) this allows more accurate results (and older gather CSVs can be updated with newer tax assignments without having to re-run gather, for example).

updating taxonomy is no problem if we can override taxonomy per #969 (comment) - just provide an updated lineage db. they're small.

I think we should continue reporting the dataset ID (be it GCA for genbank, or GCF for refseq, or similarly for GTDB) in the sourmash outputs, and then provide the taxinfo with the mapping to the lineage (connected to point above, about updating old results without having to re-run sourmash)
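
To make the "update old results without re-running gather" point concrete, here is a minimal sketch. It assumes the gather CSV has a `name` column whose first whitespace-separated token is the dataset ID, and that the taxinfo is a CSV with `ident` and `lineage` columns. Those column names, the helper names, and the accession and lineage strings are all illustrative assumptions.

```python
from __future__ import annotations

import csv
import io


def load_lineages(taxinfo_csv: str) -> dict[str, str]:
    """Map dataset ID -> lineage string from a (hypothetical) taxinfo CSV."""
    reader = csv.DictReader(io.StringIO(taxinfo_csv))
    return {row["ident"]: row["lineage"] for row in reader}


def annotate(gather_csv: str, lineages: dict[str, str]) -> list[dict[str, str]]:
    """Attach the current lineage to each gather row, keyed by dataset ID."""
    rows = []
    for row in csv.DictReader(io.StringIO(gather_csv)):
        ident = row["name"].split(" ")[0]  # assumed: ID is the first token of the name
        row["lineage"] = lineages.get(ident, "unassigned")
        rows.append(row)
    return rows


# Toy inputs: the accession and lineages below are made up.
gather_csv = "name,f_match\nGCF_000000001.1 some genome,0.8\n"
old_tax = "ident,lineage\nGCF_000000001.1,Bacteria;OldGenus\n"
new_tax = "ident,lineage\nGCF_000000001.1,Bacteria;NewGenus\n"

print(annotate(gather_csv, load_lineages(old_tax))[0]["lineage"])
print(annotate(gather_csv, load_lineages(new_tax))[0]["lineage"])
```

The gather output never changes here; only the small taxinfo file is swapped.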

updating LCA and SBTs is more of a problem. even if we could remove signatures from them, I would prefer not to ask people to re-download massive dbs! right now we don't have the ability to "screen" signatures from searches in those databases. I suspect this wouldn't be too hard to add, perhaps via a selector API or something else. basically, add a way to specify that "this signature, as identified by md5sum, should never be reported."

These two go together, I suspect: we can remove files from full builds, and maybe provide the 'screen' in weekly builds (as an additional file in the DB?) to indicate what matches to skip?

Note: Why would we want to remove signatures? We always provide the latest version of a genome, and so we need to remove the old one? Can genbank/refseq submission be retracted?

this could tie into some of the work that @luizirber is doing with IPFS, I suspect.

Incoming brain dump!

#456 is a fairly old PR, and I don't even know how to properly rebase it onto today's codebase, but most of it migrated to other PRs:

ZipStorage in #648
Split nodes into internal and leaves
the _fill_up method for doing bottom-up (leaves to root) processing of the SBT (used for setting min_n_below)
unload appeared briefly, but was not completely defined like in #784
(update_internal parameter for SBTs was not ported, and I think it is actually damaging for #925, so probably skip it)

One thing still left (and connected to this issue) is the prepare command. The idea is to take an index description (a .sbt.json file) and prepare a local copy for usage. There is a test showing how to use a .sbt.json with IPFS as storage, and load it into a FSStorage (hidden dir) locally. The _fill_up/repair comes into play because the IPFS .sbt.json can be leaf-only, and during prepare the steps would be:

Download all leaves (potentially in parallel)
Run _fill_internal, which creates all the internal nodes
Save to a new local SBT

There are a bunch of optimizations that can be done to avoid consuming too much memory:

As leaves are downloaded, save it to the Storage (they won't change)
For the internal level right above the leaves, build the internal node if all leaves are available, save it to storage, and unload it (and the leaves under it)
When root is reached, save the index description

This also fits well with the zipped SBTs.
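
A toy sketch of the bottom-up step described above (building internal nodes from a leaf-only tree). This is not the sourmash implementation: plain Python sets stand in for Bloom filters, the tree is assumed to be binary with heap-style positions (children of node i at 2i+1 and 2i+2), and `fill_internal_sketch` is a made-up name. It ignores the save/unload memory optimizations and only shows the ordering.

```python
from __future__ import annotations


def parent(pos: int) -> int:
    """Parent position in a binary heap layout (assumed layout)."""
    return (pos - 1) // 2


def fill_internal_sketch(leaves: dict[int, set[str]]) -> dict[int, set[str]]:
    """Return leaves plus internal nodes, each internal node the union of its subtree."""
    nodes = {pos: set(content) for pos, content in leaves.items()}
    # Create an empty internal node for every ancestor of a leaf.
    for pos in leaves:
        while pos > 0:
            pos = parent(pos)
            nodes.setdefault(pos, set())
    # A child always has a larger position than its parent, so descending
    # order guarantees a node is complete before it is merged upward.
    for pos in sorted(nodes, reverse=True):
        if pos > 0:
            nodes[parent(pos)] |= nodes[pos]
    return nodes


# Four leaves at the bottom level of a 7-node tree; contents are toy hashes.
leaves = {3: {"a"}, 4: {"b"}, 5: {"c"}, 6: {"d"}}
tree = fill_internal_sketch(leaves)
print(sorted(tree[0]))
```

In a real prepare run each internal node could be written to storage and unloaded as soon as its children have been merged into it, which is the optimization listed above.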

So: I think this connects with IPFS and this issue because, instead of providing full ZIP files, we could provide only the description and change the instructions to run prepare before using a DB. This is less convenient than fetching a DB with wget/curl, but if we are providing frequent updates it is simply unsustainable to keep all that (redundant) data available permanently. Unless we find some sort of funding/sponsorship for it...

ctb commented 4 years ago

wow, that went in a direction all right. Not sure how to respond to the IPFS stuff, have to re-read that or maybe brainstorm in person :).

re

Note: Why would we want to remove signatures? We always provide the latest version of a genome, and so we need to remove the old one? Can genbank/refseq submission be retracted?

yes, some genomes are just broken and get removed or deprecated, and I don't think they should be available for search.

Note, for the genomeRxiv work, we will face similar questions of how to provide regular database updates. Since we should have actual funding for that, maybe that's a place to dig in!

luizirber commented 4 years ago

Feedback from personal comm:

Anything dead simple to retrieve and use. FTP is blocked at XXX and other institutions which I never would've believed when I was previously in academia. Even fetching rust libraries was blocked here.

ctb commented 3 years ago

#1477 could add support for "masking" arbitrary signatures from search and gather.

ctb commented 3 years ago

a few quick thoughts -

picklist include and exclude can be used by pipelines to include only the updated signatures as well as exclude signatures/databases that have already been searched
while 'gather' results cannot easily be updated to reflect new databases, prefetch results can be and that then allows more efficient updating of gather.
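
A small sketch of how a pipeline might work out which signatures still need searching and write them out as a one-column CSV for use as a picklist. The helper names and the `md5` column name are assumptions for illustration, and the md5 values are placeholders; nothing here calls sourmash.

```python
from __future__ import annotations

import csv
import io
from typing import Iterable, TextIO


def new_md5s(current: set[str], already_searched: set[str]) -> list[str]:
    """md5sums present in the current release that were not searched before."""
    return sorted(current - already_searched)


def write_md5_picklist(md5s: Iterable[str], out: TextIO) -> None:
    """Write a single-column CSV with an 'md5' header (column name assumed)."""
    writer = csv.writer(out)
    writer.writerow(["md5"])
    for md5 in md5s:
        writer.writerow([md5])


# Toy md5 values, not real signature md5sums.
todo = new_md5s(current={"a1", "b2", "c3"}, already_searched={"a1"})
buf = io.StringIO()
write_md5_picklist(todo, buf)
print(buf.getvalue())
```

The same list inverted (the already-searched md5sums) would serve as the exclude side.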

ctb commented 2 years ago

this is a fascinating situation where we could actually use manifests. just thinking out loud:

my first (bad) idea is that we could simply edit manifests, since (as noted in https://github.com/sourmash-bio/sourmash/issues/1849) there are situations where they don't necessarily contain all signatures, anyway.

a second (better?) idea is to add a 'deprecated' field that marks the signature as something to ignore.

a third (maybe actually good?) idea is to add a 'deprecated by' column that points at another signature (maybe an md5?).

a fourth (also maybe actually good) idea is to add a 'deprecates' column in database manifests that would support ignoring signatures in older databases. not sure how to best indicate which signature to ignore - md5 + identifier, maybe?

the first three ideas all involve modifying old databases. boo. the fourth only involves modifying new databases.
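
A sketch of how the fourth idea could behave, assuming manifests are read as lists of dict rows and that a hypothetical `deprecates` column holds semicolon-separated md5sums of signatures to ignore in older databases. The column format and the function name are my guesses, not an existing manifest feature.

```python
from __future__ import annotations


def effective_rows(
    old_manifest: list[dict[str, str]],
    new_manifest: list[dict[str, str]],
) -> list[dict[str, str]]:
    """Combine manifests, dropping old rows that a new row deprecates."""
    deprecated: set[str] = set()
    for row in new_manifest:
        # 'deprecates' is a hypothetical column: md5sums separated by ';'.
        for md5 in row.get("deprecates", "").split(";"):
            if md5:
                deprecated.add(md5)
    kept_old = [row for row in old_manifest if row["md5"] not in deprecated]
    return kept_old + list(new_manifest)


# Toy manifests; md5 values and names are placeholders.
old = [{"md5": "aaa", "name": "genome X v1"}, {"md5": "bbb", "name": "genome Y v1"}]
new = [{"md5": "ccc", "name": "genome X v2", "deprecates": "aaa"}]
print([row["name"] for row in effective_rows(old, new)])
```

Only the new manifest carries the extra column, so the old database files stay untouched.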

ctb commented 2 years ago

keyword search bait: database updates, update databases, incremental database updates

sourmash-bio / sourmash

how might we distribute "diff" or patch databases? #985
