ctb opened 4 years ago
re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?
I think we should use the multi-DB capabilities in `search`/`gather` to combine a periodic full build with smaller weekly updates; a naming scheme like `<year>.<week-of-the-year>.sbt.zip` might work for the latter.

Taxonomy-wise, include the `taxinfo` in full builds, but for each week provide the updated `taxinfo` for the full build + this week. I think this avoids issues with having to dig into every single DB for taxonomy information. And since taxonomy is updated more frequently than the data itself (the signature/original dataset will never change for a specific version, but the taxonomic assignment DOES change), this allows more accurate results (and older `gather` CSVs can be updated with newer tax assignments without having to re-run `gather`, for example).
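A small sketch of how a `<year>.<week-of-the-year>` name could be derived from the ISO calendar (illustrative only, not actual sourmash code):

```python
import datetime

def weekly_db_name(date: datetime.date) -> str:
    """Build a '<year>.<week-of-the-year>.sbt.zip' name from a date."""
    iso = date.isocalendar()  # (ISO year, ISO week number, ISO weekday)
    return f"{iso[0]}.{iso[1]:02d}.sbt.zip"

print(weekly_db_name(datetime.date(2021, 1, 4)))  # 2021-01-04 is in ISO week 1
```

Using the ISO calendar avoids ambiguity around year boundaries (the first days of January can belong to the last ISO week of the previous year).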
updating taxonomy is no problem if we can override taxonomy per #969 (comment) - just provide an updated lineage db. they're small.
I think we should continue reporting the dataset ID (be it `GCA` for genbank, `GCF` for refseq, or similar for GTDB) in the sourmash outputs, and then provide the `taxinfo` with the mapping to the lineage (connected to the point above, about updating old results without having to re-run sourmash).
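To illustrate re-annotating old results: if `gather` output keeps the dataset ID, an updated lineage mapping can be joined onto old CSVs without re-running `gather`. A hypothetical sketch (the column names here are made up, not the actual sourmash CSV schema):

```python
# Hypothetical rows from an old gather CSV: dataset IDs, no (or stale) lineage.
old_gather = [
    {"name": "GCF_000005845.2", "f_match": "0.98"},
    {"name": "GCA_000001405.15", "f_match": "0.10"},
]

# Updated lineage db: small, shipped separately, replaceable at any time.
lineages = {
    "GCF_000005845.2": "d__Bacteria;...;s__Escherichia coli",
    "GCA_000001405.15": "d__Eukaryota;...;s__Homo sapiens",
}

# Re-annotate without touching the signatures or re-running gather.
for row in old_gather:
    row["lineage"] = lineages.get(row["name"], "")

print(old_gather[0]["lineage"])
```

The signatures and match fractions never change, so only the small lineage mapping needs to be re-downloaded.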
updating LCA and SBTs is more of a problem. even if we could remove signatures from them, I would prefer not to ask people to re-download massive dbs! right now we don't have the ability to "screen" signatures from searches in those databases. I suspect this wouldn't be too hard to add, perhaps via a selector API or something else. basically, add a way to specify that "this signature, as identified by md5sum, should never be reported."
These two go together, I suspect: we can remove files from full builds, and maybe provide the 'screen' in weekly builds (as an additional file in the DB?) to indicate what matches to skip?
Note: Why would we want to remove signatures? We always provide the latest version of a genome, and so we need to remove the old one? Can genbank/refseq submission be retracted?
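The 'screen' idea above could look like a simple post-filter on search results, keyed by md5sum. This is purely hypothetical — no such selector API exists in sourmash yet:

```python
def screen_results(results, screened_md5s):
    """Drop any match whose signature md5sum is on the blocklist."""
    return [r for r in results if r["md5"] not in screened_md5s]

# Placeholder md5sums and names, for illustration only.
results = [
    {"md5": "d41d8cd98f00b204e9800998ecf8427e", "name": "retracted genome"},
    {"md5": "0cc175b9c0f1b6a831c399e269772661", "name": "good genome"},
]
screened = {"d41d8cd98f00b204e9800998ecf8427e"}

print(screen_results(results, screened))  # only the good genome remains
```

The screen file itself could be the "additional file in the DB" mentioned above: a flat list of md5sums to suppress.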
this could tie into some of the work that @luizirber is doing with IPFS, I suspect.
Incoming brain dump!
- `_fill_up` method for doing bottom-up (leaves to root) processing of the SBT (used for setting `min_n_below`)
- `unload` appeared briefly, but was not completely defined like in #784
- `update_internal` parameter for SBTs was not ported (and I think it is actually damaging for #925, so probably skip it)

One thing still left (and connected to this issue) is the `prepare` command. The idea is to take an index description (a `.sbt.json` file) and prepare a local copy for usage. There is a test showing how to use a `.sbt.json` with IPFS as storage, and load it into an `FSStorage` (hidden dir) locally. The `_fill_up`/repair comes into play because the IPFS `.sbt.json` can be leaf-only, and during `prepare` the steps would be:

- `_fill_internal`, which creates all the internal nodes

There are a bunch of optimizations that can be done to avoid consuming too much memory.
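A minimal sketch of the leaves-to-root idea behind `_fill_up`, using a toy binary tree instead of the real SBT structures (the node layout and the min-style aggregation are assumptions for illustration, not sourmash internals):

```python
class Node:
    def __init__(self, n=None, left=None, right=None):
        self.n = n        # defined only on leaves initially
        self.left = left
        self.right = right

def fill_up(node):
    """Post-order traversal (leaves to root): set each internal node's
    value from its children, e.g. a min_n_below-style minimum."""
    if node.left is None and node.right is None:
        return node.n
    children = [fill_up(c) for c in (node.left, node.right) if c is not None]
    node.n = min(children)
    return node.n

root = Node(left=Node(n=5), right=Node(left=Node(n=3), right=Node(n=7)))
fill_up(root)
print(root.n)  # 3
```

Because each internal node only needs its children's results, the same pass can rebuild all internal nodes from a leaf-only description during `prepare`.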
So: I think this connects with IPFS and this issue because, instead of providing full ZIP files, we could provide only the description and change the instructions to run `prepare` before using a DB. This is less convenient than `wget`/`curl`-ing a DB, but if we are providing frequent updates it is simply unsustainable to keep all that (redundant) data available permanently. Unless we find some sort of funding/sponsorship for it...
wow, that went in a direction all right. Not sure how to respond to the IPFS stuff, have to re-read that or maybe brainstorm in person :).
re:

> Note: Why would we want to remove signatures? We always provide the latest version of a genome, and so we need to remove the old one? Can genbank/refseq submission be retracted?
yes, some genomes are just broken and get removed or deprecated, and I don't think they should be available for search.
Note, for the genomeRxiv work, we will face similar questions of how to provide regular database updates. Since we should have actual funding for that, maybe that's a place to dig in!
Feedback from personal comm:

> Anything dead simple to retrieve and use. FTP is blocked at XXX and other institutions, which I never would've believed when I was previously in academia. Even fetching rust libraries was blocked here.
a few quick thoughts -
this is a fascinating situation where we could actually use manifests. just thinking out loud:
my first (bad) idea is that we could simply edit manifests, since (as noted in https://github.com/sourmash-bio/sourmash/issues/1849) there are situations where they don't necessarily contain all signatures, anyway.
a second (better?) idea is to add a 'deprecated' field that marks the signature as something to ignore.
a third (maybe actually good?) idea is to add a 'deprecated by' column that points at another signature (maybe an md5?).
a fourth (also maybe actually good) idea is to add a 'deprecates' column in database manifests that would support ignoring signatures in older databases. not sure how to best indicate which signature to ignore - md5 + identifier, maybe?
the first three ideas all involve modifying old databases. boo. the fourth only involves modifying new databases.
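Sketching the fourth idea: a 'deprecates' column in the newer database's manifest, used to build a skip-list that is applied when searching older databases. The column names and values below are hypothetical, not sourmash's actual manifest schema:

```python
# Older database manifest (hypothetical columns).
old_manifest = [
    {"md5": "aaa111", "ident": "GCF_000001.1"},
    {"md5": "bbb222", "ident": "GCF_000002.1"},
]

# Newer database manifest: 'deprecates' holds "md5 identifier" of the
# signature in an older database that this entry supersedes.
new_manifest = [
    {"md5": "ccc333", "ident": "GCF_000002.2", "deprecates": "bbb222 GCF_000002.1"},
]

# Collect everything deprecated by newer databases...
skip = {row["deprecates"].split()[0] for row in new_manifest if row.get("deprecates")}

# ...and ignore those rows when searching the older database.
visible = [row for row in old_manifest if row["md5"] not in skip]
print([row["ident"] for row in visible])  # ['GCF_000001.1']
```

This keeps old databases immutable: only the newly published manifest carries the information about what to skip.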
keyword search bait: database updates, update databases, incremental database updates
> re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?
random thoughts -
this could tie into some of the work that @luizirber is doing with IPFS, I suspect.