sourmash-bio / branchwater

Searching large collections of sequencing data with genome-scale queries
https://branchwater.sourmash.bio

Move metadata from mongodb into the index manifest? #18

Open luizirber opened 4 months ago

luizirber commented 4 months ago

Over at https://github.com/sourmash-bio/sourmash/issues/3006#issuecomment-1950258163 I mentioned adding extra columns to the manifest to hold metadata not available in a signature. I think we can take the same approach to store the SRA metadata in the manifest and remove the mongodb dependency, returning the metadata from the search index together with the containment.

More refs on the sourmash context: https://github.com/sourmash-bio/sourmash/issues/2180

But... is it a good idea?

Over at #4 I'm trying to make it easy to bring up a new branchwater installation, and there is a bit of a dance for building the index, bringing up mongo, loading metadata, and then bringing up the server/frontend. Moving the metadata into the index building step makes things easier, but it requires being able to update the manifest in the index in case we want different data (which is not that hard, it's a CSV). It can be more constraining for developing new frontend features, tho?
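A minimal sketch of what "updating the manifest" with extra metadata columns could look like, since the manifest is a CSV. Everything here is an assumption for illustration: the manifest is cut down to two columns (real sourmash manifests carry more columns and a leading version comment line), and the accessions, the `organism` and `geo_loc_name` columns, and the `extend_manifest` helper are hypothetical rather than anything branchwater does today.

```python
import csv
import io

# Hypothetical, simplified manifest: real sourmash manifests have more
# columns and a leading version comment line, both omitted here.
MANIFEST_CSV = """name,md5
SRR0000001,aaaa
SRR0000002,bbbb
"""

# Hypothetical SRA metadata keyed by accession (placeholder values).
SRA_METADATA = {
    "SRR0000001": {"organism": "soil metagenome", "geo_loc_name": "USA"},
    "SRR0000002": {"organism": "marine metagenome", "geo_loc_name": "Brazil"},
}


def extend_manifest(manifest_text, metadata, extra_columns):
    """Return manifest CSV text with the extra metadata columns appended."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    fieldnames = list(reader.fieldnames) + list(extra_columns)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        extra = metadata.get(row["name"], {})
        for col in extra_columns:
            # Missing metadata becomes an empty cell rather than an error.
            row[col] = extra.get(col, "")
        writer.writerow(row)
    return out.getvalue()


extended = extend_manifest(MANIFEST_CSV, SRA_METADATA, ["organism", "geo_loc_name"])
print(extended)
```

Swapping in different metadata then amounts to rerunning the helper with another `extra_columns` list, at the cost of rewriting the whole manifest.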

pinging @bluegenes and @SuzanneFleishman for ideas =]

SuzanneFleishman commented 4 months ago

Interesting idea @luizirber! Offhand I think it makes sense but I don't have a great handle on the mechanics. Were you thinking of a similar approach to filtering metadata from bigquery but just adding it to the manifest rather than the mongodb?

My initial thoughts on drawbacks:

1 - This would not constrain the frontend as-is, but may make it clunky if we did want to do some sort of overall visualization of every accession and its metadata (like here: https://web.app.ufz.de/marmdb/)?

2 - The way metadata is organized in the app right now could be drastically improved. I took a bit of a 'take what we get' approach, because what is available from the SRA varies so much and it would take a lot of time to slightly improve it. As branchwater updates, I'm assuming it wouldn't pull the entire SRA metadata, just the metadata for the newly added accessions? As long as adding the metadata to the manifest doesn't make it a huge pain to update the entire manifest (if, for example, the SRA reorganizes it or someone improves my filter method), I don't see an issue.

3 - One reason we went with mongodb is that it's super fast to search the accessions and pull the select metadata of interest in one query. How do you think the manifest will compare? I'm guessing all the metadata would be returned and we'd filter to the metadata of interest on the Flask server as a second step.
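The "filter as a second step" idea in point 3 could be sketched as below, assuming the index returns every metadata column alongside the containment and the server trims each row afterwards. The row layout, the `acc` and `containment` keys, and the `project` helper are hypothetical; this only imitates a database projection in plain Python and says nothing about relative speed.

```python
def project(results, fields):
    """Keep the accession, containment, and only the requested metadata fields."""
    keep = ("acc", "containment") + tuple(fields)
    return [{key: row[key] for key in keep if key in row} for row in results]


# Hypothetical search output with all metadata columns attached.
results = [
    {"acc": "SRR0000001", "containment": 0.91,
     "organism": "soil metagenome", "geo_loc_name": "USA"},
    {"acc": "SRR0000002", "containment": 0.42,
     "organism": "marine metagenome", "geo_loc_name": "Brazil"},
]

trimmed = project(results, ["organism"])
print(trimmed)
```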

bluegenes commented 4 months ago

I really like it from a simplicity point of view, plus extended manifests would support additional utility for sourmash and/or api access. I don't have a sense for potential performance drawbacks, though!

ctb commented 4 months ago

hot take: don't do it in the manifest by adding columns, but support multiple files that key on ident strings in the name like the taxonomy stuff in sourmash.

There's some description of this over in https://sourmash.readthedocs.io/en/latest/sourmash-internals.html#taxonomy-and-assigning-lineages, but for this crowd & channeling https://github.com/sourmash-bio/sourmash/issues/1790 -

taxonomy in sourmash works by getting results from (e.g.) sourmash gather that contain space-separated identifiers (ident - usually GenBank accessions in practice, but that is not required) and then cross-referencing those with a separate taxonomy database

So, for example, the following taxonomy spreadsheet:

ident,superkingdom,phylum,class,order,family,genus,species
GCF_014075335.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_000578955.1,d__Bacteria,p__Firmicutes,c__Bacilli,o__Staphylococcales,f__Staphylococcaceae,g__Staphylococcus,s__Staphylococcus aureus

would let us identify results for genomes with GCF_014075335.1 in the first "field" of their name as E. flexneri, and GCF_000578955.1 as S. aureus.
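As a rough sketch of that cross-referencing, using the two-row spreadsheet above: take the first space-separated field of each name as the ident and look it up in the taxonomy table. The genome names and the helper functions are hypothetical, and this is not sourmash's own implementation, which handles more cases.

```python
import csv
import io

TAXONOMY_CSV = """ident,superkingdom,phylum,class,order,family,genus,species
GCF_014075335.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_000578955.1,d__Bacteria,p__Firmicutes,c__Bacilli,o__Staphylococcales,f__Staphylococcaceae,g__Staphylococcus,s__Staphylococcus aureus
"""


def load_taxonomy(text):
    """Map each ident to its full lineage row."""
    return {row["ident"]: row for row in csv.DictReader(io.StringIO(text))}


def ident_from_name(name):
    """Return the first space-separated field of a signature name."""
    return name.split(" ", 1)[0]


taxonomy = load_taxonomy(TAXONOMY_CSV)

# Hypothetical result names: ident first, free-text description after it.
names = [
    "GCF_014075335.1 example genome description",
    "GCF_000578955.1 another example description",
    "GCF_999999999.9 not present in the taxonomy",
]

species = {}
for name in names:
    ident = ident_from_name(name)
    lineage = taxonomy.get(ident)
    # Idents missing from the spreadsheet are left unassigned (None).
    species[ident] = lineage["species"] if lineage else None

print(species)
```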

This scheme has proven to be pretty robust and debuggable in practice, and it allows us to support multiple different taxonomies in sourmash (I think we're up to NCBI, GTDB, LINs, and ICTV!) with only a moderate amount of @bluegenes blood, sweat, and tears.


So the modified proposal I'd suggest here -

we combine Index manifests with a separate file that supports all sorts of fun columns and/or hierarchies as discussed in https://github.com/sourmash-bio/sourmash/issues/2180; we do the appropriate inner joins where needed; and then fun/profit 💰
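One way the inner join could look, as a sketch only: search results are matched against a separate metadata file on the ident, and results with no metadata row are dropped, as are metadata rows with no result. The file layout, the `biome` and `collection_year` columns, the accessions, and the `inner_join` helper are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical standalone metadata file, keyed on ident.
METADATA_CSV = """ident,biome,collection_year
SRR0000001,soil,2019
SRR0000003,marine,2021
"""

# Hypothetical search output: the ident is the first field of the name.
search_results = [
    {"name": "SRR0000001", "containment": 0.91},
    {"name": "SRR0000002 extra description", "containment": 0.42},
]


def inner_join(results, metadata_text):
    """Attach metadata to each result whose ident appears in the metadata file."""
    metadata = {
        row["ident"]: row for row in csv.DictReader(io.StringIO(metadata_text))
    }
    joined = []
    for result in results:
        ident = result["name"].split(" ", 1)[0]
        if ident in metadata:
            # Inner join: only results with a matching metadata row survive.
            joined.append({**result, **metadata[ident]})
    return joined


joined = inner_join(search_results, METADATA_CSV)
print(joined)
```

Because the metadata lives in its own file, its columns can change without rebuilding the index or rewriting the manifest.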

In the case of branchwater-web we could more rapidly evolve the format of this metadata file to meet needs. heck, it could even remain in the mongodb, maybe.

over in sourmash I'd probably suggest adding generic support for this into our plugin interface so that we could try things out freely.

conveniently, this also would help support private/custom/user-specific metadata so that people could build up their own annotated/curated databases of SRA info and then use them as picklists for the search output - perhaps something to support in future versions of the Web app?