merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Revisiting the design of the anvi'o contigs database and genomes storage #840

meren closed 4 years ago

meren commented 6 years ago

Currently, the genomes storage is generated from internal and external genomes files and is used for the pangenomic workflow. Unfortunately, due to my oversight at the beginning, it has some shortcomings:

In theory, the reason we are not 'merging' contigs databases is that, while every single profile is a single profile, not every contigs database represents a single genome. A contigs database can have 100,000 contigs from a metagenomic assembly, or anywhere from 1 to fewer than 1,000 contigs from a single isolate, a single cell, or a population genome. Clearly, the latter are the kinds of contigs databases we wish to merge into genomes storages so we can do pangenomics. While it would have been lovely to do pangenomes at the metagenome level, and I think many people, including us, will be interested in that in the future, it is currently not quite computationally feasible (unless we start thinking about greedier strategies than reciprocal BLAST searches to identify 'edges' between genes).
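
To make the feasibility point concrete, here is a minimal sketch of what identifying 'edges' from reciprocal hits could look like, and how quickly the number of gene pairs grows. This is not anvi'o's implementation: the function names, the `(query, subject, bitscore)` tuple layout, and the toy gene identifiers are all assumptions made for illustration.

```python
def reciprocal_edges(hits):
    """Return gene pairs that are each other's best-scoring hit.

    `hits` is an iterable of (query_gene, subject_gene, bitscore) tuples,
    a hypothetical stand-in for parsed tabular search output.
    """
    best = {}  # query gene -> (best subject gene, best score)
    for query, subject, score in hits:
        if query == subject:
            continue  # ignore self hits
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)

    edges = set()
    for query, (subject, _) in best.items():
        # keep the pair only if the best hit is mutual
        if subject in best and best[subject][0] == query:
            edges.add(tuple(sorted((query, subject))))
    return edges


def pairwise_comparisons(num_genes):
    """Number of distinct gene pairs in an all-vs-all comparison."""
    return num_genes * (num_genes - 1) // 2


# Toy input (made-up identifiers and scores).
hits = [
    ("g1_a", "g2_a", 250.0),
    ("g2_a", "g1_a", 248.0),
    ("g1_b", "g2_a", 90.0),
    ("g2_b", "g1_b", 120.0),
    ("g1_b", "g2_b", 118.0),
]

edges = reciprocal_edges(hits)
```

Because the number of candidate pairs grows quadratically with the number of genes, going from a handful of genomes to a full metagenomic assembly multiplies the search space by orders of magnitude, which is the bottleneck described above.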

So, what to do about it? Here are my 2 cents:

I think comparative genomics is going to be a significant avenue for us to expand into, and this design bottleneck will only become more limiting over time.

meren commented 4 years ago

@ozcan has addressed this by changing the design completely for the genome view branch.