Revisiting the design of the anvi'o contigs database and genomes storage

Currently, the genomes storage is generated from internal and external genomes files, and is used for the pangenomic workflow. Unfortunately, due to my oversight at the beginning, it has some shortcomings:

While conceptually it is very similar to a merged profile (which is what we get when we merge single profiles), in implementation it is very different (i.e., it is not what we get when we merge contigs databases).
It doesn't keep track of contigs.
Classes that operate on contigs databases do not work with genomes storages, which creates unnecessary complexity in the codebase.

In theory, the reason we are not 'merging' contigs databases is that while every single profile is a single profile, not every contigs database represents a single genome. A contigs database can have 100,000 contigs from a metagenomic assembly, or anywhere from 1 to fewer than 1,000 contigs from a single isolate, single-cell, or population genome. Clearly the latter are the kinds of contigs databases we wish to merge into genomes storages so we can do pangenomics. While it would have been lovely to do pangenomes at the metagenome level, and I think it is something a lot of people, including us, will be interested in in the future, currently this is not quite computationally feasible (unless we start thinking about greedier strategies than reciprocal BLAST searches to identify 'edges' between genes).

So, what to do about it? Here are my 2 cents:

We add a parameter to the anvi-gen-contigs-database program, --type, which by default is metagenome, but the user has the option to set it to genome.
We change anvi-gen-genomes-storage in such a way that it only merges contigs databases of type genome.
We change the structure of contigs databases so they can be merged the way profile databases are merged (just as merged profiles have a column called sample_name, genomes storages can have a column called genome_name).
While we are at it, we need to change the way we store functions. If we are merging 100 genomes, and they are all annotated with COGs, there will be a crazy amount of redundancy in the functions table. The right way to do this is to implement a new class structure that works with multiple relational tables to manage functional annotations.
While we are doing all this, we continue to support the internal genomes concept so we maintain our ability to support the metapangenomics workflow.
Finally, we change the anvi-pan-genome program to work with these genomes storages.
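
To make the merging and the function storage ideas a bit more concrete, here is a rough sketch using Python's built-in sqlite3 module. It is not how anvi'o is implemented: every table and column name in it (meta, contigs, gene_functions, functions, gene_function_hits) and both helper functions are made up for illustration. Only the genome_name column and the genome / metagenome types come from the proposal above. The sketch merges only databases of type genome, tags every contig with its genome_name, and stores each distinct annotation once, with a separate table linking genes to annotations.

```python
import sqlite3


def make_contigs_db(db_type, contigs, gene_functions):
    """Build a toy in-memory 'contigs database' (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE meta (name TEXT PRIMARY KEY, value TEXT)")
    db.execute("INSERT INTO meta VALUES ('type', ?)", (db_type,))
    db.execute("CREATE TABLE contigs (contig TEXT, sequence TEXT)")
    db.executemany("INSERT INTO contigs VALUES (?, ?)", contigs)
    db.execute(
        "CREATE TABLE gene_functions "
        "(gene_callers_id INTEGER, source TEXT, accession TEXT, annotation TEXT)"
    )
    db.executemany("INSERT INTO gene_functions VALUES (?, ?, ?, ?)", gene_functions)
    return db


def merge_into_storage(named_dbs):
    """Merge {genome_name: contigs db} into one toy 'genomes storage'."""
    storage = sqlite3.connect(":memory:")
    storage.executescript(
        """
        CREATE TABLE contigs (genome_name TEXT, contig TEXT, sequence TEXT);
        -- each distinct annotation is stored exactly once ...
        CREATE TABLE functions (
            function_id INTEGER PRIMARY KEY,
            source TEXT, accession TEXT, annotation TEXT,
            UNIQUE (source, accession));
        -- ... and genes only point at it
        CREATE TABLE gene_function_hits (
            genome_name TEXT, gene_callers_id INTEGER,
            function_id INTEGER REFERENCES functions (function_id));
        """
    )
    for genome_name, db in named_dbs.items():
        (db_type,) = db.execute(
            "SELECT value FROM meta WHERE name = 'type'"
        ).fetchone()
        if db_type != "genome":
            continue  # metagenomes are not merged into a genomes storage

        for contig, sequence in db.execute("SELECT contig, sequence FROM contigs"):
            storage.execute(
                "INSERT INTO contigs VALUES (?, ?, ?)",
                (genome_name, contig, sequence),
            )

        rows = db.execute(
            "SELECT gene_callers_id, source, accession, annotation FROM gene_functions"
        )
        for gene_id, source, accession, annotation in rows:
            # No-op if this (source, accession) pair is already stored.
            storage.execute(
                "INSERT OR IGNORE INTO functions (source, accession, annotation) "
                "VALUES (?, ?, ?)",
                (source, accession, annotation),
            )
            (function_id,) = storage.execute(
                "SELECT function_id FROM functions WHERE source = ? AND accession = ?",
                (source, accession),
            ).fetchone()
            storage.execute(
                "INSERT INTO gene_function_hits VALUES (?, ?, ?)",
                (genome_name, gene_id, function_id),
            )
    storage.commit()
    return storage


# Toy input: two genomes sharing one COG annotation, plus one metagenome.
cog = ("COG", "COG0001", "toy annotation text")
storage = merge_into_storage({
    "G1": make_contigs_db("genome", [("c1", "ATGC")], [(0, *cog)]),
    "G2": make_contigs_db("genome", [("c1", "GGCC"), ("c2", "TTAA")], [(0, *cog)]),
    "M1": make_contigs_db("metagenome", [("c1", "ACGT")], []),
})
```

With this layout, contigs stay traceable to their genome, and the annotation text for a COG shared by 100 genomes is stored once rather than 100 times. Whether anvi'o should use exactly these tables is of course open; the point is only the shape of the design.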

I think comparative genomics is going to be a significant avenue for us to expand into, and this design bottleneck will only get more limiting over time.

Revisiting the design of the anvi'o contigs database and genomes storage #840