Metabolism estimation for gene cluster bins in a pangenome

ivagljiva commented 8 months ago

As requested by @FlorianTrigodet, this PR adds a new input option to anvi-estimate-metabolism: gene cluster bin collections from pangenomes. The idea behind this is that people can see which pathways are complete within the core and accessory genomes as inferred from the pangenome.

Here is the implementation summary. For the estimation process, we need to extract gene annotations from each bin. Since we treat each gene cluster as a 'gene', we summarize the functional annotations across all genes within the cluster and extract the most popular annotation from each source. The function that does this behind the scenes is dbops.PanSuperclass.init_gene_clusters_functions_summary_dict(). Once we have the list of annotations for each gene cluster bin, we run metabolism estimation as normal. There are two new functions in kegg.py that drive the estimation process for pangenome bins, init_hits_for_pangenome() and estimate_metabolism_for_pangenome_bins().

Note that we don't allow copy number estimation for pangenomes. Since we can use multiple annotation sources for user-defined pathways, this means that each gene cluster can still be associated with more than one function, and multiple synonymous enzyme annotations could overinflate the copy number values -- an issue I just fixed in PR #2176, and don't want to re-introduce. I cannot use the same fix for gene clusters because it depends on gene ID information, which gets complicated for gene clusters. However, if someone downstream really wants to estimate copy number for pangenomes, they should let me know and we can consider the best way to implement that :)

One thing that I did NOT include (yet?) for pangenome input is a check for the modules db hash value, like we have for contigs databases. This is not something that gets stored in the genomes storage DB, and theoretically the hash could be different for each genome in the pangenome, so implementing such a sanity check would get complicated. I also wasn't sure if it was worth adding, considering our planned future updates to pathway definitions that will move away from using the modules db. What this means is that it is possible to estimate metabolism using different KEGG versions than were used to annotate the component genomes in the pangenome.

There is no test case yet for this input mode in anvi-self-test --suite metabolism (though I am currently running that to make sure I didn't break anything else with these changes), since it would require us to create a pangenome first within that script. I considered adding a test in the pangenomics suite instead, but in that case there is no guarantee that KEGG data is set up (we could do it with a small test set of user-defined pathways perhaps). I wonder if it is better to wait until the pathway definition updates are over to revamp the testing.

Feedback welcome :)

FlorianTrigodet commented 8 months ago

Tested and works beautifully!

ivagljiva commented 8 months ago

The metabolism self tests worked, so I'm merging this :)

meren commented 8 months ago

This is awesome. I love how anvi-estimate-metabolism is branching into every part of the codebase. WHAT IS NEXT??! Estimating conserved metabolic modules as a function of branching patterns in phylogenomic trees??!!11

ivagljiva commented 7 months ago

Hmm, now that you mention it @meren , that seems like an interesting direction ;)

merenlab / anvio

Metabolism estimation for gene cluster bins in a pangenome #2177