merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Improving contig/MAG abundance calculations by storing total number of reads in metagenomes in profile databases #1613

Open tdelmont opened 3 years ago

tdelmont commented 3 years ago

The need

Get the relative abundance of MAGs across samples. We have a lot of confusion (rightfully in my opinion) within the anvi'o community about the abundance numbers in the summary output.

The solution

Offer the option to incorporate a text file with sample.name as column A and number.of.metagenomic.reads as column B into PROFILE.db.

Then, during summary, the relative abundance of MAGs would be calculated properly. That field would be left blank for those who did not import this table. A warning during anvi-summarize would be nice to remind the user of this option.
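To make the idea concrete, here is a minimal sketch of the calculation, assuming a tab-delimited file with the sample name in the first column and the total number of metagenomic reads in the second. The function names and the mapped read counts are hypothetical and not part of anvi'o. Samples missing from the table get `None`, i.e. they are left blank.

```python
import csv
import io


def read_total_reads(handle):
    """Parse a two-column, tab-delimited table: sample name, total reads."""
    totals = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[1].strip().isdigit():
            continue  # skip blank lines and a header row, if present
        totals[row[0].strip()] = int(row[1])
    return totals


def relative_abundance(mapped_reads, total_reads):
    """Fraction of each sample's reads that mapped to a MAG.

    Samples without a known total are reported as None (left blank).
    """
    result = {}
    for sample, mapped in mapped_reads.items():
        total = total_reads.get(sample)
        result[sample] = mapped / total if total else None
    return result


# Hypothetical input; in practice this would be read from a file on disk.
table = io.StringIO(
    "sample.name\tnumber.of.metagenomic.reads\n"
    "S01\t1000000\n"
    "S02\t2500000\n"
)
totals = read_total_reads(table)

# Hypothetical number of reads mapped to one MAG in each sample.
mapped = {"S01": 50000, "S02": 25000, "S03": 1000}
abundance = relative_abundance(mapped, totals)
```

Here `S03` has no entry in the table, so its abundance stays empty rather than being computed against a wrong denominator.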

Beneficiaries

All of those doing genome-resolved metagenomics or metapangenomics, and interested in the portion of reads that mapped to genes or genomes.

Just a thought, based on a recent comment in the anvi'o slack.

Best

Tom

meren commented 3 years ago

When people use the anvi'o Snakemake workflows, the actual number of reads is stored automatically in the misc data tables of single and merged profile databases, under the name total_num_reads.

The same mechanism can be used to import misc data layers information via the program anvi-import-misc-data.
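For those who want to prepare such a table by hand, a rough sketch of writing one follows. The helper name and the first-column header `samples` are assumptions about the expected layout, not something confirmed in this thread, so please check the anvi-import-misc-data documentation before relying on it. Only the column name total_num_reads comes from the comment above.

```python
import csv
import io


def write_layers_table(handle, total_reads, first_column="samples"):
    """Write one row per sample with its total number of reads.

    `first_column` is an assumed header name for the sample column.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([first_column, "total_num_reads"])
    for sample in sorted(total_reads):
        writer.writerow([sample, total_reads[sample]])


# Hypothetical read counts; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
write_layers_table(buffer, {"S02": 2500000, "S01": 1000000})
layers_text = buffer.getvalue()
```

Sorting the sample names keeps the output stable between runs, which makes the file easier to diff.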

We could implement a new routine in anvi-summarize that takes those numbers into consideration to calculate abundance once and for all. Actually, we could even use the same for anvi-profile.

I will leave this here with this information, but I think this will be post-v7.

Thank you for submitting this, Tom! :)