merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Better calculation of mean coverage Q2Q3 #2366

Closed meren closed 3 weeks ago

meren commented 3 weeks ago

A while ago, Julian Torres-Morales and Jessica Mark Welch brought to our attention that the way anvi-summarize cuts corners for performance reasons, while aggregating values from different anvi'o views for a given bin in a collection, ruins the calculation of mean_coverage_Q2Q3 values.

In essence, the problem lies in the fact that, in its quick-and-dirty approach, anvi'o takes the mean_coverage_Q2Q3 value of every single split in a bin, weights each value by the split length, and averages them to come up with a single value for the entire bin. This works beautifully for mean_coverage, but not for mean_coverage_Q2Q3, since the latter requires the nucleotide-level coverage values of all splits to be aggregated into a single array, from which the mean coverage Q2Q3 is then calculated from scratch.
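To make the difference concrete, here is a minimal sketch in plain Python. It is not anvi'o's code: the function names are hypothetical, and it assumes that "mean coverage Q2Q3" means the mean of the coverage values that remain after sorting them and dropping the lowest and highest quarter. The toy bin has one short, well-covered split and one longer split with no coverage.

```python
from statistics import fmean


def q2q3_mean(coverage):
    """Mean of the values left after dropping the lowest and highest quarter.

    Assumption: this is an illustrative definition of mean coverage Q2Q3,
    not a copy of anvi'o's implementation.
    """
    values = sorted(coverage)
    q = len(values) // 4
    return fmean(values[q:len(values) - q])


def shortcut_q2q3(splits):
    """Quick approach: length-weighted average of per-split Q2Q3 means."""
    total_length = sum(len(s) for s in splits)
    return sum(q2q3_mean(s) * len(s) for s in splits) / total_length


def careful_q2q3(splits):
    """Careful approach: pool all nucleotide-level values, then compute Q2Q3."""
    pooled = [c for s in splits for c in s]
    return q2q3_mean(pooled)


def shortcut_mean(splits):
    """Length-weighted average of per-split means (exact for plain means)."""
    total_length = sum(len(s) for s in splits)
    return sum(fmean(s) * len(s) for s in splits) / total_length


# Hypothetical bin: a 4 nt split at 10X and a 12 nt split with no coverage.
splits = [[10] * 4, [0] * 12]

print(shortcut_mean(splits))   # 2.5, identical to the mean of the pooled values
print(shortcut_q2q3(splits))   # 2.5
print(careful_q2q3(splits))    # 0.0
```

Under these assumptions the weighted shortcut reproduces the plain mean exactly, but it reports a Q2Q3 value of 2.5 where the pooled calculation gives 0.0, because the covered split falls entirely into the discarded upper quarter once all values are sorted together.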

This PR implements a solution through a new flag for anvi-summarize, --calculate-Q2Q3-carefully. When it is declared, anvi'o does not cut any corners, trading longer compute time for increased accuracy.

We thank Julian and Jessica for their patience! :)

meren commented 3 weeks ago

I tested this on multiple datasets. In cases where contigs in bins were well covered across all samples, the flag had a negligible impact on the final results. But when detection and coverage values were more patchy, I did observe considerable deviation from the original values with the flag, which is good news.