Closed meren closed 3 weeks ago
I tested this on multiple datasets. In cases where contigs in bins were well covered across all samples, the flag had a negligible impact on the final results. But when detection and coverage values were more patchy, I did observe considerable deviation from the original values with the flag, which is good news.
A while ago Julian Torres-Morales and Jessica Mark Welch brought to our attention that the way `anvi-summarize` cuts corners for performance reasons while aggregating values from different anvi'o views for a given bin in a collection ruins the way `mean_coverage_Q2Q3` values are calculated.

In essence, the problem lies in the fact that in its quick-and-dirty approach, anvi'o takes every single `mean_coverage_Q2Q3` value for every single split in a bin, assigns weights to each value based on the split length, and then averages the weighted values to come up with a single value for the entire bin. This works beautifully for `mean_coverage`, but not for `mean_coverage_Q2Q3`, since the latter in fact requires all nucleotide-level coverage values for each split to be aggregated in a single array to calculate the mean coverage Q2Q3 from scratch.

This PR implements a solution through a new flag for `anvi-summarize`, `--calculate-Q2Q3-carefully`. When it is declared, anvi'o does not cut any corners, trading longer compute time for increased accuracy.

We thank Julian and Jessica for their patience! :)
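
For anyone curious why length-weighted averaging is fine for `mean_coverage` but not for `mean_coverage_Q2Q3`, here is a minimal, self-contained sketch. It is not the anvi'o implementation: the function names and the toy coverage values are hypothetical, and it assumes the Q2Q3 mean is the mean of what remains after sorting the coverage values and dropping the lowest and highest quarter.

```python
def mean_q2q3(coverages):
    """Mean of the middle two quartiles (ASSUMED definition, see above):
    sort the values, drop the lowest and highest 25%, average the rest."""
    values = sorted(coverages)
    q = len(values) // 4
    middle = values[q:len(values) - q]
    return sum(middle) / len(middle)


def mean(coverages):
    return sum(coverages) / len(coverages)


def length_weighted(stat, splits):
    """Quick approach: compute the statistic per split, then average the
    per-split values using split lengths as weights."""
    total_length = sum(len(s) for s in splits)
    return sum(stat(s) * len(s) for s in splits) / total_length


def pooled(stat, splits):
    """Careful approach: pool all nucleotide-level coverage values of the
    bin into one array and compute the statistic from scratch."""
    return stat([c for s in splits for c in s])


# Hypothetical bin with two splits of equal length: one evenly covered,
# one patchy (mostly zero coverage with a short, highly covered stretch).
splits = [[10] * 8, [0] * 6 + [100] * 2]

# The plain mean is identical either way ...
mean_quick = length_weighted(mean, splits)
mean_careful = pooled(mean, splits)

# ... but the Q2Q3 mean is not.
q2q3_quick = length_weighted(mean_q2q3, splits)
q2q3_careful = pooled(mean_q2q3, splits)

print(mean_quick, mean_careful)   # 17.5 17.5
print(q2q3_quick, q2q3_careful)   # 5.0 7.5
```

The two approaches agree for the plain mean because a length-weighted average of per-split means is algebraically the same as the mean of the pooled values. Quartile trimming does not distribute over splits that way, so the gap only shows up when coverage is uneven between splits.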