merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
413 stars 142 forks source link

[BUG] Potential computation in `anvi-get-pn-ps-ratio` is incorrect #2195

Open apcamargo opened 6 months ago

apcamargo commented 6 months ago

Short description of the problem

When anvi'o computes the potential of a given codon, it does so by evaluating whether mutations in one position (that is, with Hamming distance of 1) generate synonymous or non-synonymous codons. However, the Nei-Gojobori method takes into account the distance between pairs of codons when computing potentials. That is, if you have the ACC codon in the reference and a ATT variant, the potential of ACC is computed by evaluating all possible mutations paths between those two codons.

You can find an implementation of the Nei-Gojobori method here.

anvi'o version

Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.12

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Linux 4.18.0-425.3.1.el8.x86_64 x86_64

anvi'o executed via Docker.

ekiefl commented 6 months ago

Hi @apcamargo,

Thanks for the bug report. I agree that this might be an improvement relative to the current method.

Ramblings

I'm embarrassed to say that I'm having trouble understanding the code that I wrote so many years ago, but let's start from CodonsEngine.calc_pN_pS (from within anvio.variabilityops):

https://github.com/merenlab/anvio/blob/4eb57670bd87949935fdc979b1e9d2a1b73bad94/anvio/variabilityops.py#L2647-L2722

If we implemented the Nei-Gojobori method, it seems like we would need to change the calculation of synonymous and non-synonymous fractions.

In the current implementation, every codon allele contributes either to the number of synonymous differences, or the number of non-synonymous differences. The amount it contributes is defined by the allele's frequency.

In contrast, the Nei-Gojobori method says that double- or triple-nucleotide differences contribute twice or three times as much to the number of synonymous and non-synonymous differences, and they have the capacity to contribute to both simultaneously. From the paper:

image

2196

Scope of changes

From what I can tell, implementing the Nei-Gojobori method requires changing how we calculate the number of s- and ns-differences, not how we calculate s- and ns-potentials. I found this worth mentioning because in your post you refer to potentials in a way that is different to how I define potentials in the codebase:

https://github.com/merenlab/anvio/blob/4eb57670bd87949935fdc979b1e9d2a1b73bad94/anvio/utils.py#L1646-L1657

Do you agree with me on this point? If so, I believe that calc_synonymous_fraction and it's delegate, _calculate_synonymous_fraction, are the only functions that would need to be refactored. Here is calc_synonymous_fraction:

https://github.com/merenlab/anvio/blob/4eb57670bd87949935fdc979b1e9d2a1b73bad94/anvio/variabilityops.py#L2558-L2606

calc_synonymous_fraction essentially synthesizes the required data from self into numerical arrays that are passed to _calculate_synonymous_fraction, a just-in-time compiled workhorse, which returns the fraction of synonymous and non-synonymous components.

Based on what I understood from reading the paper, we need to implement the following:

Unknowns

I don't understand this, but I think it may play an important role in their method (which is definitely missing from anvi'o's current implementation)

image

apcamargo commented 6 months ago

If I understand it correctly, that formula is to account for silent substitutions. It takes the observed distance between two sequences and computes the estimated distance assuming a uniform substitution rate.