Open apcamargo opened 6 months ago
Hi @apcamargo,
Thanks for the bug report. I agree that this might be an improvement relative to the current method.
I'm embarrassed to say that I'm having trouble understanding the code that I wrote so many years ago, but let's start from CodonsEngine.calc_pN_pS (from within anvio.variabilityops):
If we implemented the Nei-Gojobori method, it seems like we would need to change the calculation of synonymous and non-synonymous fractions.
In the current implementation, every codon allele contributes either to the number of synonymous differences, or the number of non-synonymous differences. The amount it contributes is defined by the allele's frequency.
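To make the current behaviour concrete, here is a minimal sketch of that "one allele, one bucket" bookkeeping. It is not anvi'o's actual code: the function name `frequency_weighted_differences`, the dictionary input, and the comparison against a single reference codon are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def frequency_weighted_differences(reference, allele_freqs):
    """Hypothetical helper: each non-reference codon allele adds its whole
    frequency to exactly one of the two totals, whatever its nucleotide
    distance from the reference."""
    syn = nonsyn = 0.0
    for codon, freq in allele_freqs.items():
        if codon == reference:
            continue  # the reference allele is not a difference
        if CODON_TABLE[codon] == CODON_TABLE[reference]:
            syn += freq
        else:
            nonsyn += freq
    return syn, nonsyn


# GTA (Val) is synonymous with GTT (Val); TTT (Phe) is not.
print(frequency_weighted_differences("GTT", {"GTT": 0.7, "GTA": 0.2, "TTT": 0.1}))
```

Note that a two-nucleotide allele contributes the same way as a one-nucleotide allele here; only its frequency matters.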
In contrast, the Nei-Gojobori method says that double- or triple-nucleotide differences contribute twice or three times as much to the number of synonymous and non-synonymous differences, and they have the capacity to contribute to both simultaneously. From the paper:
From what I can tell, implementing the Nei-Gojobori method requires changing how we calculate the number of s- and ns-differences, not how we calculate s- and ns-potentials. I found this worth mentioning because in your post you refer to potentials in a way that is different to how I define potentials in the codebase:
Do you agree with me on this point? If so, I believe that calc_synonymous_fraction and its delegate, _calculate_synonymous_fraction, are the only functions that would need to be refactored. Here is calc_synonymous_fraction:
calc_synonymous_fraction essentially synthesizes the required data from self into numerical arrays that are passed to _calculate_synonymous_fraction, a just-in-time compiled workhorse, which returns the fraction of synonymous and non-synonymous components.
Based on what I understood from reading the paper, we need to implement the following:
Each step in each mutational path contributes to either an s- or an ns-difference (our current method only captures this if the allele differs by a single nucleotide)
The "contribution" of each allele should be twice as much for alleles with 2-nucleotide differences and thrice as much for alleles with 3-nucleotide differences. This is based on the fact that in their GTT -> GTA, s_d + n_d = 1, whereas in their TTT -> GTA example, s_d + n_d = 2
I don't understand this, but I think it may play an important role in their method (which is definitely missing from anvi'o's current implementation)
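The first two points above (classifying every step of every mutational path, with totals that scale with the number of differing nucleotides) can be sketched roughly as follows. This is a standalone illustration, not a drop-in replacement for _calculate_synonymous_fraction: the name `count_differences` is hypothetical, and skipping paths that pass through an intermediate stop codon is an assumption about how such paths should be handled.

```python
from itertools import permutations, product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_differences(codon_a, codon_b):
    """Average the synonymous (s_d) and non-synonymous (n_d) steps over all
    orderings of the single-nucleotide changes that turn codon_a into codon_b."""
    diff_positions = [i for i in range(3) if codon_a[i] != codon_b[i]]
    syn_total = nonsyn_total = 0
    n_paths = 0
    for order in permutations(diff_positions):
        current, syn, nonsyn, valid = codon_a, 0, 0, True
        for pos in order:
            nxt = current[:pos] + codon_b[pos] + current[pos + 1:]
            # Assumption: discard paths whose intermediate codon is a stop.
            if nxt != codon_b and CODON_TABLE[nxt] == "*":
                valid = False
                break
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
        if valid:
            n_paths += 1
            syn_total += syn
            nonsyn_total += nonsyn
    if n_paths == 0 or not diff_positions:
        return 0.0, 0.0
    return syn_total / n_paths, nonsyn_total / n_paths


print(count_differences("GTT", "GTA"))  # one step, s_d + n_d = 1
print(count_differences("TTT", "GTA"))  # two steps, s_d + n_d = 2
```

For TTT -> GTA the two paths are TTT -> GTT -> GTA (one ns step, one s step) and TTT -> TTA -> GTA (two ns steps), so the averages come out to s_d = 0.5 and n_d = 1.5. In an anvi'o context these per-allele values would presumably still be weighted by allele frequency, but that part is not shown here.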
If I understand it correctly, that formula is to account for silent substitutions. It takes the observed distance between two sequences and computes the estimated distance assuming a uniform substitution rate.
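Assuming the formula under discussion is the Jukes-Cantor correction that the Nei-Gojobori method applies to the proportion of synonymous (or non-synonymous) differences, d = -(3/4) ln(1 - 4p/3), a small sketch of it looks like this. The function name `jukes_cantor_distance` is made up for illustration.

```python
import math


def jukes_cantor_distance(p):
    """Convert an observed proportion of differences p into an estimated
    number of substitutions per site, assuming all substitutions are
    equally likely (the Jukes-Cantor model)."""
    if not 0.0 <= p < 0.75:
        # The logarithm is undefined once p reaches 3/4 (saturation).
        raise ValueError("p must be in the interval [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.01, 0.1, 0.3):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as p grows, which reflects repeated changes at the same site that a direct comparison cannot see.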
Short description of the problem
When anvi'o computes the potential of a given codon, it does so by evaluating whether mutations in one position (that is, with Hamming distance of 1) generate synonymous or non-synonymous codons. However, the Nei-Gojobori method takes into account the distance between pairs of codons when computing potentials. That is, if you have the ACC codon in the reference and an ATT variant, the potential of ACC is computed by evaluating all possible mutational paths between those two codons.
You can find an implementation of the Nei-Gojobori method here.
anvi'o version
System info
anvi'o executed via Docker.