apcamargo commented 6 months ago

Short description of the problem

When anvi'o computes the potential of a given codon, it does so by evaluating whether mutations in one position (that is, with Hamming distance of 1) generate synonymous or non-synonymous codons. However, the Nei-Gojobori method takes into account the distance between pairs of codons when computing potentials. That is, if you have the ACC codon in the reference and a ATT variant, the potential of ACC is computed by evaluating all possible mutations paths between those two codons.

You can find an implementation of the Nei-Gojobori method here.

anvi'o version

Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.12

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Linux 4.18.0-425.3.1.el8.x86_64 x86_64

anvi'o executed via Docker.

ekiefl commented 6 months ago

Hi @apcamargo,

Thanks for the bug report. I agree that this might be an improvement relative to the current method.

Ramblings

I'm embarrassed to say that I'm having trouble understanding the code that I wrote so many years ago, but let's start from CodonsEngine.calc_pN_pS (from within anvio.variabilityops):

https://github.com/merenlab/anvio/blob/4eb57670bd87949935fdc979b1e9d2a1b73bad94/anvio/variabilityops.py#L2647-L2722

If we implemented the Nei-Gojobori method, it seems like we would need to change the calculation of synonymous and non-synonymous fractions.

In the current implementation, every codon allele contributes either to the number of synonymous differences, or the number of non-synonymous differences. The amount it contributes is defined by the allele's frequency.

In contrast, the Nei-Gojobori method says that double- or triple-nucleotide differences contribute twice or three times as much to the number of synonymous and non-synonymous differences, and they have the capacity to contribute to both simultaneously. From the paper:

2196

Scope of changes

From what I can tell, implementing the Nei-Gojobori method requires changing how we calculate the number of s- and ns-differences, not how we calculate s- and ns-potentials. I found this worth mentioning because in your post you refer to potentials in a way that is different to how I define potentials in the codebase:

https://github.com/merenlab/anvio/blob/4eb57670bd87949935fdc979b1e9d2a1b73bad94/anvio/utils.py#L1646-L1657

Do you agree with me on this point? If so, I believe that calc_synonymous_fraction and it's delegate, _calculate_synonymous_fraction, are the only functions that would need to be refactored. Here is calc_synonymous_fraction:

https://github.com/merenlab/anvio/blob/4eb57670bd87949935fdc979b1e9d2a1b73bad94/anvio/variabilityops.py#L2558-L2606

calc_synonymous_fraction essentially synthesizes the required data from self into numerical arrays that are passed to _calculate_synonymous_fraction, a just-in-time compiled workhorse, which returns the fraction of synonymous and non-synonymous components.

Based on what I understood from reading the paper, we need to implement the following:

Each step in each mutational path contributes to either a s- or ns-difference (our current method only captures this if the allele differs by a single nucleotide)
The "contribution" of each allele should be twice as much for alleles with 2-nucleotide differences and thrice as much for alleles with 3-nucleotide differences. This is based on the fact that in their GTT -> GTA, s_d + n_d = 1, whereas in their TTT -> GTA example, s_d + n_d = 2

Unknowns

I don't understand this, but I think it may play an important role in their method (which is definitely missing from anvi'o's current implementation)