New GeometricSeqPhenotype class

thienktran commented 2 years ago

Description

GeometricSeqPhenotype is a subclass of GeometricPhenotype. Along with the parameters that determines the position of the object in Euclidean space, it stores nucleotide sequence information and the number of epitope and non-epitope mutations in fields. Only the nucleotide sequence is being stored to improve space and time complexity. The translation of a nucleotide sequence to a protein sequence can be done after the simulation. However, this does not mean amino acids don't play an important role in our model. For each site in the sequence, a matrix of vectors is precomputed before the simulation runs. The vectors are drawn from a gamma distribution, whose parameters can be changed in parameters.yml. The number of epitope sites, which ranges from 0 to (the starting sequence length / 3), is parameterized in that file as well. Epitope sites remain the same until the program exits. Epitope sites have corresponding matrices with vectors drawn from a gamma distribution with different parameters than non-epiope sites.

mutate()

This method randomly selects an index in the nucleotide sequence to mutate. Based on the transition-transversion ratio, a cumulative sum distribution array is used to determine what the nucleotide will mutate to. If the mutant amino acid is a stop codon, the process repeats starting with randomly selecting an index to mutate. Once a valid mutation occurs, where the object moves in space is determined by the vectors in the site's matrix at entry m, n which represents the index of the wild type and mutant amino acids in Parameters.AMINO_ACIDS, respectively. this does not update. Instead, a new GeometricSeqPhenotype with the updated nucleotide sequence, new position parameters, and number of epitope and non-epitope mutations is created. There must be a mutation if this method is called, so the nucleotide sequence must be different and the number of epitope mutations xor non-epitope mutations must be updated by 1. However, the new position parameters might not change since a nucleotide mutation does not necessarily cause a mutation in the protein sequence.

Closes #17

Tests

GeometricSeqPhenotype’s representation invariant, a condition that must be true throughout an object’s existence, is verified throughout an entire simulation using the debug flag in GeometricSeqPhenotype.java. Whenever a method of the object is called, the representation invariant is checked to confirm that the nucleotide sequence doesn’t change in length (point mutation rather than frameshift mutation) or contain any stop codons.

The JUnit tests in TestGeometricSeqPhenotype.java are used to make sure the constructors and methods of GeometricSeqPhenotype.java are working as expected. GeometricSeqPhenotype():

getTraits()
getSequence()
distance()
mutate()
riskOfInfection()
toString()

The values returned by the constructors and methods are tested against the results calculated by hand. These JUnit tests only review logic errors. For example, testMutate() makes sure that a nucleotide at the given site in the nucleotide sequence is mutating to the given nucleotide. It does not test that the ratio of the number of transitions to the number of transversions.

In order to do sanity checks that Antigen is actually taking biology and statistics into account, we instead have to visualize data from the simulation separately.

Transition-Transversion Ratio

The transition-transversion ratio can be specified in parameters.yml. The default value is 5.0. Each nucleotide mutation in a simulation is recorded and saved in a CSV file, mutations.csv. mutations.csv has one column with the following format: XY, where X is the wild type nucleotide and Y is the mutant nucleotide.

The graph below shows the frequency of each possible nucleotide mutation. Notice the number of transition mutation occurs more frequently than the number of transversion mutation. The calculated ratio is 5.087. It’s not exactly 5.0 for various reasons. The starting sequence doesn’t contain the same number of each nucleotide, and some mutations cause stop codons so it must be mutated again.

Epitope and Non-epitope sites ~Gamma

The effects of an epitope or non-epitope mutation can be specified in parameters.yml. For each amino acid site’s corresponding matrix of vectors, the mutation notation, size of vector, and theta of each entry are recorded and saved in a CSV file, test/valuesGammaDistribution/0_siteX.csv where X is the amino acid site number.

The graph for non-epitope sites (meanStep: 0.0001 and sdStep: 0.0001) show that mutations that occur in non-epitope sites don’t move the phenotype very far in antigenic space.

The graph for epitope sites (meanStep: 2.0 and sdStep: 1.0) are slightly different for each amino acid site, which is what we want. All epitope site distributions are consistent. (The orange line is the distribution used in the original Antigen and is used for reference).

Checklist:

[x] The code uses informative and accurate variable and function names
[x] The functionality is factored out into functions and methods with logical interfaces
[x] Comments are up to date, document intent, and there are no commented-out code blocks
[x] Commenting and/or documentation is sufficient for others to be able to understand intent and implementation
[x] TODOs have been eliminated from the code
[x] The corresponding issue number (e.g. #278) has been searched for in the code to find relevant notes
[ ] Documentation has been redeployed

(Sorry about all the nontrivial updates. Here’s a reminder on how to hide white space changes).

Haddox commented 2 years ago

@thienktran: just finished. Nice job! A few other suggestions:

do we need to include avg_prefs.csv in the PR? Does this need to be tracked on GitHub? If you'd like to include it, could you please move it to a directory for storing input data (e.g., 'data/')
remember to rename needle in hay stack script :)

Could you please let @matsen know when you've finished addressing the above comments, so that he knows when to start looking things over? Thanks!

Haddox commented 2 years ago

@thienktran: just a reminder to please look into the odd pattern we were seeing, where some sites get mutated a lot less than others. Our hypothesis is that these sites correspond to codons that are a single nucleotide mutation away from a stop codon. Likely, for the ones that are most problematic, this mutation is a transition mutation.

Haddox commented 2 years ago

Please see issue #23 before merging this PR.

Haddox commented 2 years ago

@thienktran: looks good to me. Thanks for making that fix!

thienktran commented 2 years ago

Hi, @matsen! Would you mind looking over this PR for us? Thank you!

matsen commented 2 years ago

Sure! Did adding the HKY model slow things down a lot? Are the sequences the bottleneck?

thienktran commented 2 years ago

@matsen, we have not started implementing the HKY model yet. I have a few suggestions on how to implement Hugh's idea and as well as yours. Hugh, Zorian, and I still need to chat to figure out which idea makes the most sense to go forward with.

The sequences are not the bottlenecks in this version. It will probably be an issue once we implement the HKY model though.

matsengrp / antigen