Inferring parameters of evolutionary models from allele frequency data
This pipeline ingests .tfrecord files generated by C++ simulations from a model of evolutionary dynamics.
The goal is to solve an "inverse problem" (inferring the parameters of a causal model/generative process from data).
In a typical inverse problem, one uses maximum likelihood or Bayesian inference to find the parameter values that minimise the divergence between i.i.d. samples (the data) and a (causal) model's predictions.
The twist here is that the data are not i.i.d. samples from the same generative process but rather a single time series (one stochastic sample) from that process.
This time series is a realisation of one of many possible trajectories (sequences of frequencies) that a trait could take in a population over time.
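To make the idea of "one realisation out of many" concrete, here is a minimal sketch in Python. It is not the C++ simulator used by this pipeline: it assumes a toy Wright-Fisher model with selection, and the function name and the parameters `pop_size`, `selection` and `start_freq` are hypothetical, chosen only for illustration. Each run continues until the trait is lost (frequency 0) or fixed (frequency 1).

```python
import random

def simulate_trajectory(pop_size, selection, start_freq, rng, max_steps=1_000_000):
    """Toy Wright-Fisher run (an assumed stand-in for the real simulator).

    Returns the list of trait frequencies, one per time step, up to absorption.
    """
    freq = start_freq
    trajectory = [freq]
    for _ in range(max_steps):
        if freq == 0.0 or freq == 1.0:  # absorbing states: extinction or fixation
            break
        # Selection shifts the expected frequency in the next generation.
        expected = freq * (1 + selection) / (1 + selection * freq)
        # Drift: each of the pop_size individuals carries the trait with that probability.
        carriers = sum(rng.random() < expected for _ in range(pop_size))
        freq = carriers / pop_size
        trajectory.append(freq)
    return trajectory

# Same parameter values, five independent runs, five different trajectories.
rng = random.Random(0)
runs = [
    simulate_trajectory(pop_size=50, selection=0.05, start_freq=0.1, rng=rng)
    for _ in range(5)
]
for run in runs:
    outcome = {0.0: "extinct", 1.0: "fixed"}.get(run[-1], "unresolved")
    print(f"length={len(run)} outcome={outcome}")
```

Running this shows that identical parameter values produce trajectories of different lengths and different outcomes, which is why a single trajectory carries only noisy information about the parameters.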
The evolutionary model has parameter values that can be set. For each combination of parameter values, a simulation of a trait's evolution is run multiple times (independently). Each stochastic run ends in one of two absorbing states: the trait can (i) go extinct (frequency of 0) or (ii) become fixed (frequency of 1). These trajectories are highly stochastic and their lengths vary widely, from a single time step up to a theoretical maximum of 1,000,000 time steps.
The goal is to train an LSTM on these stochastic trajectories so that it learns to infer the parameter values of the evolutionary model from the time-series data. The LSTM is trained on 600 combinations of the evolutionary model's parameters (chosen by Latin hypercube sampling). For each of the 600 combinations there are 50,000 stochastic trajectories of trait frequencies over time. Each of these 50,000 trajectories represents one outcome (out of many possible outcomes) that a trait could take while evolving.
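As an illustration of how parameter combinations can be drawn by Latin hypercube sampling, here is a minimal standard-library sketch. The function name and the two parameter ranges in `bounds` are hypothetical, not the ranges used in this project. The idea is that each parameter's range is cut into as many equal-width strata as there are samples, and every stratum is used exactly once per parameter.

```python
import random

def latin_hypercube(n_samples, bounds, rng):
    """Return n_samples points; each dimension has exactly one point per stratum."""
    columns = []
    for low, high in bounds:
        # One uniform draw inside each of the n_samples equal-width strata of [0, 1).
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are paired at random across dimensions.
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    # Transpose the per-parameter columns into one tuple per sample.
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

# Hypothetical parameter ranges, for illustration only.
bounds = [(0.0, 0.1), (100.0, 10_000.0)]
samples = latin_hypercube(600, bounds, random.Random(1))
print(len(samples), samples[0])
```

Compared with drawing each combination uniformly at random, this spreads the 600 combinations evenly along every parameter axis. In practice a library routine such as `scipy.stats.qmc.LatinHypercube` would be a reasonable alternative to hand-written code.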
To evaluate the LSTM's performance, I measure the mean absolute error (MAE) of the LSTM's estimates of the evolutionary model's parameters. These errors are averaged over the stochastic trajectories from random combinations of the evolutionary model's parameter values (the test set contains 200 combinations, each of which has 50,000 trajectories). When fed to the LSTM, each of the evolutionary model's parameters is normalised to the range 0 to 1; thus the MAE expresses the mean error as a fraction of each parameter's range (e.g. an MAE of 0.03 means that the estimates of the evolutionary model's parameters were off by 3% of the range on average).
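The relationship between normalisation and the MAE can be checked with a small worked example. The helper names, the parameter ranges in `bounds` and the true and predicted values below are all made up for illustration; they are not results from this project.

```python
def normalise(value, low, high):
    """Map a parameter value from [low, high] onto [0, 1]."""
    return (value - low) / (high - low)

def mean_absolute_error(true_rows, predicted_rows):
    """Average absolute difference over all rows and all parameters."""
    errors = [
        abs(t - p)
        for true_row, predicted_row in zip(true_rows, predicted_rows)
        for t, p in zip(true_row, predicted_row)
    ]
    return sum(errors) / len(errors)

# Hypothetical ranges for two parameters, plus invented true and predicted values.
bounds = [(0.0, 0.1), (100.0, 1100.0)]
true_params = [(0.05, 600.0), (0.02, 300.0)]
predicted_params = [(0.053, 570.0), (0.017, 330.0)]

def normalise_rows(rows):
    return [
        tuple(normalise(v, low, high) for v, (low, high) in zip(row, bounds))
        for row in rows
    ]

mae = mean_absolute_error(normalise_rows(true_params), normalise_rows(predicted_params))
print(round(mae, 4))  # close to 0.03, i.e. 3% of each parameter's range
```

Here every estimate is off by 3% of its parameter's range (0.003 out of 0.1, and 30 out of 1000), so the normalised MAE comes out at about 0.03 even though the raw errors are on very different scales.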