joshuachristie / timeseries-inference

Inferring parameters of evolutionary models from allele frequency data

finalise conditioning (and randomising) scheme for train/test #8

Closed joshuachristie closed 3 years ago

joshuachristie commented 3 years ago

In an ideal world, I could train directly on the raw outputs. But for particularly short trajectories (e.g. the allele goes extinct in one or two generations), there's simply nothing to work with. While I could just accept that accuracy for these cases will be low no matter what I choose, including them might impair accuracy for longer trajectories (i.e. it might be better to condition on alleles having survived for x generations). The major advantage of training on the raw outputs is that I could train a single model; if I instead conditioned on having survived for x generations, then I'd need to train a separate model for every value of x that I choose---otherwise the train set won't be properly balanced.
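To make the balance problem concrete, here's a minimal sketch (the Wright-Fisher-style simulator and all parameter values are my own assumptions for illustration, not the repo's C++ code) showing how conditioning on survival for x generations unbalances a training set that started with equal replicates per parameter value:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trajectory(s, p0=0.05, N=1000, max_gen=100):
    """Crude haploid selection + drift sketch: frequency of an allele with
    selection coefficient s; the trajectory ends at loss or fixation."""
    p, traj = p0, [p0]
    for _ in range(max_gen):
        p_exp = p * (1 + s) / (1 + s * p)  # deterministic selection step
        p = rng.binomial(N, p_exp) / N     # binomial drift step
        traj.append(p)
        if p == 0.0 or p == 1.0:
            break
    return traj

# Equal numbers of replicates per parameter value before conditioning...
params = {"neutral": 0.0, "selected": 0.05}
data = {k: [simulate_trajectory(s) for _ in range(5000)] for k, s in params.items()}

# ...but conditioning on survival for x generations unbalances the classes,
# and the degree of imbalance changes with x (hence one model per x).
for x in (2, 10, 25):
    counts = {k: sum(len(t) > x for t in v) for k, v in data.items()}
    print(f"x={x}: {counts}")
```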

Ultimately, this comes down to how I (or rather, an investigator) decide to pick a particular evolutionary trajectory and ask "what is the function of this allele's effect?"

A second consideration is how to choose the windows of time. I could just consider the entire time series, but ideally I would randomly choose a region of the time series. This would mean that one wouldn't need to observe the allele's evolution from the beginning. It's not obvious how feasible this is, though, given that some trajectories are already highly truncated (that is to say, randomising where the window starts seems more feasible if I condition on trajectories that have survived for x generations).
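For what it's worth, the window randomisation itself is simple; the problem is entirely the truncated trajectories. A minimal sketch (the window length is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_window(traj, window_len):
    """Return a random contiguous window of the trajectory, or None if the
    trajectory is too short to contain one (the truncation problem above)."""
    if len(traj) < window_len:
        return None
    start = rng.integers(0, len(traj) - window_len + 1)
    return traj[start:start + window_len]

# A trajectory that went extinct after 3 generations has no 10-generation
# window -- conditioning on survival is what makes this scheme feasible.
print(random_window([0.05, 0.08, 0.02, 0.0], window_len=10))        # None
print(random_window(list(np.linspace(0.05, 0.6, 30)), window_len=10))
```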

joshuachristie commented 3 years ago

The first step is clearly defining a set of criteria for the question that a hypothetical investigator would ask about a particular instance of evolution. The first part (the question itself) is quite straightforward: given a trait T whose phenotype is affected by allele A (via effect F), the investigator, who is observing the frequency of A over time, will ask something like "what is the biological function of F with respect to T?" (inferred from the time series of A). The second part (how the investigator came to be asking about this particular instance) is less straightforward, however.

Consider the following: why did the investigator choose that evolutionary instance? Was it a randomly selected example from an experiment? Was it an observation from nature? If the latter, why did the investigator choose that observation (pseudo-randomly, because it was "interesting", etc.)? This matters because how an investigator selects an evolutionary instance has implications for non-random conditioning on model parameters and/or realised outcomes. It is crucial that I match how I test the performance of the model with how I develop the model.

It's important that my thinking is crystal clear here, so I'm going to work through a few examples to highlight the potential problems. I will compare an investigator choosing an evolutionary trajectory using four different approaches: (i) choosing an "interesting" observation; (ii) choosing a pseudo-random observation; (iii) choosing a random outcome of an experiment; or (iv) choosing a random outcome from a subset of experimental outcomes (where the subset comes from conditioning on a predefined criterion).

For (i), the investigator will be biased towards evolutionary instances that have persisted for a long period of time, as these are the traits we find "interesting" (i.e. biased towards realised trajectories that are longer than expected and towards parameter values that give higher fixation probabilities).

For (ii), the investigator will again be biased towards evolutionary instances that have persisted for a longer-than-average period of time, though not to the same degree as in (i). Here the bias comes from the fact that the probability of observing an evolutionary instance is proportional to how long it has persisted: you can't observe something that has gone extinct.

For (iii), the investigator should theoretically be unbiased, but most examples will be uninteresting due to the nature of evolutionary outcomes (very short time series that quickly go extinct).

For (iv), the investigator is explicitly biased towards evolutionary instances that have persisted for longer than average, but they can specify a priori what counts as an interesting evolutionary instance. (Note that (i) and (iv) arguably get at the same point, but (iv) is explicit about the bias due to selection of "interesting" cases.)

For (i), there's no way of accounting for the bias (unless the investigator is explicit about defining "interesting", in which case we have (iv)).

For (ii), it depends on the specific selection scheme, but we might assume that the probability of selecting a trajectory generated by a particular set of parameter values is proportional to its expected length. The probability of choosing a specific trajectory from that set will also depend on the variance of trajectory lengths for those parameter values.
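A quick way to see the size of this effect is a length-biased sampling simulation (the proportionality assumption is the one above; the geometric distribution of trajectory lengths is a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical realised trajectory lengths for one parameter set.
lengths = rng.geometric(p=0.1, size=100_000)  # mean length ~10

# (iii)-style uniform sampling vs (ii)-style length-biased sampling,
# where P(observe a trajectory) is proportional to its realised length.
biased = rng.choice(lengths, size=100_000, p=lengths / lengths.sum())
print(f"uniform mean length:      {lengths.mean():.1f}")  # ~10
print(f"length-biased mean length: {biased.mean():.1f}")  # ~19 (inflated)
```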

For (iii), we have no biases, but many of the instances we choose will be mutations that quickly go extinct. Even leaving aside the fact that these are less interesting, they are problematic because they have a lower signal-to-noise ratio. This has a couple of implications. First, I'll never get accurate predictions for these short trajectories. For example, if a mutation goes extinct in a single generation then I simply have two data points: there are no statistical properties here for differentiating between parameter values. And although I can "work with" trajectories that persist for at least one generation, there's obviously very little statistical information in a time series of length 3 (or 4 or 5, etc.). Second, by including these short trajectories (read: noisy data with a minute or nonexistent signal), I will presumably worsen the performance of my model on longer trajectories. It's a bit as if I added a bunch of white noise to the training set. Furthermore, those parameter value combinations that lead to a low persistence probability will be dominated by these short trajectories during training, further worsening model performance on longer trajectories (these parameter combinations will only have a handful of trajectories with a reasonable signal-to-noise ratio).

For (iv), I'm effectively adding a bias (via conditioning) to (iii). This hurts generality but improves the quality of the training set. For example, I could condition identically on trajectories for both train and valid data (e.g. only train on trajectories that persist for at least 10 generations and likewise only try to predict trajectories that persist for at least 10 generations). It hurts generality because the investigator needs to set the threshold of what is considered "interesting", and the model is optimised around that decision. Of course, it doesn't prevent an investigator from predicting shorter trajectories---e.g. if I conditioned on length >= 10, I could still make a prediction for a time series of length 5---but the model is not being explicitly optimised for these predictions, so the prediction probably won't be very accurate (though ultimately it isn't going to be accurate in any case). Obviously, by conditioning on lengths >= 10, every example in the train set will have survived at least 5 generations---so there is plenty of data from which to make an inference for the length-5 series. However, this dataset is biased: it conditions on trajectories that survive 5 generations AND also survive 10 generations, which filters out all the trajectories that survive 5 generations BUT NOT 10 generations (and if all we know is that our trajectory of interest has persisted for 5 generations, then we don't know whether it will persist for 10 generations).
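As a sketch of what this conditioning looks like in practice (the trajectories below are dummy stand-ins for the simulation output, and the threshold is the example value from above):

```python
# Hypothetical splits produced upstream by the simulation pipeline;
# each trajectory is a list of allele frequencies starting at generation 0.
raw_train = [[0.05, 0.0], [0.05] + [0.1] * 15, [0.05] + [0.2] * 9]
raw_valid = [[0.05, 0.07, 0.0], [0.05] + [0.3] * 20]

def condition_on_survival(trajectories, min_gens):
    """Keep only trajectories that persisted for at least min_gens
    generations; gen 0 is the first entry, so length must exceed min_gens."""
    return [t for t in trajectories if len(t) > min_gens]

MIN_GENS = 10  # the a priori "interesting" threshold from (iv)

# The same condition is applied to train and valid, so the model is
# evaluated under the assumptions it was optimised for.
train = condition_on_survival(raw_train, MIN_GENS)
valid = condition_on_survival(raw_valid, MIN_GENS)
print(len(train), len(valid))  # the short trajectories are filtered out
```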

Now, some of the above is overkill for my particular situation. The probability that I choose a trajectory need not be influenced by the realised trajectory length. For example, if I sample a random trajectory, then so long as the number of replicates is balanced, my probability of choosing a sample is not affected by the length of the trajectory. So some of the concerns about (i) and (ii) pointed out above do not apply (though it's worth thinking through them, as they would apply to a real-life investigator observing a species and choosing a trait to study---I should make this point somewhere in the discussion of the paper).

So how to proceed?

First, I think I should simplify things by considering the entirety of each time series from generation 0. Ideally, I could make predictions for longer trajectories in the test set starting from a random generation, but I think this is overcomplicating things. I can view it more as running an experiment (in which the investigator can observe from the beginning) as opposed to an observational study (in which the observer might start partway through the spread of a trait). Of course, as with shorter trajectories, there's nothing stopping me from making predictions on time series that start from a random generation, and it might work quite well, but I don't think I should specifically try to optimise for this (and I suspect there would be a trade-off, in that much more of the train set would comprise examples that are rarely encountered at prediction time).

So that just leaves how to choose the threshold for inclusion in the train/test sets (i.e. the minimum number of generations that a trajectory must persist). There is a trade-off here. I want to avoid making it too high, as this might exclude some potentially "interesting" trajectory lengths, will increase the time for the C++ simulations to run and the DL models to train, and will increase the output file sizes. But I also want to avoid making it too low, as this might reduce the accuracy of the model. Ultimately I'll have to experiment a bit---but generally speaking, I'll want to choose a threshold at which the prediction accuracy starts to plateau. While I'm playing around with the models, I think I can just choose something like 10 generations.
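Something like the following sweep should do it (train_and_evaluate is a hypothetical hook into the pipeline, and the accuracy curve it returns is faked purely to illustrate the plateau rule):

```python
def train_and_evaluate(min_gens):
    """Placeholder for the real pipeline (simulate -> condition -> train ->
    test); here a fake accuracy curve that plateaus, for illustration only."""
    return round(0.9 - 0.4 / min_gens, 3)

thresholds = [2, 5, 10, 15, 20, 30]
accuracies = {x: train_and_evaluate(x) for x in thresholds}

# Choose the smallest threshold at which accuracy has roughly plateaued
# (e.g. improvement over the previous threshold < 1 percentage point),
# trading accuracy off against simulation time and output file size.
prev = None
for x in thresholds:
    acc = accuracies[x]
    if prev is not None and acc - prev < 0.01:
        print(f"plateau starts around min_gens={x} (accuracy {acc})")
        break
    prev = acc
```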

joshuachristie commented 3 years ago

This consideration is much less important now that I have reworked the data pipeline: it's very easy for me to play around with this threshold. I think the best approach will be to show how the accuracy changes as I consider shorter/longer sequences. Of course, I can still predict shorter sequences than what I trained on. So I might consider reporting accuracy on test sequences under the same conditional assumptions (e.g. only those trials surviving x generations) and separately reporting how well the model does on shorter sequences (maybe even separately showing accuracy for survival of gen 1, 2, 3, 4, etc.).
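A sketch of that stratified reporting (the test triples below are random stand-ins; in practice the predictions would come from the trained model):

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical test set: (trajectory, true_label, predicted_label) triples.
test = [(rng.random(n + 1).tolist(), rng.integers(2), rng.integers(2))
        for n in rng.geometric(p=0.1, size=1000)]

# Report accuracy separately for each survival length (gen 1, 2, 3, ...),
# so performance on sequences shorter than the training condition is visible.
by_length = defaultdict(list)
for traj, y_true, y_pred in test:
    by_length[len(traj) - 1].append(y_true == y_pred)

for gens in sorted(by_length)[:10]:
    hits = by_length[gens]
    print(f"survived {gens:>2} gens: n={len(hits):>4}, acc={np.mean(hits):.2f}")
```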