tskit-dev / tsdate

Infer the age of ancestral nodes in a tree sequence.
MIT License

Assessing best avenues for improved date estimation. #166

Open hyanwong opened 3 years ago

hyanwong commented 3 years ago

It struck me that we have several different routes for improving dating: some involve simply improving the topologies, some involve the tsdate algorithm itself, and so on. It would be very useful to know where best to focus future efforts. For most potential improvements, we can assess (via simulation) how much of a difference it would make if we completely solved the problem. We should try working out the impact of each of these n possible improvements. It may also be that a combination of two or more works better than just one. If we have a decent metric for times (e.g. RMSLE or Spearman's rho), then we can carry out a number of simulations using different demographic models and random seeds, and inspect an n×n matrix of (say) the RMSLE improvement for various combinations.

Here's a list of the n possible ways we can think of to improve the process (I may have missed some):

  1. Topology improvement: get the ancestors in the correct time order. Easy to test - just reorder the inferred ancestors in the "correct" order (take the average time of all the focal sites in that ancestor before inference)
  2. Topology improvement: fix ancestral states. Easy to test - just don't include ancestral state error in the inference
  3. Topology+dating improvement: fix sequencing error. Easy to test - just don't inject genotype error (but keep ancestral state polarization error)
  4. Topology improvement: don't overestimate the number of recombination events. Relatively easy to test: find the correct breakpoints in the simulated TS and increase the recombination rate values at those physical locations. It may be worth trying this with and without the default adjustment for physical distance between sites.
  5. Dating improvement: properly account for loopy BP. Not sure how to test this
  6. Dating improvement: include unary nodes. We're unsure if this will be an improvement or not, especially as some of these in our inferred TS seem to be too long.
  7. Dating improvement: variable Ne over time. Harder to test, as we haven't implemented this yet.
  8. Dating improvement: sort out the upper end of the distribution. Easy to test on a small dataset, as we can simply add a load more timeslices at the oldest times and see how much of a difference this makes to the average RMSLE (or whatever metric we use).
  9. Dating improvement: recombination clock. Harder to test, as we haven't implemented this yet. Gil had the idea of dating some events using a pairwise approach like the version implemented in GEVA: I think this is worth exploring. Alternatively, for testing purposes only, we could keep the recombination breakpoints in the simulation and somehow pass those to the Inside/Outside algorithm (although this is hard if we are using an inferred topology).
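
The pairwise comparison over the list above could be organised roughly as follows. This is only a sketch: the labels in `IMPROVEMENTS`, the `score` callable, and `toy_score` with its made-up numbers are all hypothetical placeholders. A real `score` would run the simulate, infer, and date pipeline with the given idealised fixes applied and return an error such as RMSLE.

```python
from itertools import combinations

# Hypothetical short labels for the nine candidate improvements listed above.
IMPROVEMENTS = [
    "ancestor_order", "ancestral_state", "genotype_error", "recombination",
    "loopy_bp", "unary_nodes", "variable_ne", "old_timeslices", "recomb_clock",
]


def improvement_matrix(score, improvements):
    """Return an n x n matrix of error reductions relative to the baseline.

    `score(fixes)` is a user-supplied (hypothetical) callable taking a
    frozenset of fix labels and returning an error, where lower is better.
    Entry [i][i] is the gain from fix i alone; [i][j] is the gain from
    applying fixes i and j together.
    """
    n = len(improvements)
    baseline = score(frozenset())
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = baseline - score(frozenset([improvements[i]]))
    for i, j in combinations(range(n), 2):
        gain = baseline - score(frozenset([improvements[i], improvements[j]]))
        matrix[i][j] = matrix[j][i] = gain
    return matrix


def toy_score(fixes):
    """Placeholder error model with made-up numbers, for illustration only."""
    return 1.0 - 0.05 * len(fixes)


matrix = improvement_matrix(toy_score, IMPROVEMENTS)
```

In practice each entry would presumably be averaged over several random seeds and demographic models, and an off-diagonal gain larger than the sum of the two diagonal gains would point to fixes that only pay off in combination.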

awohns commented 3 years ago

This is a great framework, thanks @hyanwong. Here's the status on each:

  1. In vanilla simulations, the iterative approach gets us about halfway to the accuracy of having perfectly ordered ancestors. This is more complicated for simulations produced with a recombination map. We haven't sorted out why that is yet, but we should.
  2. Preliminary view on this is that ancestral state errors are harder to find than empirical error using parsimony. Should explore other avenues to finding these.
  3. Looks like iteration after identifying genotype errors provides a significant accuracy improvement
  4. Haven't looked into this one
  5. Seems like the loops might not be that big an issue after all, especially if we don't have small loops
  6. Requires improvements in tsinfer first. Preliminary investigations seem to indicate, as you say, that the issue is the long ancestors in tsinfer and that tsdate is handling unary nodes correctly. Can test by simulating with unary nodes
  7. Haven't looked into it. Shouldn't be hard to test once we have it. We'd be estimating N(t) too. Could do variable mutation rates too if we wanted
  8. I'm almost positive the issue isn't a lack of older timeslices; it's the prior at MRCAs. As we've noted, we're correct marginally, but we don't take into account the length of ancestors. Since old tsinfer ancestors are too long, this would mostly help with simulated trees for now.
  9. Worth reading this: https://www.biorxiv.org/content/10.1101/2020.12.24.424361v1.full.pdf

hyanwong commented 3 years ago

  1. In vanilla simulations, the iterative approach gets us about halfway to the accuracy of having perfectly ordered ancestors.

Ah - I forgot we had actually done the perfectly ordered case. Great. We should have somewhere to collect these metrics.

  2. Preliminary view on this is that ancestral state errors are harder to find than empirical error using parsimony. Should explore other avenues to finding these.

Do we know how much better we do when we don't have ancestral state errors in the simulation? I should have checked this but I'm not sure I did.

  3. Looks like iteration after identifying genotype errors provides a significant accuracy improvement

Yes, but how much better do we do when we have no errors injected?

  9. Worth reading this: https://www.biorxiv.org/content/10.1101/2020.12.24.424361v1.full.pdf

Ah. I had read it, but didn't think of it as immediately relevant to using the recombination clock. I should look again.

swamidass commented 3 years ago

I'm nearly done with an implementation of item 7 (variable Ne over time). It turns out there is a straightforward way to do this with the existing API. Would that implementation be of any interest?

awohns commented 3 years ago

@swamidass, definitely! That sounds fantastic. Feel free to make a pull request with your changes.

swamidass commented 3 years ago

Alright. I'll see what I can do. At least consider giving me a middle authorship if it turns out to be helpful :)

awohns commented 3 years ago

Sounds good. And of course, credit will most certainly go where it's due! Right now I don't have firm plans for a study using variable-Ne based inference, but there are of course many applications where this would be very useful. Happy to arrange a video chat to discuss this at some point if you want.

swamidass commented 3 years ago

Let me see about getting a clean implementation with test code to you in the next week or so. If it makes sense to chat then, we can.