Open hyanwong opened 3 years ago
This is a great framework, thanks @hyanwong. Here's the status on each:
- In vanilla simulations, the iterative approach gets us about halfway to the accuracy of having perfectly ordered ancestors.
Ah - I forgot we had actually done the perfectly ordered case. Great. We should have somewhere we collect these metrics.
- Preliminary view on this is that ancestral state errors are harder to find than empirical error using parsimony. Should explore other avenues to finding these.
Do we know how much better we do when we don't have ancestral state errors in the simulation? I should have checked this but I'm not sure I did.
- Looks like iterating after identifying genotype errors provides a significant accuracy improvement.
Yes, but how much better do we do when we have no errors injected?
- Worth reading this: https://www.biorxiv.org/content/10.1101/2020.12.24.424361v1.full.pdf
Ah. I had read it, but didn't think of it as immediately relevant to using the recombination clock. I should look again.
I'm nearly complete with an implementation of #7. Turns out there is a straightforward way to do this with the existing API. Would that implementation be of any interest?
@swamidass, definitely! That sounds fantastic. Feel free to make a pull request with your changes
Alright. I'll see what I can do. At least consider giving me a middle authorship if it turns out to be helpful :)
Sounds good. And of course, credit will most certainly go where it's due! Right now I don't have firm plans for a study using variable-Ne based inference, but there are of course many applications where this would be very useful. Happy to arrange a video chat to discuss this at some point if you want.
Let me see about getting a clean implementation with test code to you in the next week or so. If it makes sense to chat then, we can.
It struck me that we have several different routes for improving dating: some are to do with simply improving the topologies, some to do with the tsdate algorithm, etc. It would be very useful to know where best to focus future efforts. For most potential improvements, we can assess (via simulation) how much of a difference it would make if we completely solved the problem. We should try working out the impact of each of these n possible improvements, so we know where to focus our efforts. It may also be that a combination of 2 or more works better than just one. If we have a decent metric for times (e.g. RMSLE or Spearman's rho) then we can carry out a number of simulations using different demographic models and random seeds, and inspect an n × n matrix of (say) the RMSLE improvement for various combinations.
Here's a list of the n possible ways we can think to improve the process (I may have missed some)
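As a rough illustration of the matrix idea above, here is a minimal sketch. Everything named in it is an assumption for illustration: `rmsle` uses the `log(1 + t)` convention, `improvement_matrix` and `toy_score` are hypothetical helpers rather than part of tsdate or tsinfer, and the gain values are made-up placeholders, not results. In practice `score` would run simulations over several demographic models and seeds with the given set of problems "perfectly solved" and return the mean RMSLE.

```python
import math


def rmsle(true_times, inferred_times):
    """Root mean squared log error between true and inferred node times.

    Assumes the log(1 + t) convention; other definitions exist.
    """
    if len(true_times) != len(inferred_times):
        raise ValueError("time lists must have the same length")
    squared = [
        (math.log1p(t) - math.log1p(i)) ** 2
        for t, i in zip(true_times, inferred_times)
    ]
    return math.sqrt(sum(squared) / len(squared))


def improvement_matrix(names, score):
    """Build an n x n matrix of RMSLE improvements over the baseline.

    `score` is a caller-supplied (hypothetical) function that takes a
    frozenset of improvement names and returns the RMSLE obtained when
    those problems are fully solved in simulation.
    Entry [i][j] is baseline minus the score with improvements i and j
    applied together; the diagonal holds each improvement on its own.
    """
    baseline = score(frozenset())
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            gain = baseline - score(frozenset({names[i], names[j]}))
            matrix[i][j] = matrix[j][i] = gain
    return matrix


def toy_score(fixes):
    """Placeholder scorer with made-up additive gains (not real results)."""
    gains = {"topology": 0.3, "ancestral_state": 0.1, "genotype_error": 0.2}
    return 1.0 - sum(gains[f] for f in fixes)


names = ["topology", "ancestral_state", "genotype_error"]
matrix = improvement_matrix(names, toy_score)
for name, row in zip(names, matrix):
    print(name, [round(x, 3) for x in row])
```

With a real scorer, off-diagonal entries that exceed the sum of the two corresponding diagonal entries would point to combinations that work better together than separately.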