tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 70 forks source link

Per-tree-sequence polytomy resolution method #2673

Open hyanwong opened 1 year ago

hyanwong commented 1 year ago

I was chatting with @a-ignatieva at PopGroup56 and she said she had a couple of use-cases for a method that randomly broke polytomies in an entire tree sequence, but didn't do so tree-by-tree. I had some code to do this (initial example in https://github.com/tskit-dev/tskit/issues/809), but @jeromekelleher had reasons for wanting to make the random-polytomy-breaking routine independent from tree-to-tree (see https://github.com/tskit-dev/tskit/pull/815#issuecomment-730710610)

Perhaps Ana could comment in this issue about why she wanted a function that worked over the entire TS rather than the tree? I would have found it useful too (see https://github.com/tskit-dev/tskit/pull/815#issuecomment-744035334), so it may be worth implementing if there is a wider a call for it. I suspect that my implementation was overwritten when the PR (https://github.com/tskit-dev/tskit/pull/815) was updated to perform only a single-tree resolution, but I guess the logic in https://github.com/tskit-dev/tskit/issues/809#issue-687510678 is roughly what I used.

a-ignatieva commented 1 year ago

Thanks @hyanwong! My use cases involved wanting to take a ts with polytomies, split the polytomies randomly, and then run tsdate on it to get branch lengths. My understanding is that currently you can split polytomies by tree, but this is then quite inefficient and slow for the time estimation step. Also breaking them up tree-by-tree obviously ruins the idea of edge span along the genome, which I wanted to preserve in the no-polytomies ts.

hyanwong commented 1 year ago

My use case was comparing the treewise KC distance with and without random polytomy splitting. Again, doing this per-tree is inefficient when using the incremental TreeSequence.kc_distance function, so much so that it takes up more time than the inference process itself.

There might be some issue here with non-independence among newly introduced (polytomy-splitting) nodes, so I don't know if the variance in total tree sequence KC distance will be biased in any way, but I think the mean will be unbiased under multiple replicates of applying a TS-wide polytomy splitting method.

jeromekelleher commented 1 year ago

My original concerns about the lack of a model here still stand: in the single tree case we have the fairly simple and well understood case of randomly sampling a binary tree uniformly from the set of all binary trees. For the ARG case I'm not sure there is any research on this, so I'm pretty reluctant to add functionality to tskit that's "original research" and not very excited about the potential for doing that research ourselves in the process of adding this functionality.

To me, this is a simulation problem --- take an ARG node that has > k children and insert a randomly generated ARG with k samples under some model. The only model I know of is the coalescent, since I don't think anyone has described a combinatorial model for ARGs. Nailing down what the properties would be needed of the parent ARG (does it need to be a "full" ARG or can it be an msprime-like one?) to enable this is also something that would need to be sorted out.

These are interesting research problems - but consequently aren't something we are going to resolve in an GitHub issue.

hyanwong commented 1 year ago

Thanks @jeromekelleher - that's a really useful clarification. I guess @a-ignatieva might want to comment. My take is that if it's stopping people doing useful things with our tree sequences then it's something that is worth thinking about.

I implemented a version to do this because I needed to for testing the use of the KC distance as an ARG comparison metric. A halfway house (and this has come up before) would be to store the code to do this somewhere, but not in the tskit library itself. This also applies to other "useful functions" too, which I often end up re-implementing (possibly badly) having not realised that there exists an implementation online somewhere that isn't in the core tskit library. I guess this turns into an issue of discoverability (and testability) of code snippets to do particular tasks. I have been putting them into the GitHub discussions as a "show and tell" item, but at the moment that doesn't seem very discoverable.

a-ignatieva commented 1 year ago

Thanks @jeromekelleher, all good points. I guess in my applications the sorts of things I commonly want to do involve working out some quantities under (an approximation to) the coalescent with recombination, and then taking inferred ARGs and checking them against the theory. But then the polytomies clearly need to be dealt with somehow - and for cases where I care about the whole ARG rather than each local tree independently, breaking them up tree-by-tree doesn't do the trick. But I get your point that the right method of doing this is probably an open problem - just feels like an important one to me for working with tsinfer type ARGs.

jeromekelleher commented 1 year ago

I think it's a pretty big open question in general how you make an ARG that comes out of Relate or tsinfer correspond to the classical output-of-the-coalescent-with-recombination process object. Since dealing with polytomies is either pretty easy (just stop making assumptions about having two children in code) or very difficult (e.g., what is the likelihood of something with a polytomy under a model that assumes binary trees) I see very little value in polytomy resolution algorithms --- it's just pulling stuff out of a hat.

I know this is unhelpful, sorry, but I think there are deep questions to be answered here and providing easy---but probably wrong---solutions is not something tskit should do.

hyanwong commented 1 year ago

If we have code snippets that do useful things, which we want to point to in tskit but not be part of the library, then perhaps we could provide a link within the docstring (in this case, the one for tree.resolve_polytomies). For example, we could link to a GitHub discussions Q&A "How can I break polytomies efficiently over a whole tree sequence". I suggest a Q&A because then we could mark something as an answer, but if the routine is improved, we could mark an improved submission as a better answer.

I don't know if linking from the docstring implies a maintenance burden, but I think we already have links within docstrings to e.g. GitHub issues, so maybe this is a reasonable halfway house which would aid discovery of these additional non-core routines?

hyanwong commented 1 year ago

I know this is unhelpful, sorry, but I think there are deep questions to be answered here and providing easy---but probably wrong---solutions is not something tskit should do.

FWIW, I don't see tskit tree sequences as having to conform to any particular model, as long as they represent ancestry of some description. For example, we are happy to use tskit to encode ancestries produced by forwards simulations even if wildly unbiological, or to encode trees like tskit.Tree.generate_balanced(1000). Whether or not we can generate something valid or invalid under the CwR isn't really a tskit worry, is it? In my book, tskit shouldn't even "know" about specific models or sampling from them: that's the role of msprime and friends.

jeromekelleher commented 1 year ago

Well exactly - that's why we shouldn't get into the business of doing these simulations in tskit. Generating random topologies must be done under some model, and tskit isn't a simulator (beyond the simple combinatorial ones you mention).

To be clear, I regard "polytomy breaking" as equal to "simulating ancestry under some model".

a-ignatieva commented 1 year ago

Fair point - I guess I thought of it as more in the remit of tsinfer (just a post-processing step to force the ARG to be binary), rather than tskit itself.

hyanwong commented 1 year ago

I still think a Q&A answer linked from the Tree.spilt_polytomies function is a low cost way of making something useable without it being in the core library.