hyanwong / treeseq-inference

Work for the tree sequence inference paper.
Apache License 2.0

How do stats deal with polytomies (created by ts-infer) #5

Open hyanwong opened 5 years ago

hyanwong commented 5 years ago

From @hyanwong on January 14, 2017 9:28

This follows on from the conversation at https://github.com/mcveanlab/treeseq-inference/issues/10#issuecomment-272417263, in which @jeromekelleher pointed out that the reconstructed tree sequences (TS) often contain polytomies. To recap, our options are:

  1. Hack the output to insert zero-length branches and force binary trees (ugly, but practical).
  2. Implement tree metrics in msprime that deal with non-binary trees properly (clean, but time-consuming) [not entirely sure what Jerome means here].
  3. Randomly resolve polytomies and take the average of the distance metric over all, or a sample of, the resolved trees (advantage: we can use all the standard metrics).
  4. Use a metric that performs "properly" (for our purposes) when confronted with polytomies, i.e. one that gives the average of the metric over all possible binary resolutions.

Note that (1) is only useful if we use a metric that takes branch lengths into account. Options (2) and (4) require investigating the available metrics, and (3) increases the variance of the metric.
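
To make option (3) concrete, here is a minimal sketch in plain Python. It is not the code used in this repository. Everything in it is an assumption made for illustration: trees are written as nested tuples of leaf labels, the metric is a rooted Robinson-Foulds distance counted over clades, and the helper names (`clades`, `rf_distance`, `resolve_randomly`, `mean_resolved_distance`) are hypothetical.

```python
import random


def clades(tree):
    """Return the non-root clades (frozensets of leaf labels) of a rooted tree."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):  # a leaf label
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    root = walk(tree)
    found.discard(root)  # the root clade is shared by all trees on the same leaves
    return found


def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: number of clades found in only one tree."""
    return len(clades(t1) ^ clades(t2))


def resolve_randomly(tree, rng):
    """Break every polytomy by repeatedly joining two randomly chosen children.

    Assumption: this simple scheme is NOT uniform over binary topologies
    once a polytomy has four or more children.
    """
    if not isinstance(tree, tuple):
        return tree
    children = [resolve_randomly(child, rng) for child in tree]
    while len(children) > 2:
        i, j = sorted(rng.sample(range(len(children)), 2))
        second = children.pop(j)  # pop the larger index first
        first = children.pop(i)
        children.append((first, second))
    return tuple(children)


def mean_resolved_distance(inferred, true_tree, n_samples=100, seed=1):
    """Average the metric over a sample of random resolutions of `inferred`."""
    rng = random.Random(seed)
    total = sum(
        rf_distance(resolve_randomly(inferred, rng), true_tree)
        for _ in range(n_samples)
    )
    return total / n_samples


inferred = (("a", "b", "c"), "d")          # contains one trichotomy
true_tree = ((("a", "b"), "c"), "d")       # fully resolved
estimate = mean_resolved_distance(inferred, true_tree, n_samples=200)
```

Because the result is a Monte Carlo average, it varies with the sample of resolutions drawn, which is the extra variance mentioned above. A larger `n_samples` reduces it at the cost of more metric evaluations.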

Copied from original issue: mcveanlab/treeseq-inference#11

hyanwong commented 5 years ago

Caroline Colijn says:

Hi Yan

Our metric might allow polytomies (package treescape) -- it certainly does in principle.

If that doesn't work, you could use any of the branch length ones with the polytomy-resolving branches having length 0 and the other branches in your tree set to length 1, i.e. (a) set all the branch lengths to 1 and then (b) resolve polytomies at random (so that the descending previously-unresolved branch lengths are 0) and then use a length-dependent metric. This won't do the averaging that you want, but should handle the polytomy in the correct way.

cheers
Caroline
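
A rough sketch of that suggestion, for illustration only. Caroline does not name a specific metric, so the branch score distance used here (sum of absolute branch-length differences over clades) is an assumption, as are the nested-tuple tree format and the hypothetical helpers `resolve_with_lengths` and `branch_score`. Terminal branches are ignored because they would have length 1 in both trees.

```python
import random


def resolve_with_lengths(tree, rng):
    """Resolve polytomies at random and return {clade: branch length}.

    Clades present in the input tree get length 1.0; clades created only
    to resolve a polytomy get length 0.0. The root clade is omitted.
    """
    lengths = {}

    def walk(node):
        if not isinstance(node, tuple):  # a leaf label
            return frozenset([node])
        parts = [walk(child) for child in node]
        while len(parts) > 2:
            # Join two randomly chosen children under a zero-length branch.
            a = parts.pop(rng.randrange(len(parts)))
            b = parts.pop(rng.randrange(len(parts)))
            joined = a | b
            lengths[joined] = 0.0
            parts.append(joined)
        clade = frozenset().union(*parts)
        lengths[clade] = 1.0
        return clade

    root = walk(tree)
    lengths.pop(root, None)
    return lengths


def branch_score(lengths1, lengths2):
    """Sum of absolute branch-length differences; a missing clade counts as 0."""
    all_clades = set(lengths1) | set(lengths2)
    return sum(
        abs(lengths1.get(c, 0.0) - lengths2.get(c, 0.0)) for c in all_clades
    )


inferred = (("a", "b", "c"), "d")          # contains one trichotomy
true_tree = ((("a", "b"), "c"), "d")       # fully resolved

truth = resolve_with_lengths(true_tree, random.Random(0))   # all lengths 1.0
estimate = resolve_with_lengths(inferred, random.Random(0))
distance = branch_score(estimate, truth)
```

With this particular metric the zero-length branches contribute nothing, so in this example the distance is the same whichever random resolution is drawn. That matches the remark that the approach handles the polytomy sensibly without doing any averaging.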

On 13 January 2017 at 12:08, Yan Wong yan@yanwong.me wrote:

Hi Caroline - thanks for the chat a while back. We’ve just hit a small snag with tree metrics, and I wonder if you could provide some pointers? The problem is that our reconstruction algorithm produces polytomies. Since the trees are defined for small sections of the genome, they should be resolved, but presumably we do not have enough information to do so - that is, they are "soft" polytomies.

We are comparing our reconstructions to other algorithms that produce a likelihood-weighted set of fully resolved trees. So I think we probably want to use a distance metric which allows polytomies, and produces a result which is the average of the metric when calculated over all possible resolutions of the polytomies. I see that the RF metric does not have this property (see R code below). Do you know of any metrics that do? It would be ideal if there was both a rooted and unrooted metric too, but for the moment we don’t care about branch lengths, so a topology-only measure is fine.

Thanks a lot if you can help. Do let me know if this doesn’t make sense.

Yan
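
The averaging property asked for in the email can be checked exhaustively on a small case. The sketch below is in Python rather than the R referred to above and is not the original code. It assumes nested-tuple trees and a rooted, clade-based RF distance, and the helper names (`binary_joins`, `all_resolutions`) are hypothetical. It enumerates every binary resolution of a tree and compares the mean RF distance with the RF distance computed directly on the unresolved tree.

```python
from fractions import Fraction
from itertools import combinations, product


def clades(tree):
    """Return the non-root clades (frozensets of leaf labels) of a rooted tree."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    found.discard(walk(tree))
    return found


def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: number of clades found in only one tree."""
    return len(clades(t1) ^ clades(t2))


def binary_joins(items):
    """Yield every rooted binary tree whose tips are the given subtrees."""
    if len(items) == 1:
        yield items[0]
        return
    first, rest = items[0], items[1:]
    # Keep `first` on the left so each unordered split is produced once,
    # and always leave at least one item for the right-hand side.
    for k in range(len(rest)):
        for chosen in combinations(range(len(rest)), k):
            left = [first] + [rest[i] for i in chosen]
            right = [rest[i] for i in range(len(rest)) if i not in chosen]
            for l_tree in binary_joins(left):
                for r_tree in binary_joins(right):
                    yield (l_tree, r_tree)


def all_resolutions(tree):
    """Yield every fully binary resolution of a tree with polytomies."""
    if not isinstance(tree, tuple):
        yield tree
        return
    child_options = [list(all_resolutions(child)) for child in tree]
    for children in product(*child_options):
        yield from binary_joins(list(children))


polytomy = (("a", "b", "c"), "d")
true_tree = ((("a", "b"), "c"), "d")

distances = [rf_distance(r, true_tree) for r in all_resolutions(polytomy)]
mean_over_resolutions = Fraction(sum(distances), len(distances))  # 4/3
rf_unresolved = rf_distance(polytomy, true_tree)                  # 1
```

In this example the three resolutions have distances 2, 0 and 2 from the true tree, so the mean is 4/3, whereas RF applied to the unresolved tree gives 1. Exhaustive enumeration is only feasible for small polytomies, since the number of binary resolutions grows very quickly with the number of children.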

hyanwong commented 5 years ago

Option (3) is now implemented, but I'm leaving this issue open for the other points.