tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

add Robinson-Foulds distance(s) #995

Open grahamgower opened 3 years ago

grahamgower commented 3 years ago

Something like this: https://gist.github.com/grahamgower/8ca50b9d889e4691d0faca7ace93dac2

hyanwong commented 3 years ago

A +++ vote for this from me! This should complement the kc_distance() method that we currently have. Eventually we might hope to have versions that are windowed too, see #996

hyanwong commented 3 years ago

Also note that although often (and rightly) criticised for lack of sensitivity and counterintuitive behaviour, in practice it actually performs quite well for ARG comparisons: https://doi.org/10.1093/sysbio/syu085 (esp fig 2):

"The RF metric has been criticized for too rapidly reaching a plateau, but in this study it performed well, as did its branch-length variant RFL. Apparently neither recombinational rearrangement nor deteriorating tree inference easily drive RF to its maximum, ... This .. shows that theoretical reasoning about distance metrics is not a good substitute for testing their performance."

We also find that the RF metric comparing 2 tree sequences is quite a good discriminator. It would be useful to know the validity of the metric for trees with polytomies - I have seen conflicting information about this (e.g. the documentation in the R package 'phangorn` claims that polytomies invalidate the RF metric, but in practise they don't seem to, in my experience. Do you know anything about this @grahamgower ?

hyanwong commented 3 years ago

A couple of extra observations. The elegant code posted by @grahamgower doesn't have an L0 (i.e. unweighted, topology only) distance. If I stick it in, I don't seem to get the right answer for the rooted unweighted RF. E.g. I think the following should give a distance of 10.0

>>> ts1, ts2 = msprime.simulate(10, random_seed=1, num_replicates=2)
>>> print(rf_distance(orig_ts, cmp_ts, lambda a, b : branches_pnorm(a, b, 0)))
23.0

My misunderstanding, I'm sure, but worth checking what we are supposed to return.

Also, sadly, doing this for larger trees with recombinations is about 3 times slower than loading the trees into Dendropy and running dendropy.calculate.treecompare.symmetric_difference on each genome segment.

jeromekelleher commented 3 years ago

Also, sadly, doing this for larger trees with recombinations is about 3 times slower than loading the trees into Dendropy and running dendropy.calculate.treecompare.symmetric_difference on each genome segment.

@grahamgower did say the code wasn't efficient, so no surprises there.

grahamgower commented 3 years ago

We also find that the RF metric comparing 2 tree sequences is quite a good discriminator. It would be useful to know the validity of the metric for trees with polytomies - I have seen conflicting information about this (e.g. the documentation in the R package 'phangorn` claims that polytomies invalidate the RF metric, but in practise they don't seem to, in my experience. Do you know anything about this @grahamgower ?

~Maybe this has something to do with rooted vs. unrooted trees? For unrooted trees, you define "splits" of a tree, where you cut the tree on some internal branch, and are left with two sets of leaves. If one of your trees has a polytomy that the other tree doesn't, then those trees will have no splits in common. But for rooted trees, you can define the "splits" of the topology differently, by considering just one side of the split (i.e. the descendent leaves under your cut---ignoring individuals outside of that clade). In this case, you could have polytomies in one tree but not the other, yet still have clades in common between both trees.~

I don't seem to get the right answer for the rooted unweighted RF.

Sorry @hyanwong, I never used this code for anything important, and so it never actually got tested. Caveat emptor! My code uses the rooted-tree definition, so maybe that's the difference? Maybe I did something else wrong, out of ignorance of the field in general?

Also, sadly, doing this for larger trees with recombinations is about 3 times slower than loading the trees into Dendropy and running dendropy.calculate.treecompare.symmetric_difference on each genome segment.

Luckily, there are some obvious things that could be done to speed this up.

grahamgower commented 3 years ago

Actually, I'm probably just confused with what I said about the polytomies. Pretend I didn't say anything. :)

hyanwong commented 3 years ago

Maybe this has something to do with rooted vs. unrooted trees?

Interesting thought (despite your addendum). Thanks. I might ask some phylogeneticist friends.

Sorry @hyanwong, I never used this code for anything important, and so it never actually got tested. Caveat emptor! My code uses the rooted-tree definition, so maybe that's the difference?

I think my expectation is the rooted difference. But I leave it here just for reference anyway.

grahamgower commented 3 years ago

A couple of extra observations. The elegant code posted by @grahamgower doesn't have an L0 (i.e. unweighted, topology only) distance. If I stick it in, I don't seem to get the right answer for the rooted unweighted RF. E.g. I think the following should give a distance of 10.0

>>> ts1, ts2 = msprime.simulate(10, random_seed=1, num_replicates=2)
>>> print(rf_distance(orig_ts, cmp_ts, lambda a, b : branches_pnorm(a, b, 0)))
23.0

My misunderstanding, I'm sure, but worth checking what we are supposed to return.

Also, sadly, doing this for larger trees with recombinations is about 3 times slower than loading the trees into Dendropy and running dendropy.calculate.treecompare.symmetric_difference on each genome segment.

It seems that using branches_pnorm() was just doing the wrong thing for the special p=0 case. I've updated the gist.

>>> from treedist import *
>>> ts1, ts2 = msprime.simulate(10, random_seed=1, num_replicates=2)
>>> rf_distance(ts1, ts2, lambda a, b : branches_pnorm(a, b, 0))
10.0

If you only care about the unweighted case, then the code that extracts the branch lengths could be chopped out, in addition to not storing the leaf or root nodes.

It would be useful to know the validity of the metric for trees with polytomies

What should be the distance between a binary tree and a star tree? Probably most software will calculate n-2 for n leaves, because of the n-2 internal nodes in the binary tree that won't exist in the star tree. That seems reasonable to me, as it's half the diameter of the tree space.

hyanwong commented 3 years ago

Thanks for the updates @grahamgower . I've just come across this paper: https://doi.org/10.1093/bioinformatics/btaa614 which looks interesting and discusses the RF metric in some detail. As it happens I had a previous correspondence with the author, so I've just sent him an email to see if he has any helpful opinions. It would be interesting to see if we could implement his information-theoretic splits measures. This para in particular caught my eye, as it is relevant to what we are trying to do in tsinfer (and which might be useful for stdpopsim folk):

"In cases where one tree can be deemed ‘correct’—for instance, studies that test the efficacy of phylogenetic methods by analysing datasets simulated on a known tree, or comparisons between inferred trees and a trusted reference topology —a metric may not be desired. Subtracting the similarity score from the information content of all splits in the ‘correct’ tree gives the distance from the ‘correct’ tree to a reconstructed tree, reflecting the amount of information that has been correctly reconstructed."

(NB FWIW I gather that he uses R more than Python, but I assume the algorithms translate easily. It would also be a win for us if tskit ends up as one of the only Python packages that provide these metrics on trees)

jeromekelleher commented 3 years ago

As an aside, it occurs to me that there may be a general framework for tree distance metrics much like the one we have for single site stats. That's a substantial research project, though, of course.

hyanwong commented 3 years ago

Edited and salient parts of an email reply from Martin Smith. I'm speaking to him tomorrow afternoon. Ping me if anyone wants to join the call.

If you are working with trees that contain the same leaf sets, and fewer than c. 4000 leaves, then the R implementation of the information theoretic & RF distances in my CRAN 'TreeDist' package should be suitable for your purposes; the R package performs most of the calculations in C++ code that I've optimised for efficiency as far as I can... For the RF calculation the development version employs a rapid implementation using Day's (1985) linear time algorithm... If your trees are routinely larger than this, then a little adaptation of the C++ would be necessary – I presently use 16-bit integers for speed, and a few other shortcuts, so a few days' work would be needed to handle larger trees.

hyanwong commented 3 years ago

Martin has an accessible talk about his information theoretic approach to RF here: https://www.youtube.com/watch?v=UG04uET679s - I found it pretty helpful.