ms609 / TreeDist

Calculate distances between phylogenetic trees in R
https://ms609.github.io/TreeDist/

Trees with different leaves: #44

Closed: cdp-rna closed this issue 11 months ago

cdp-rna commented 3 years ago

Hi,

Is it possible to analyze trees with different leaf labels? I am interested in the general architecture of the tree rather than the identity of the individuals within...

Thanks, Christina

ms609 commented 3 years ago

Sorry for the slow reply – I missed this comment. It's certainly possible, but I'd have to have a clearer idea of exactly what aspect of similarity you were trying to capture. It sounds like you might be interested in the shape of the tree, but not the relationships that it implies?

Jigyasa3 commented 3 years ago

Hey @ms609

I have a similar question: I am interested in a "global" estimate of how similar two trees with unequal numbers of leaves are. My two trees are a host and its symbiont, with multiple symbionts per host (hence the unequal leaf counts). I want to apply NyeSimilarity() and the Generalized RF method to the two trees, but I have two queries:

a) Can we account for the number of leaves differing between the two trees?

b) I can normalize the distance methods on the host tree, but is it possible to compare the Nye and Generalized RF methods with each other? For example, if the Nye method gives a distance of 0.12 while a Generalized RF method (e.g. ClusteringInfoDistance()) gives a distance of 0.23, how can I generalize the results to say whether the two trees are "more" or "less" concordant with each other?

Looking forward to your reply!

ms609 commented 3 years ago

Hi @Jigyasa3,

The GRF/Nye methods aim to quantify the relationship information between two trees. This is only one aspect of tree similarity – similarity in number of leaves, for example, cannot readily be incorporated into this aspect of tree distance.

When comparing host/symbiont trees, the question boils down to which leaf in the host tree ought to be paired with which leaf in the symbiont tree. Following my intuition, if host A was associated with two symbionts, I'd duplicate leaf A in the host tree, in the same position (using TreeTools::AddTip()). Then, dropping taxa in the host tree that lack symbionts will produce two trees with a 1:1 tip correspondence.
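A minimal sketch of that matching step, using a made-up four-leaf host tree (tip labels are hypothetical):

```r
# Toy example: host A carries two symbionts; host D has no symbiont.
library("ape")        # read.tree(), drop.tip()
library("TreeTools")  # AddTip()

host <- read.tree(text = "((A,B),(C,D));")

# Duplicate leaf A in place, so each symbiont gets its own host leaf
host <- AddTip(host, where = which(host$tip.label == "A"), label = "A_2")

# Drop hosts that lack symbionts to obtain a 1:1 tip correspondence
host <- drop.tip(host, "D")

host$tip.label  # now one host leaf per symbiont leaf
```

The duplicated leaf then pairs with the second symbiont when computing distances.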

The Nye & GRF distances are somewhat correlated, and it should be unusual for tree pairs to be ranked differently between the methods. Ultimately, the two methods capture slightly different aspects of tree similarity; one way to explore where differences might arise is with the VisualizeMatching() function. For an even more complete view, you might also consider calculating quartet distances, which offer a viewpoint complementary to GRF methods (including Nye's).

Hope that goes some way to answering your question, and apologies if I've misunderstood what you were asking – feel free to follow up.

Cheers,

Martin

Jigyasa3 commented 3 years ago

Hey @ms609

Thank you so much for replying! Thanks for suggesting the AddTip() function to match the number of host and symbiont tips; I am doing exactly that. As for the second question, sorry I wasn't clear. I am interested in whether, given a pair of host-symbiont trees (I have at least 50 pairs), the Nye and Generalized RF methods can tell us that one host-symbiont pair is more strongly co-evolving than another.

For example, if I get a Generalized RF value of 0.23 for one pair and 0.45 for another, can I conclude that the second pair is more strongly co-evolving than the first? If not, could you suggest another way to globally compare Generalized RF (and Nye) statistics between trees? By the way, tree1 (i.e. the host) remains the same for all 50 pairs.

Looking forward to your suggestions!

ms609 commented 3 years ago

Hi @Jigyasa3, yes, normalized values are comparable between any pair of trees, regardless of size. If you are calculating the distance with similarity = TRUE, then a pair of trees with a higher value is more similar (and thus denotes more co-evolution) than a pair of trees with a lower value on the same metric.

This said, trees with more leaves are expected to be more different due to chance. If some trees have many more leaves than others, then you might want to normalize similarity/difference values against the expected value of a random pair of n-leaf trees, in order to determine how much of the observed similarity is explained by co-evolution.

All the best,

Martin

Jigyasa3 commented 3 years ago

Hey @ms609

Thank you for your reply and for the suggestion about randomization. I am trying to follow the vignette on random tree distances (https://cran.r-project.org/web/packages/TreeDist/vignettes/using-distances.html#normalizing). May I ask a few questions to understand what this function is doing and how I can apply it to my dataset?

Q1. In the following code, randomTreeDistances comes from data('randomTreeDistances', package = 'TreeDistData'); can we apply it as-is to any tree1 and tree2 combination? Does the expectedCID value remain the same regardless?

expectedCID <- randomTreeDistances['cid', 'mean', '9']
ClusteringInfoDistance(tree1, tree2, normalize = TRUE) / expectedCID

Q2. In another example in the same vignette, you generate a true tree and one-away and three-away degraded trees. If I use my host tree as the "true tree", would the oneAway and threeAway degraded trees generated from it represent the randomized trees?

Example code for Q2:

NyeSimilarity(tree1, tree2, normalize = TRUE, similarity = TRUE)
# [1] 0.317191

trueTree <- tree1

library('TreeSearch')
oneAway <- structure(lapply(seq_len(200), function (x) {
  tbrTree <- TBR(trueTree)
  ape::consensus(list(tbrTree,
                      NNI(tbrTree),
                      NNI(tbrTree),
                      NNI(tbrTree)))
}), class = 'multiPhylo') # seq_len(200): generate 200 random trees

threeAway <- structure(lapply(seq_len(200), function (x) {
  tbrTree <- TBR(TBR(TBR(trueTree)))
  ape::consensus(list(tbrTree, 
                      NNI(NNI(tbrTree)),
                      NNI(NNI(tbrTree)),
                      NNI(NNI(tbrTree))))
}), class = 'multiPhylo')
correct1 <- MutualClusteringInfo(trueTree, oneAway)
correct3 <- MutualClusteringInfo(trueTree, threeAway)

infoInTree1 <- ClusteringEntropy(oneAway)
infoInTree3 <- ClusteringEntropy(threeAway)

unresolved1 <- ClusteringEntropy(trueTree) - infoInTree1
unresolved3 <- ClusteringEntropy(trueTree) - infoInTree3

incorrect1 <- infoInTree1 - correct1
incorrect3 <- infoInTree3 - correct3

Plot:

library(kdensity)
dat1 <- correct1 / infoInTree1
dat3 <- correct3 / infoInTree3
kde1 <- kdensity(dat1)
kdeRange1 <- kdensity:::get_range(kde1)

plot(kdeRange1, kde1(kdeRange1), type = "l")
points(0.317191, kde1(0.317191), col = "red") # output from NyeSimilarity()

It seems like tree2 is not different from randomized trees. What do you think? Is it correct? Would it be possible to get a p-value out of this?

[attached image: density plot of correct1 / infoInTree1, with the observed NyeSimilarity value marked in red]

ms609 commented 3 years ago

A1: The random distances have been generated by calculating the distances between many pairs of random trees. As the CID is not very influenced by tree shape, this expected value should be suitable for any pair of trees. If tree1 is constant, then you could calculate your own expected value with:

library("TreeTools")
tree1 <- BalancedTree(8) # insert your own tree here
nRep <- 100 # Use more replicates for a more accurate estimate of the expected value
randomTrees <- lapply(logical(nRep), function (x) RandomTree(tree1$tip.label))
randomDists <- ClusteringInfoDistance(tree1, randomTrees, normalize = TRUE)
expectedCID <- mean(randomDists)

This value shouldn't differ much from the expected value between any random pair of trees, but if precision is important, there's an argument for calculating your own value this way.

ms609 commented 3 years ago

A2: Apologies if I've misunderstood the motivation behind this question.

If you're trying to see whether tree2 is more similar to tree1 than expected by chance, then you can use your sample of random distances again:

tree2 <- PectinateTree(8) # insert your own tree here
dist12 <- ClusteringInfoDistance(tree1, tree2, normalize = TRUE)
# Now count the number of random trees that are this similar to tree1
nThisSimilar <- sum(randomDists < dist12)
pValue <- nThisSimilar / nRep

Jigyasa3 commented 3 years ago

Hey @ms609

Thank you so much for the randomization method! I compared the randomized host tree (tree1) using ClusteringInfoDistance() and NyeSimilarity() for the same host-symbiont pair, and I get very different p-values: ClusteringInfoDistance() gives p = 0, while NyeSimilarity() gives p = 1. I used 100,000 replicates for the randomization.

Why do you think that is happening? It's the same host-symbiont pair. Shouldn't the two distance methods give similar results?

ms609 commented 3 years ago

That'll be because ClusteringInfoDistance() measures distance, while NyeSimilarity() measures similarity. For a "Nye distance", you can use NyeSimilarity(similarity = FALSE), or subtract normalized similarities from 1. To calculate a p value from your similarity scores, reverse the direction of the inequality: nThisSimilar <- sum(randomDists > dist12).
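For later readers, a self-contained sketch of the similarity-based p value, reusing the randomDists approach above (BalancedTree() and PectinateTree() stand in for real host and symbiont trees; I'm assuming NyeSimilarity() accepts a list of comparison trees, as ClusteringInfoDistance() does):

```r
library("TreeDist")
library("TreeTools")  # RandomTree(), BalancedTree(), PectinateTree()

tree1 <- BalancedTree(8)   # placeholder: your host tree
tree2 <- PectinateTree(8)  # placeholder: your symbiont tree

nRep <- 100  # use many more replicates in practice
randomTrees <- lapply(logical(nRep), function (x) RandomTree(tree1$tip.label))

# Similarities between tree1 and random trees, and for the observed pair
randomSims <- NyeSimilarity(tree1, randomTrees, normalize = TRUE, similarity = TRUE)
sim12 <- NyeSimilarity(tree1, tree2, normalize = TRUE, similarity = TRUE)

# For a similarity, "at least this good by chance" means greater or equal
pValue <- sum(randomSims >= sim12) / nRep
```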

Jigyasa3 commented 2 years ago

Hi @ms609

Thank you for the reply about NyeSimilarity measuring similarity and not distance.

I am using the following code:

dist_rf <- ClusteringInfoDistance(tree1, tree2, normalize = TreeTools::NSplits(tree1))
dist_ny <- NyeSimilarity(tree1, tree2, normalize = TreeTools::NSplits(tree1), similarity = FALSE)
dist_ny2 <- NyeSimilarity(tree1, tree2, normalize = TRUE, similarity = FALSE)

But the outputs are very different:

- ClusteringInfoDistance() gives 0.320799663986542
- NyeSimilarity(normalize = TreeTools::NSplits(tree1), similarity = FALSE) gives 1.36131705538099
- NyeSimilarity(normalize = TRUE, similarity = FALSE) gives 0.680658527690494

ms609 commented 2 years ago

Yes, the absolute values will be different, because they are different measures – just as you'd obtain different distances between two cities depending on whether you are measuring the distance by road, rail or river.

You may wish to normalize the results so they fall on a scale of zero to one (see the documentation for normalize = TRUE to see how this is accomplished).

Jigyasa3 commented 2 years ago

Thank you for a quick response @ms609 !

When I use normalize = TRUE for both methods, the results are "more" comparable.

I was wondering if there is a cutoff? For example, if both metrics give an output greater than 0.5, can the trees be considered similar?

ms609 commented 2 years ago

The metrics aren't meant to be equivalent. See for example the behaviour of different metrics with random trees at https://ms609.github.io/TreeDist/articles/using-distances.html#normalizing-to-random-similarity, and, in more detail, https://ms609.github.io/TreeDistData/articles/09-expected-similarity.html

ms609 commented 11 months ago

TreeDist v2.6.0 now natively supports comparison of trees with non-identical sets of leaves. I think this addresses the topics raised in this issue, but please feel free to open a new issue if there's anything I've overlooked.
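As a quick illustration of the new behaviour (made-up five-leaf trees; as I understand it, leaves absent from one tree are now handled internally rather than raising an error):

```r
library("TreeDist")
library("ape")

tree1 <- read.tree(text = "((A,B),(C,(D,E)));")
tree2 <- read.tree(text = "((A,B),(C,(D,F)));")  # leaf F replaces leaf E

# Since TreeDist 2.6.0 this runs directly, despite the differing leaf sets
d <- ClusteringInfoDistance(tree1, tree2)
d
```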