ms609 / TreeDist

Calculate distances between phylogenetic trees in R
https://ms609.github.io/TreeDist/

using a benchmark tree correctly with distance functions #116

Closed jc17659 closed 1 month ago

jc17659 commented 7 months ago

Hi there,

I have been comparing some subsampled simulations to their true tree. I want to use the CID and MSID tree distance metrics, as suggested in Smith (2020), but I feel I may not have been getting the most out of the functions available in TreeDist.

Initially, I used these blocks of code in an attempt to compare each inferred consensus tree against the true benchmark:

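library(foreach)  # provides foreach() and %do%

# Clustering information distance (CID) of each consensus tree from the benchmark
foreach(i = seq_along(consensus_trees), .verbose = FALSE) %do% {
  TreeDist::ClusteringInfoDist(tree1 = consensus_trees[[i]], tree2 = true_benchmark)
}

# Matching split information distance (MSID) of each consensus tree from the benchmark
foreach(i = seq_along(consensus_trees), .verbose = FALSE) %do% {
  TreeDist::MatchingSplitInfoDistance(tree1 = consensus_trees[[i]], tree2 = true_benchmark)
}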

However, I have been reading over the documentation and have become a little confused about whether I should be normalising the results against the true tree, and whether the above code is doing what I intended. Does the code above fail to compare each consensus tree to the true benchmark effectively, and should I in fact have been doing it the following way instead?

# CID, rescaled against the clustering entropy of the benchmark
foreach(i = seq_along(consensus_trees), .verbose = FALSE) %do% {
  TreeDist::ClusteringInfoDist(tree1 = consensus_trees[[i]], tree2 = true_benchmark,
                               normalize = TreeDist::ClusteringEntropy(true_benchmark))
}

# MSID, rescaled against the splitwise information content of the benchmark
foreach(i = seq_along(consensus_trees), .verbose = FALSE) %do% {
  TreeDist::MatchingSplitInfoDistance(tree1 = consensus_trees[[i]], tree2 = true_benchmark,
                                      normalize = TreeDist::SplitwiseInfo(true_benchmark))
}

Going further, should I have been using the TreeDist() function instead of ClusteringInfoDist(), or perhaps the pmax normalise option? Apologies for the confusion, but I would like to compare the different tree metrics effectively, and to visualise equally effectively how the subsampled trees compare against the benchmarks.
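To make the normalisation options concrete, here is a minimal sketch on toy trees. It is not a recommendation from the package author. The names `true_tree` and `test_trees` are hypothetical stand-ins for `true_benchmark` and `consensus_trees`, and the eight-leaf trees from TreeTools are only there so the snippet runs on its own. It computes the same distance three ways: unscaled, rescaled against the benchmark's clustering entropy, and with `normalize = TRUE`.

```r
# Toy illustration only: small stand-in trees, not the data from this issue.
library("TreeDist")   # ClusteringInfoDistance(), ClusteringEntropy()
library("TreeTools")  # BalancedTree(), PectinateTree()

true_tree  <- BalancedTree(8)                     # hypothetical stand-in for true_benchmark
test_trees <- list(true_tree, PectinateTree(8))   # hypothetical stand-ins for consensus_trees

# Unscaled distance from the benchmark, in bits
raw <- vapply(test_trees, function(tr) {
  ClusteringInfoDistance(tr, true_tree)
}, numeric(1))

# Rescaled by a number supplied by the caller: the benchmark's clustering entropy
vs_true <- vapply(test_trees, function(tr) {
  ClusteringInfoDistance(tr, true_tree, normalize = ClusteringEntropy(true_tree))
}, numeric(1))

# Rescaled using the package's own choice of maximum value
vs_default <- vapply(test_trees, function(tr) {
  ClusteringInfoDistance(tr, true_tree, normalize = TRUE)
}, numeric(1))

print(data.frame(raw, vs_true, vs_default))
```

Putting the columns side by side shows that a numeric `normalize` only divides the raw distance by a constant, so it changes the scale but not the ranking of trees against a fixed benchmark.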

Do you have any suggestions about what might look best in this regard? I'm expecting the trees to broadly start to be more similar to the benchmarks, and there will be a lot of subsampling (>5K trees) to compare against the true tree (which isn't that large, ~80 tips). Having a larger range of resultant values would be good, but I wouldn't want to artificially force this, of course.

When I run the analysis as in the first block, I get a range from ~1800 bits to ~800 bits in MSID, and ~35 to ~15 bits in CID. It looks like including the normalization options in the second block would perhaps make the results look a little more intuitive, ranging from ~0.99 to ~0.42 in MSID and ~1.5 to ~0.75 in CID.

I hope you can help; any thoughts would be greatly appreciated. Information theory is very new to me!
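As a sketch of why the rescaled CID values can exceed 1, recall how Smith (2020) defines the clustering information distance. The notation here is an assumption for illustration: H is the clustering entropy of a tree, I is the mutual clustering information of two trees, T_i is an inferred tree and T_true the benchmark.

```latex
\mathrm{CID}(T_i, T_{\mathrm{true}})
  = H(T_i) + H(T_{\mathrm{true}}) - 2\, I(T_i; T_{\mathrm{true}}),
\qquad
0 \le \mathrm{CID}(T_i, T_{\mathrm{true}}) \le H(T_i) + H(T_{\mathrm{true}})

\frac{\mathrm{CID}(T_i, T_{\mathrm{true}})}{H(T_{\mathrm{true}})}
  \le 1 + \frac{H(T_i)}{H(T_{\mathrm{true}})}
```

Dividing by the benchmark's entropy alone therefore gives a ratio that can approach 2 when both trees are equally well resolved, which is consistent with values such as ~1.5 rather than a bug in the loop.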

Thanks

ms609 commented 6 months ago

Thanks for the message, sorry for the slow reply – I'll try to get a response to you this week.

ms609 commented 5 months ago

Sorry about the delay.

Hope that makes sense, and that the updated documentation makes this clearer for other users. Do let me know if you have follow-on questions.

ms609 commented 1 month ago

I'm closing this as completed for now, but if further follow-up would be helpful, please do re-open the issue.