using a benchmark tree correctly with distance functions

jc17659 commented 9 months ago

Hi there,

I have been comparing some subsampled simulations to their true tree. I want to use the CID and MSID tree distance metrics suggested in Smith (2020), but I suspect I have not been getting the most out of the functions available in TreeDist.

Initially, I used these blocks of code in an attempt to compare each inferred consensus tree against the true benchmark:

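library(foreach) # provides foreach() and %do%

# Unnormalized clustering information distance (bits) to the benchmark
foreach(i = seq_along(consensus_trees), .verbose = FALSE) %do% {
  TreeDist::ClusteringInfoDist(tree1 = consensus_trees[[i]], tree2 = true_benchmark)
}

# Unnormalized matching split information distance (bits) to the benchmark
foreach(i = seq_along(consensus_trees), .verbose = FALSE) %do% {
  TreeDist::MatchingSplitInfoDistance(tree1 = consensus_trees[[i]], tree2 = true_benchmark)
}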

However, having read over the documentation, I am a little confused about whether I should be normalising the results against the true tree, and whether the code above does what I intended. Does the code above fail to compare each consensus tree to the true benchmark effectively, and should I instead have done it the following way?

# CID normalized against the clustering entropy of the benchmark
foreach(i = seq_along(consensus_trees), .verbose = FALSE) %do% {
  TreeDist::ClusteringInfoDist(tree1 = consensus_trees[[i]], tree2 = true_benchmark,
                               normalize = TreeDist::ClusteringEntropy(true_benchmark))
}

# MSID normalized against the splitwise information of the benchmark
foreach(i = seq_along(consensus_trees), .verbose = FALSE) %do% {
  TreeDist::MatchingSplitInfoDistance(tree1 = consensus_trees[[i]], tree2 = true_benchmark,
                                      normalize = TreeDist::SplitwiseInfo(true_benchmark))
}

Further, should I have been using the TreeDistance() function instead of ClusteringInfoDist(), or perhaps the pmax normalise option? Apologies for the confusion, but I would like to compare the different tree metrics effectively, and to visualise equally effectively how the subsampled trees compare against the benchmarks.

Do you have any suggestions about what might work best in this regard? I'm expecting the trees to become broadly more similar to the benchmarks, and there will be a lot of subsampled trees (>5K) to compare against the true tree (which isn't that large, ~80 tips). Having a larger range of resultant values would be good, but I wouldn't want to force this artificially, of course.

When I run the analysis as in the first block, I get a range from ~1800 bits to ~800 bits in MSID, and ~35 to ~15 bits in CID. Including the normalization options in the second block seems to make the results a little more intuitive, ranging from ~0.99 to 0.42 in MSID and ~1.5 to ~0.75 in CID.

I hope you can help; any thoughts would be greatly appreciated. Information theory is very new to me!

Thanks

ms609 commented 8 months ago

Thanks for the message, sorry for the slow reply – I'll try to get a response to you this week.

ms609 commented 7 months ago

Sorry about the delay.

Your first block of code calculates the unnormalized distances between trees, in units of bits. As the splitwise and clustering concepts of information are quite different, the absolute values from the two methods are not directly comparable. Moreover, the matching split information distance does not have quite as straightforward an interpretation as the shared phylogenetic information or mutual clustering information measures.
Your second block aims to normalize these against the best case scenario in which the consensus tree contains all the information present in the benchmark. This makes sense, and is what I'd aim to do.
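
A minimal sketch of that normalization, assuming small toy trees stand in for your `consensus_trees` and `true_benchmark` (the Newick strings are invented for illustration). It also relies on the distance functions accepting a list of trees as `tree1`, which should make the `foreach` loop unnecessary:

```r
library("TreeDist")

# Toy stand-ins (assumed for illustration) for the benchmark and consensus trees
true_benchmark <- ape::read.tree(text = "(((a,b),(c,d)),((e,f),(g,h)));")
consensus_trees <- list(
  ape::read.tree(text = "(((a,b),(c,d)),((e,f),(g,h)));"), # identical to benchmark
  ape::read.tree(text = "(((a,b),(c,d)),(e,f,g,h));"),     # contains a polytomy
  ape::read.tree(text = "(((a,e),(c,g)),((b,f),(d,h)));")  # resolved differently
)

# A list as tree1 and a single tree as tree2 gives one distance per list entry.
# Each distance is divided by the information content of the benchmark.
cid <- ClusteringInfoDistance(
  consensus_trees, true_benchmark,
  normalize = ClusteringEntropy(true_benchmark)
)
msid <- MatchingSplitInfoDistance(
  consensus_trees, true_benchmark,
  normalize = SplitwiseInfo(true_benchmark)
)

cid
msid
```

The first entry of each vector should be zero, as that tree is identical to the benchmark.
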
Normalizing with pmax probably isn't what you require here: as you are interested in how much of the information in your benchmark tree is retained, that's the value you want to normalize against. I've given some worked examples of how normalization works at https://ms609.github.io/TreeDist/dev/reference/NormalizeInfo.html. As consensus trees usually contain information-depleting polytomies, pmax would usually equate to the information in the benchmark tree, except in the rare case that a particularly well resolved consensus tree happened to have a shape with slightly more information due to its balance.
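
To illustrate why pmax would usually coincide with the benchmark's information, here is a small sketch using toy trees (assumed for illustration, not taken from your data) in which the consensus tree contains a polytomy:

```r
library("TreeDist")

# Toy trees, assumed for illustration
true_benchmark <- ape::read.tree(text = "(((a,b),(c,d)),((e,f),(g,h)));")
polytomous_consensus <- ape::read.tree(text = "(((a,b),(c,d)),(e,f,g,h));")

h_benchmark <- ClusteringEntropy(true_benchmark)
h_consensus <- ClusteringEntropy(polytomous_consensus)

# The polytomy removes splits, so the consensus tree holds less
# clustering information than the fully resolved benchmark
h_consensus < h_benchmark

# The larger of the two values is therefore the benchmark's own information,
# i.e. the same denominator as normalizing against the benchmark directly
pmax(h_consensus, h_benchmark) == h_benchmark
```
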
TreeDistance() is an alias for ClusteringInfoDistance(); the only difference is how easy the code is to read/write.
It's intriguing that your normalized CID values exceed 1 by such a large margin. Do all your consensus trees exhibit the same leaves as the reference tree?
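
One way to check this, sketched with toy trees (assumed for illustration; the leaf `x` is deliberately absent from the benchmark), is to compare the tip labels of each consensus tree against those of the benchmark:

```r
# Toy trees, assumed for illustration
true_benchmark <- ape::read.tree(text = "(((a,b),(c,d)),((e,f),(g,h)));")
consensus_trees <- list(
  ape::read.tree(text = "(((a,b),(c,d)),(e,f,g,h));"),    # same leaves
  ape::read.tree(text = "(((a,b),(c,d)),((e,f),(g,x)));") # 'x' replaces 'h'
)

# TRUE where a consensus tree has exactly the same leaf set as the benchmark
same_leaves <- vapply(
  consensus_trees,
  function(tr) setequal(tr$tip.label, true_benchmark$tip.label),
  logical(1)
)

# Indices of trees whose leaves differ from the benchmark
mismatched <- which(!same_leaves)

# Leaves of the first mismatched tree that the benchmark lacks
extra_leaves <- setdiff(
  consensus_trees[[mismatched[1]]]$tip.label,
  true_benchmark$tip.label
)
```
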

Hope that makes sense, and that the updated documentation makes this clearer for other users. Do let me know if you have follow-on questions.

ms609 commented 3 months ago

I'm closing this as completed for now, but if further follow-up would be helpful, please do re-open the issue.

ms609 / TreeDist

using a benchmark tree correctly with distance functions #116