hmorlon / PANDA

Phylogenetic ANalyses of DiversificAtion
24 stars 15 forks source link

JSDtree fails on some other trees #26

Closed richelbilderbeek closed 4 years ago

richelbilderbeek commented 4 years ago

Thanks for fixing #23 and #25! Here is another one (with the latest version of master, of course):

When I run JSDtree, it usually does exactly what it should: produce a matrix with zeroes on the diagonal and Jensen-Shannon distances in the other cells. Well done :+1:

However, for some trees, the function fails:

tree_1 <- ape::read.tree(text = "(((t3:0.1411719966,(t5:0.03526457828,t4:0.03526457828):0.1059074183):0.06732406607,t6:0.2084960626):1.756416178,((t7:0.2838467566,t1:0.2838467566):0.4396700074,t2:0.723516764):1.241395476);")
tree_2 <- ape::read.tree(text = "((t1:0.01716155337,t4:0.01716155337):0.003048532642,(t3:0.01927721092,t2:0.01927721092):0.0009328750893);")
trees <- c(tree_1, tree_2)

# Should pass
testthat::expect_silent(
  RPANDA::JSDtree(trees)
)

Running this, gives the following error:

Error in density.default(m[[i]]) : 
  need at least 2 points to select a bandwidth automatically

Just for convenience, these are the trees:

plot

I would expect either a clear error message or a (valid) result. I hope you will add this reprex to your test suite and fix it.

Just for completness, I add the complete brute-force script I used to find these trees:

library(RPANDA)

n <- 100
distances <- rep(NA, n)

for (i in seq_len(n)) {
  set.seed(i)
  trees <- c(
    ape::rcoal(n = sample(x = seq(3, 10), size = 1)),
    ape::rcoal(n = sample(x = seq(3, 10), size = 1))
  )
  distances[i] <- RPANDA::JSDtree(phylo = trees)[1, 2]
}

Good luck fixing and keep up the good work :+1:

elewitus commented 4 years ago

This is because of the size of tree_2. It has no eigenvalues >=1, so its density cannot be computed. In the paper (Lewitus & Morlon, Syst Bio, 2016) we recommend against comparing trees of considerably different sizes (70-fold difference in total edge-length between tree_1 and tree_2) using the MGL, but recommend instead using the normalized MGL (JSDtree(phylos,meth='normal1')). More generally, the limited number of possible differences among unlabelled trees of such size (<10 tips) may present problems for distinguishing processes. However, if you want to circumvent the error, you can alter lines 58-59 to be:

m[[i]]<-subset(treeEigen[[i]]$values, treeEigen[[i]]$values>=0)

Thanks!

richelbilderbeek commented 4 years ago

@elewitus: you want me to submit a Pull Request with that fix?

elewitus commented 4 years ago

@richelbilderbeek Do you mean to lines 58-59? I was suggesting you could alter these locally, but I want them to remain as is for the reasons stated above. I may just add a more informative error message.

richelbilderbeek commented 4 years ago

Would be great! Thanks :+1: !

richelbilderbeek commented 4 years ago

Upon re-read, there are two things:

The first point you mention, is that the tree_2 has no eigenvalue bigger or equal to one. I hope this will give a helpful error, e.g. like:

tree_1 <- ape::read.tree(text = "(((t3:0.1411719966,(t5:0.03526457828,t4:0.03526457828):0.1059074183):0.06732406607,t6:0.2084960626):1.756416178,((t7:0.2838467566,t1:0.2838467566):0.4396700074,t2:0.723516764):1.241395476);")
tree_2 <- ape::read.tree(text = "((t1:0.01716155337,t4:0.01716155337):0.003048532642,(t3:0.01927721092,t2:0.01927721092):0.0009328750893);")
trees <- c(tree_1, tree_2)

testthat::expect_error(
  RPANDA::JSDtree(trees),
  "Cannot calculate Jensen-Shannen distance, as phylo[[2]] has no eigenvalues bigger or equal to one"
)

Second point you mentioned: a different method should be used when comparing trees of very different sizes. I would expect a more helpful error message:

tree_1 <- ape::read.tree(text = "((A:2, B:2):1, C:3);")
tree_1 <- ape::read.tree(text = "((A:2000, B:2000):1000, C:3000);")
trees <- c(tree_1, tree_2)
testthat::expect_error(
  RPANDA::JSDtree(trees, method = "standard"),
  "Use 'method = \"normal1\"' to compare trees with sizes more than ten-fold"
)

tree_1 <- ape::read.tree(text = "((A:2, B:2):1, C:3);")
tree_1 <- ape::read.tree(text = "((A:2000, B:2000):1000, C:3000);")
trees <- c(tree_1, tree_2)
testthat::expect_silent(
  RPANDA::JSDtree(trees, method = "normal1")
)

For this second point, I would expect that the defaults would just work anyway; that JSDtree picks the right method itself (why bother the user with that?):

tree_1 <- ape::read.tree(text = "((A:2, B:2):1, C:3);")
tree_1 <- ape::read.tree(text = "((A:2000, B:2000):1000, C:3000);")
trees <- c(tree_1, tree_2)
testthat::expect_silent(
  RPANDA::JSDtree(trees)
)

Would be cool to see these test among the testthat test! Sure, I can make a Pull Request for that :rainbow:

Good luck, keep up the good work :+1:

richelbilderbeek commented 4 years ago

@elewitus: what is the status of this Issue?

I don't intend to rush you, but I am curious :rainbow:

elewitus commented 4 years ago

That choice should not be determined by tree size alone. It is important that the user decides which method to use, based on h/h data and understanding of the approach.

I have not yet gotten to changing the error message, but it is on my list.