thijsjanzen / nLTT

Repository for the R nLTT package
GNU General Public License v2.0
6 stars 4 forks source link

nLTT is present less around 0.5 #32

Closed richelbilderbeek closed 7 years ago

richelbilderbeek commented 7 years ago

From @franciscorichter :

If you put attention to the average nLTT plot of phylogenies on the report you see that around nLTT = 0.5 there is always a white space showing that the statistic normally does not reach some of those points around.

From his work

2017-02-16-155138_1600x1200_scrot

And added annotation:

2017-02-16-155138_1600x1200_scrot

richelbilderbeek commented 7 years ago

I think that this behavior follows from regular mathematics. An nLTT statistic is calculated by dividing two integers. If both integers have a limited range, I would expect the results of their division being smooth.

@thijsjanzen : what do you think?

richelbilderbeek commented 7 years ago

To quote myself:

I think that this behavior follows from regular mathematics. An nLTT statistic is calculated by dividing two integers. If both integers have a limited range, I would expect the results of their division being smooth.

Here I demonstrate my idea:

issue

This code shows my explanation:

library(ggplot2)

n <- 25

xs <- seq(1, n)

df <- data.frame(
  num = rep(xs, each = n),
  den = rep(xs, times = n)
)  
df <- subset(df, num <= den)
df$x = df$num / df$den  

qplot(data=df, x = 1, y = x, geom = "point")

I'd conclude my idea is correct, but I'd love some unbiased opinion on that. @franciscorichter maybe?

richelbilderbeek commented 7 years ago

If no comments, I will close this Issue in March.

franciscorichter commented 7 years ago

HI Richel, I don´t completely understand your idea. I agree when you said ''I would expect the results of their division being smooth'', but I do not see how your example shows that, the data you plot there is not smooth, but this is because your example has integers only and not a lot of them, whereas the nLTT statistic work with real numbers (is that true?) and the number of simulations are of the order of 10^5. I am pretty sure that if you change n<-25 by n<-1000, then you will have a smooth line.

richelbilderbeek commented 7 years ago

I try to explain the 'integer division' point below. An nLTT is constructed from integer divisions (the number of lineages divided by the max number of lineages).

OTOH, the RPubs workshop by franciscorichter is convincing. I will try now not to divide integers (as below, I just put it there for reference), but use simulated phylogenies instead, to see if some type coercion happens somewhere.

I am working on it :muscle:

Dividing integers

I used n <- 25 just to show it visibly not smooth when plotted as a point chart.

I will try to do the same now with a histogram.

n <- 25

25

n <- 100

100

n <- 1000

1000

Code used

library(ggplot2)

n <- 25

xs <- seq(1, n)

df <- data.frame(
  num = rep(xs, each = n),
  den = rep(xs, times = n)
)  
df <- subset(df, num <= den)
df$x = df$num / df$den  

qplot(df$x[df$x > 0.4 & df$x < 0.6], binwidth = 0.001)
richelbilderbeek commented 7 years ago

Too bad the RPubs workbook is not an mcve, as it is not complete. I guess I'll have to fix it.

richelbilderbeek commented 7 years ago

Here is the reprex:

devtools::install_github("franciscorichter/dmea")

library(dmea)
library(ggplot2)
library(nLTT)

pp <- proc.time()
s <- sim_phyl(seed=2)
s2 <- phylo2p(drop.fossil(s$newick))
n_trees <- 5000
phylos <- vector('list',length=n_trees)
nLTT <- vector(length=n_trees)
gam <- vector(length=n_trees)
phylos2 <- vector('list',length=n_trees)
nLTT2 <- vector(length=n_trees)
gam2 <- vector(length=n_trees)
dt=0.01
for(i in 1:n_trees){
  r <- rec_tree(wt=s2$t)
  r1 <- p2phylo(r)
  phylos[[i]] <- read.tree(text=r1)
  nLTT[i] <- nLTTstat(s$newick, phylos[[i]], distance_method = "abs")
  gam[i] <- gammaStat(phylos[[i]])
  #
  r <- rec_tree(wt=s2$t,rec_method=2)
  r2 <- p2phylo(r)
  phylos2[[i]] <- read.tree(text=r2)
  nLTT2[i] <- nLTTstat(s$newick, phylos2[[i]], distance_method = "abs")
  gam2[i] <- gammaStat(phylos2[[i]])
}
nltt_values <- get_nltt_values(phylos, dt = dt)
# Plot the phylognies, where the individual nLTT values are visible
qplot(t, nltt, data = nltt_values, geom = "point",
      ylim = c(0,1),
      main = "Average nLTT plot of phylogenies", size = I(0.1)
) + stat_summary(
  fun.data = "mean_cl_boot", color = "red", geom = "smooth"
)
franciscorichter commented 7 years ago

If you install the last version of dmea package then it should be enough to reproduce everything on the Rpubs doc, isn't it?

richelbilderbeek commented 7 years ago

And this is the error I get:

issue

Looks like #31 has just made another victim...

richelbilderbeek commented 7 years ago

@franciscorichter : yes, reproducing the RBook was not that hard, thanks to your great work!

franciscorichter commented 7 years ago

Oh, the trees I am simulating does have extinct species, that means that are not ultrametric and that means that the calculation of nLTTstat function does not make sense, is that true? ... I am not sure if that fact has to do with the gaps on the estimations though. In any case, do you know if is possible to calculate the nLTT statistic for trees considering extinct species?

richelbilderbeek commented 7 years ago

Weird, the reprex did not work anymore :confused: , sent the Issue above.

richelbilderbeek commented 7 years ago

OK, thanks to @franciscorichter for fixing his code, resulting into this:

devtools::install_github("franciscorichter/dmea")
devtools::install_github("richelbilderbeek/nLTT", ref = "develop")

library(dmea)
library(Hmisc)
library(ggplot2)
library(nLTT)

pp <- proc.time()
s <- sim_phyl(seed=2)
s2 <- phylo2p(drop.fossil(s$newick))
n_trees <- 5000
phylos <- vector('list',length=n_trees)

nLTT <- vector(length=n_trees)
gam <- vector(length=n_trees)
phylos2 <- vector('list',length=n_trees)
nLTT2 <- vector(length=n_trees)
gam2 <- vector(length=n_trees)
dt <- 0.01
for(i in 1:n_trees){
  r <- rec_tree(wt=s2$wt)
  r1 <- p2phylo(r)
  phylos[[i]] <- r1
  nLTT[i] <- ltt_stat(s$newick, phylos[[i]])
  gam[i] <- gammaStat(phylos[[i]])
  r <- rec_tree(wt=s2$wt,rec_method=2)
  r2 <- p2phylo(r)
  phylos2[[i]] <- r2
  nLTT2[i] <- ltt_stat(s$newick, phylos2[[i]])
  gam2[i] <- gammaStat(phylos2[[i]])
}

testit::assert(inherits(phylos[[1]], "phylo")) # Prerequisite for get_nltt_values

nltt_values <- get_nltt_values(phylos, dt = dt)
# Plot the phylognies, where the individual nLTT values are visible
qplot(t, nltt, data = nltt_values, geom = "point",
      ylim = c(0,1),
      main = "Average nLTT plot of phylogenies", size = I(0.1)
) + stat_summary(
  fun.data = "mean_cl_boot", color = "red", geom = "smooth"
)

Sadly, it takes too long on my computer.

I will try to create a simpler mcve

richelbilderbeek commented 7 years ago

Here it is. It creates n trees and measures the nLTT statistic at dt resolution:

library(ggplot2)

dt <- 0.001
n <- 1000
phylos <- list()
for (i in 1:n)
{
  phylos[[i]] <- ape::rcoal(i + 1)
}

nltt_values <- nLTT::get_nltt_values(phylos, dt = dt)
# Plot the phylognies, where the individual nLTT values are visible
qplot(t, nltt, data = nltt_values, geom = "point",
      ylim = c(0,1),
      xlim = c(0.98,1.0),
      main = "Average nLTT plot of phylogenies", size = I(0.1)
) + stat_summary(
  fun.data = "mean_cl_boot", color = "red", geom = "smooth"
)

Zoom in on the most interesting part:

nlttplotzoom

Note that I there appears to be no strange behavior of the nLTT statistic here.

@franciscorichter : may it be a feature of the dmea model to result in the nLTT plots as shown in the peculariar plot at the top of this Issue?

franciscorichter commented 7 years ago

when I use ltt_stat I do not get this gaps anymore (with same phylos)...

by other hand I got this, do you know what is it?

 nltt_values <- get_nltt_values(phylos, dt = dt)
Error in get0(oNam, envir = ns) : 
  lazy-load database '/home/p276048/R/x86_64-pc-linux-gnu-library/3.3/nLTT/R/nLTT.rdb' is corrupt
In addition: Warning messages:
1: In get0(oNam, envir = ns) : restarting interrupted promise evaluation
2: In get0(oNam, envir = ns) : internal error -3 in R_decompress1
richelbilderbeek commented 7 years ago

Looks like an off-topic question to me. Try Google and read the first hit

richelbilderbeek commented 7 years ago

As it appears that nLTT works fine, I will close this Issue.