ms609 / TreeDist

Calculate distances between phylogenetic trees in R
https://ms609.github.io/TreeDist/

Memory Leak and Output Stalling when Processing Large Datasets with TreeDistance() #123

Open qwer62667771 opened 3 months ago

qwer62667771 commented 3 months ago

I'm encountering an issue while using the TreeDistance() function to process large datasets. After the computations are completed, the process appears to freeze without returning any output. Concurrently, I observe that the memory usage continues to increase indefinitely. This behavior suggests a possible memory leak within the function when dealing with substantial amounts of data.

ms609 commented 2 months ago

Thanks for the report, and sorry to hear you've come up against this issue. Could you give more details of the nature of your large datasets? At a minimum, it would be helpful to know how many trees of how many leaves you are processing. Better still would be if you could share a problematic dataset so I could attempt to reproduce the issue myself. Thanks!

qwer62667771 commented 2 months ago

Thank you very much for your response. Below are the input files and R code I have been using. The issue seems to primarily occur during the assignment of the result of TreeDistance(tree) to the variable distance, where the process either gets stuck or terminates. I later attempted to calculate distances in parallel, which was successful in some instances but failed in others, and I am unsure of the specific reason behind this.

1-all_genetrees.txt

` library('TreeDist')

setwd('R:/Rstudio workplace/wjj_tree_filter/fna_RF')

tree <- tryCatch({
  ape::read.tree('1-all_genetrees.txt')
}, error = function(e) {
  print(paste("Error reading tree file:", e))
  quit(save = "no", status = 1)
})

distance <- tryCatch({
  TreeDistance(tree)
}, error = function(e) {
  print(paste("Error calculating tree distance:", e))
  quit(save = "no", status = 1)
})

distance_matrix <- as.matrix(distance)
write.csv(distance_matrix, "3-distance_matrix.csv", row.names = TRUE) `

ms609 commented 2 months ago

Thanks; I'll try to take a look later this week.

ms609 commented 2 months ago

Thanks for bearing with me whilst I look into this.

Whilst the calculation of the information shared between the trees is reasonably quick, as you have observed, converting these into distances requires calculating the maximum distance between trees with non-overlapping leaf labels – and this post-processing takes much longer, as I've not invested much time in optimizing this.
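
For readers who want a rough sense of how much time the normalization step takes on their own machine, the sketch below times the normalized distance against an unnormalized one on a small set of random trees. It is only an illustration: the random trees from `ape::rmtree()` are a stand-in for a real dataset, the name `someTrees` is hypothetical, and the idea that the difference between the two timings reflects the cost of normalization is an assumption rather than documented behaviour.

```r
library("TreeDist")

# Stand-in data: 20 random trees on 10 leaves (not the dataset from this issue)
set.seed(1)
someTrees <- ape::rmtree(20, 10)

# Normalized distances between every pair of trees
system.time(normalized <- TreeDistance(someTrees))

# Unnormalized clustering information distance, for comparison
system.time(raw <- ClusteringInfoDistance(someTrees, normalize = FALSE))

# Both results describe the same 20 x 20 set of pairwise comparisons
dim(as.matrix(normalized))
```

With a real dataset, replacing `someTrees` with a subset of the trees read from file keeps the comparison quick.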

One delay arises because the trees are presented with node labels. I recently updated the code that reorders trees for analysis and normalization to preserve node labels, but this additional code is not optimized for speed. I'll update the code to automatically remove this information when comparing trees, but in the meantime you can run

trees <- tree
trees[] <- lapply(trees, "[[<-", "node.label", NULL)
trees[] <- lapply(trees, "[[<-", "edge.length", NULL)
trees <- TreeTools::Preorder(trees)
# Then calculate distance with
TreeDistance(trees)
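
The `lapply(trees, "[[<-", name, NULL)` idiom in the snippet above can look cryptic. As a minimal sketch of what it does, the example below applies it to plain lists that stand in for tree objects; `mockTrees` and its contents are hypothetical and are not real `phylo` objects.

```r
# Plain lists standing in for trees (hypothetical; not real phylo objects)
mockTrees <- list(
  list(edge = matrix(1:4, 2), node.label = c("n1", "n2"), edge.length = c(0.1, 0.2)),
  list(edge = matrix(1:4, 2), node.label = c("n1", "n2"))
)

# "[[<-"(x, "node.label", NULL) is the function form of x[["node.label"]] <- NULL,
# which deletes that element from the list
stripped <- lapply(mockTrees, "[[<-", "node.label", NULL)

# Removing an element that is absent (the second list has no edge.length) is harmless
stripped <- lapply(stripped, "[[<-", "edge.length", NULL)

# Check that no entry still carries node labels or edge lengths
noLabels <- vapply(stripped, function(tr) is.null(tr[["node.label"]]), logical(1))
noLengths <- vapply(stripped, function(tr) is.null(tr[["edge.length"]]), logical(1))
all(noLabels, noLengths)
```

The same kind of check can be run on real trees before calling `TreeDistance()`, to confirm that the labels have been stripped.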

I've also updated TreeDist to display a progress bar for the post-processing phase, which should give some indication of how far the calculation has got. Install this version using

devtools::install_github("ms609/TreeTools")
devtools::install_github("ms609/TreeDist")

On my machine, I can now calculate the distances for the trees you provided in around a minute.

There's more that could be done to speed this up – but I can't spare the time for this at present. I'll leave the issue open for when I (or other contributors) have the chance to return to this.

qwer62667771 commented 2 months ago

Thank you for taking the time to resolve this issue. I will try the method you provided and update the R package.