An R package for fitting Quinlan's C5.0 classification model
Memory Leak? #19

Closed infus0815 closed 3 years ago

infus0815 commented 5 years ago

I've been using c50 trees to solve a pairwise ranking problem.

That requires creating several C5.0 models for each problem. The issue is that the memory allocated during a C5.0 call is never released to the system. If I make 100 or more C5.0 calls, the R session eventually uses all available memory. Even after the script finishes, the memory is still not released, which forces a session restart.

Tested on both Windows and Linux with the same results.

I wonder if there's a quick fix, as C5.0 trees are what give me the best prediction results.
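One possible stopgap, independent of where the memory is actually held, is to run each fit in a short-lived child process so that the operating system reclaims everything when the child exits. The sketch below assumes a Unix-alike (it relies on fork() through the parallel package, so it does not apply to Windows). `run_isolated` and `toy_fit` are hypothetical names used for illustration, and the approach has not been verified against C50 itself.

```r
# Sketch: evaluate a memory-hungry call in a forked child process so that
# whatever it allocates is returned to the OS when the child exits.
# Unix-alikes only: parallel::mcparallel() relies on fork().
library(parallel)

# Hypothetical helper (not part of C50): run fit_fun(...) in a child process
# and hand only its return value back to the parent session.
run_isolated <- function(fit_fun, ...) {
  job <- mcparallel(fit_fun(...))
  mccollect(job)[[1]]
}

# Stand-in for a model fit. For the use case above this would be something
# like: function(x, y) C50::C5.0(x = x, y = y)
toy_fit <- function(n) sum(seq_len(n))

result <- run_isolated(toy_fit, 10)
print(result)
```

The returned object is serialised back to the parent, so the parent still pays for the model it keeps, but not for any scratch memory the fit left behind in the child.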

SamGG commented 5 years ago

Hi. Dummy question: did you run the garbage collector, gc(), after rm(mytree)? Best.

infus0815 commented 5 years ago

Yep, gc() doesn't free it either. I tested everything I could think of, including clearing the environment and hidden variables.

Only restarting the R session frees the allocated memory.
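One way to narrow this down is to compare what R itself accounts for with what the operating system sees. gc() only reports memory used by R objects; memory allocated by native code outside R's heap would not appear there. The sketch below assumes Linux (it reads VmRSS from /proc/self/status), and `r_heap_mb` and `process_rss_mb` are hypothetical helper names.

```r
# Sketch: compare R's own heap accounting with the resident memory of the
# whole R process. Linux only: reads VmRSS from /proc/self/status.

# Memory used by R objects, in MB, as reported by gc().
r_heap_mb <- function() {
  sum(gc()[, 2])  # column 2 holds "used (Mb)" for Ncells and Vcells
}

# Resident set size of the current R process in MB, or NA if unavailable.
process_rss_mb <- function() {
  status_file <- "/proc/self/status"
  if (!file.exists(status_file)) return(NA_real_)
  rss_line <- grep("^VmRSS:", readLines(status_file), value = TRUE)
  if (length(rss_line) == 0) return(NA_real_)
  as.numeric(gsub("[^0-9]", "", rss_line)) / 1024  # value is reported in kB
}

big <- numeric(5e6)  # roughly 40 MB of R-managed memory
before <- c(heap = r_heap_mb(), rss = process_rss_mb())
rm(big)
after <- c(heap = r_heap_mb(), rss = process_rss_mb())
print(rbind(before, after))
```

If the heap figure returns to its baseline after rm() and gc() while the resident size keeps growing across model fits, that would suggest the memory is held outside R's managed heap.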

SamGG commented 5 years ago

Thanks for clarifying this. Let's wait for the developer feedback.

topepo commented 5 years ago

Can you give some code to test with and the results of sessionInfo()?

infus0815 commented 5 years ago

While building a script from my own code to show you, I noticed that the probable cause is a column with a large number of factor levels. I managed to reproduce the memory problem in this simple script using the churn dataset:



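# Assumed setup: the churn data (churnTrain) as shipped with C50 0.1.2
library(C50)
data(churn)

# Turn a numeric column into a factor with many levels
churnTrain[, 2] <- factor(churnTrain[, 2])

lapply(1:300, function(x) {
  treeModel <- C5.0(x = churnTrain[, 1:3], y = churnTrain$churn)
})

# OR
# for(i in 1:300) {
#   treeModel <- C5.0(x = churnTrain[, 1:3], y = churnTrain$churn)
#   remove(treeModel)
#   gc()
# }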

I know it doesn't make sense to convert that column to a factor, but it's only there to reproduce the problem.

If you run the script more than once, you can also see that the memory allocated in the first run is not reused. In my problem I have datasets whose columns have even more factor levels, which drives RAM usage to almost full (8 GB) in a couple of seconds.

Session Info:

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/
LAPACK: /usr/lib/lapack/

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] C50_0.1.2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 lattice_0.20-38 mvtnorm_1.0-8 grid_3.5.1 plyr_1.8.4
[6] magrittr_1.5 stringi_1.2.4 reshape2_1.4.3 rpart_4.1-13 Matrix_1.2-15
[11] partykit_1.2-2 splines_3.5.1 Formula_1.2-3 tools_3.5.1 stringr_1.3.1
[16] Cubist_0.2.2 survival_2.43-1 compiler_3.5.1 libcoin_1.0-1 inum_1.0-0

SamGG commented 5 years ago

I worked through Hadley's chapter about memory; I think it's worth reading to understand the current case. Interestingly, the first call takes memory that is not released by the remove call. Subsequent calls require a small amount of memory, and the leak is very small but real. Hope this helps.
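A measurement along these lines can be sketched as follows: record the process's resident memory after every call, so a large one-off allocation on the first call can be told apart from steady per-call growth. This assumes Linux (VmRSS from /proc/self/status). `process_rss_mb`, `track_rss` and `workload` are hypothetical names, and the stand-in workload is not a C5.0 fit.

```r
# Sketch: record resident memory after each repeated call to separate a
# one-off first-call allocation from steady per-call growth.
# Linux only: reads VmRSS from /proc/self/status.

# Resident set size of the current R process in MB, or NA if unavailable.
process_rss_mb <- function() {
  status_file <- "/proc/self/status"
  if (!file.exists(status_file)) return(NA_real_)
  rss_line <- grep("^VmRSS:", readLines(status_file), value = TRUE)
  if (length(rss_line) == 0) return(NA_real_)
  as.numeric(gsub("[^0-9]", "", rss_line)) / 1024  # value is reported in kB
}

# Call f() n times, discarding each result and running gc() afterwards.
# Returns the resident memory (MB) before the loop and after each call.
track_rss <- function(f, n) {
  rss <- numeric(n + 1)
  rss[1] <- process_rss_mb()
  for (i in seq_len(n)) {
    invisible(f())
    gc()
    rss[i + 1] <- process_rss_mb()
  }
  rss
}

# Stand-in workload. For the case discussed here, f would be something like:
# function() C50::C5.0(x = churnTrain[, 1:3], y = churnTrain$churn)
workload <- function() sum(rnorm(1e5))

rss <- track_rss(workload, 5)
growth <- diff(rss)  # growth[1] belongs to the first call, the rest to later calls
print(round(growth, 2))
```

Resident size is a noisy signal, since the allocator may hold on to freed pages, so a trend over many iterations is more informative than any single difference.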


topepo commented 3 years ago

I have not been able to track this down. Please add a PR if you can find the issue.