Closed infus0815 closed 3 years ago
Hi,
Dummy question: did you run the garbage collector with gc() after calling rm(mytree)?
Best.
Yep, gc() doesn't free it either. I tested everything I could think of, including clearing the environment and hidden variables.
Only restarting the R session frees the allocated memory.
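One thing that may be worth checking here: gc() only reports memory that R itself manages, so memory allocated by compiled code inside a package would not necessarily show up in its output. A minimal sketch for comparing the two views follows. The helper names process_rss_kb() and r_heap_mb() are made up for illustration, and the /proc lookup assumes Linux (it returns NA elsewhere).

```r
# Hypothetical helper: resident set size of the current R process in kB.
# Reads /proc/self/status, so it only works on Linux; returns NA elsewhere.
process_rss_kb <- function() {
  status_file <- "/proc/self/status"
  if (!file.exists(status_file)) return(NA_real_)
  lines <- readLines(status_file, warn = FALSE)
  rss <- grep("^VmRSS:", lines, value = TRUE)
  if (length(rss) == 0) return(NA_real_)
  as.numeric(gsub("[^0-9]", "", rss[1]))
}

# Hypothetical helper: memory held by R objects, in MB, as reported by gc().
# Column 2 of the gc() matrix is the "used (Mb)" figure for Ncells and Vcells.
r_heap_mb <- function() {
  sum(gc()[, 2])
}

before <- c(heap_mb = r_heap_mb(), rss_kb = process_rss_kb())
x <- numeric(1e6)  # about 8 MB of R-managed memory
rm(x)
after <- c(heap_mb = r_heap_mb(), rss_kb = process_rss_kb())
print(rbind(before, after))
```

If the process-level figure keeps growing across model fits while the gc() figure stays flat, that would point at memory held outside R's heap, which rm() and gc() cannot reclaim.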
Thanks for clarifying this. Let's wait for the developer feedback.
Can you give some code to test with and the results of sessionInfo()?
While turning my own code into a script to show you, I noticed that the probable cause is a column with a large number of factor levels. I managed to replicate the memory problem in this simple script using the churn dataset:
library(C50)
data(churn)
churnTrain[, 2] <- factor(churnTrain[, 2])
lapply(1:300, function(x) {
  treeModel <- C5.0(x = churnTrain[, 1:3], y = churnTrain$churn)
  remove(treeModel)
})
# OR
# for(i in 1:300) {
#   treeModel <- C5.0(x = churnTrain[, 1:3], y = churnTrain$churn)
#   remove(treeModel)
#   gc()
# }
I know it doesn't make sense to convert that column to a factor, but it's only there to replicate the problem.
If you run the script more than once, you can also see that the memory allocated in the first run is not reused. In my real problem I have datasets whose columns have even more factor levels, which sends RAM usage to almost the full 8 GB in a couple of seconds.
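To see which columns might trigger this, it can help to count factor levels per column before fitting. A small sketch on a made-up data frame (the toy data and the threshold of 10 are arbitrary, for illustration only):

```r
# Toy data: one ID-like factor with many levels, one binary factor,
# and one numeric column.
toy <- data.frame(
  id_like = factor(seq_len(100)),
  plan    = factor(rep(c("no", "yes"), 50)),
  minutes = seq(0, 99)
)

# nlevels() returns 0 for non-factor columns.
level_counts <- vapply(toy, nlevels, integer(1))
print(level_counts)

# Columns whose level count exceeds an arbitrary threshold.
high_cardinality <- names(level_counts)[level_counts > 10]
print(high_cardinality)
```

On the reproduction script, the same vapply() call applied to churnTrain[, 1:3] would show how many levels the converted second column has.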
Session Info:
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=pt_PT.UTF-8 LC_NUMERIC=C LC_TIME=pt_PT.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=pt_PT.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=pt_PT.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] C50_0.1.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 lattice_0.20-38 mvtnorm_1.0-8 grid_3.5.1 plyr_1.8.4
[6] magrittr_1.5 stringi_1.2.4 reshape2_1.4.3 rpart_4.1-13 Matrix_1.2-15
[11] partykit_1.2-2 splines_3.5.1 Formula_1.2-3 tools_3.5.1 stringr_1.3.1
[16] Cubist_0.2.2 survival_2.43-1 compiler_3.5.1 libcoin_1.0-1 inum_1.0-0
I used Hadley's chapter on memory at http://adv-r.had.co.nz/memory.html; I think it's worth reading to understand the current case. Interestingly, the first call takes memory that is not released by the remove() call. Subsequent calls require only a small amount of memory, and the leak is very small but real. Hope this helps.
library(C50)
data(churn)
head(churnTrain)
#> state account_length area_code international_plan voice_mail_plan
#> 1 KS 128 area_code_415 no yes
#> 2 OH 107 area_code_415 no yes
#> 3 NJ 137 area_code_415 no no
#> 4 OH 84 area_code_408 yes no
#> 5 OK 75 area_code_415 yes no
#> 6 AL 118 area_code_510 yes no
#> number_vmail_messages total_day_minutes total_day_calls total_day_charge
#> 1 25 265.1 110 45.07
#> 2 26 161.6 123 27.47
#> 3 0 243.4 114 41.38
#> 4 0 299.4 71 50.90
#> 5 0 166.7 113 28.34
#> 6 0 223.4 98 37.98
#> total_eve_minutes total_eve_calls total_eve_charge total_night_minutes
#> 1 197.4 99 16.78 244.7
#> 2 195.5 103 16.62 254.4
#> 3 121.2 110 10.30 162.6
#> 4 61.9 88 5.26 196.9
#> 5 148.3 122 12.61 186.9
#> 6 220.6 101 18.75 203.9
#> total_night_calls total_night_charge total_intl_minutes total_intl_calls
#> 1 91 11.01 10.0 3
#> 2 103 11.45 13.7 3
#> 3 104 7.32 12.2 5
#> 4 89 8.86 6.6 7
#> 5 121 8.41 10.1 3
#> 6 118 9.18 6.3 6
#> total_intl_charge number_customer_service_calls churn
#> 1 2.70 1 no
#> 2 3.70 1 no
#> 3 3.29 0 no
#> 4 1.78 2 no
#> 5 2.73 3 no
#> 6 1.70 0 no
dim(churnTrain)
#> [1] 3333 20
summary(churnTrain)
#> state account_length area_code international_plan
#> WV : 106 Min. : 1.0 area_code_408: 838 no :3010
#> MN : 84 1st Qu.: 74.0 area_code_415:1655 yes: 323
#> NY : 83 Median :101.0 area_code_510: 840
#> AL : 80 Mean :101.1
#> OH : 78 3rd Qu.:127.0
#> OR : 78 Max. :243.0
#> (Other):2824
#> voice_mail_plan number_vmail_messages total_day_minutes total_day_calls
#> no :2411 Min. : 0.000 Min. : 0.0 Min. : 0.0
#> yes: 922 1st Qu.: 0.000 1st Qu.:143.7 1st Qu.: 87.0
#> Median : 0.000 Median :179.4 Median :101.0
#> Mean : 8.099 Mean :179.8 Mean :100.4
#> 3rd Qu.:20.000 3rd Qu.:216.4 3rd Qu.:114.0
#> Max. :51.000 Max. :350.8 Max. :165.0
#>
#> total_day_charge total_eve_minutes total_eve_calls total_eve_charge
#> Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.00
#> 1st Qu.:24.43 1st Qu.:166.6 1st Qu.: 87.0 1st Qu.:14.16
#> Median :30.50 Median :201.4 Median :100.0 Median :17.12
#> Mean :30.56 Mean :201.0 Mean :100.1 Mean :17.08
#> 3rd Qu.:36.79 3rd Qu.:235.3 3rd Qu.:114.0 3rd Qu.:20.00
#> Max. :59.64 Max. :363.7 Max. :170.0 Max. :30.91
#>
#> total_night_minutes total_night_calls total_night_charge
#> Min. : 23.2 Min. : 33.0 Min. : 1.040
#> 1st Qu.:167.0 1st Qu.: 87.0 1st Qu.: 7.520
#> Median :201.2 Median :100.0 Median : 9.050
#> Mean :200.9 Mean :100.1 Mean : 9.039
#> 3rd Qu.:235.3 3rd Qu.:113.0 3rd Qu.:10.590
#> Max. :395.0 Max. :175.0 Max. :17.770
#>
#> total_intl_minutes total_intl_calls total_intl_charge
#> Min. : 0.00 Min. : 0.000 Min. :0.000
#> 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
#> Median :10.30 Median : 4.000 Median :2.780
#> Mean :10.24 Mean : 4.479 Mean :2.765
#> 3rd Qu.:12.10 3rd Qu.: 6.000 3rd Qu.:3.270
#> Max. :20.00 Max. :20.000 Max. :5.400
#>
#> number_customer_service_calls churn
#> Min. :0.000 yes: 483
#> 1st Qu.:1.000 no :2850
#> Median :1.000
#> Mean :1.563
#> 3rd Qu.:2.000
#> Max. :9.000
#>
hist(churnTrain[, 2])
rug(churnTrain[, 2])
head(sort(churnTrain[, 2]), 50)
#> [1] 1 1 1 1 1 1 1 1 2 3 3 3 3 3 4 5 6 6 7 7 8 9 9
#> [24] 9 10 10 10 11 11 11 11 12 12 12 13 13 13 13 13 13 13 13 13 15 15 15
#> [47] 16 16 16 16
library(pryr)
object_size(churnTrain)
#> 382 kB
churnTrain[, 2] <- factor(churnTrain[, 2])
object_size(churnTrain)
#> 395 kB
for(i in 1:30) {
  cat(i, "\n", mem_used(), "\n", sep = "")
  cat(mem_change(treeModel <- C5.0(x = churnTrain[, 1:3], y = churnTrain$churn)), "\n")
  cat(mem_change(remove(treeModel)), "\n")
  gc()
  cat(mem_used(), "\n")
}
#> 1
#> 111131416
#> 243424
#> -3624
#> 111369584
#> 2
#> 111370080
#> 5152
#> -3680
#> 111369952
#> 3
#> 111370416
#> 5152
#> -3680
#> 111370096
#> 4
#> 111370560
#> 5152
#> -3680
#> 111370264
#> 5
#> 111370736
#> 5152
#> -3680
#> 111370448
#> 6
#> 111370920
#> 5152
#> -3680
#> 111370632
#> 7
#> 111371104
#> 5152
#> -3680
#> 111370816
#> 8
#> 111371288
#> 5152
#> -3680
#> 111371000
#> 9
#> 111371472
#> 5152
#> -3680
#> 111371184
#> 10
#> 111371656
#> 5152
#> -3680
#> 111371368
#> 11
#> 111371840
#> 5152
#> -3680
#> 111371552
#> 12
#> 111372024
#> 5152
#> -3680
#> 111371736
#> 13
#> 111372208
#> 5152
#> -3680
#> 111371920
#> 14
#> 111372392
#> 5152
#> -3680
#> 111372104
#> 15
#> 111372576
#> 5152
#> -3680
#> 111372288
#> 16
#> 111372760
#> 5152
#> -3680
#> 111372472
#> 17
#> 111372944
#> 5152
#> -3680
#> 111372656
#> 18
#> 111373128
#> 5152
#> -3680
#> 111372840
#> 19
#> 111373312
#> 5152
#> -3680
#> 111373024
#> 20
#> 111373496
#> 5152
#> -3680
#> 111373208
#> 21
#> 111373680
#> 5152
#> -3680
#> 111373392
#> 22
#> 111373864
#> 5152
#> -3680
#> 111373576
#> 23
#> 111374048
#> 5152
#> -3680
#> 111373760
#> 24
#> 111374232
#> 5152
#> -3680
#> 111373944
#> 25
#> 111374416
#> 5152
#> -3680
#> 111374128
#> 26
#> 111374600
#> 5152
#> -3680
#> 111374312
#> 27
#> 111374784
#> 5152
#> -3680
#> 111374496
#> 28
#> 111374968
#> 5152
#> -3680
#> 111374680
#> 29
#> 111375152
#> 5152
#> -3680
#> 111374864
#> 30
#> 111375336
#> 5152
#> -3680
#> 111375048
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252
#> [2] LC_CTYPE=English_United Kingdom.1252
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pryr_0.1.4 C50_0.1.2
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.0 Formula_1.2-3 knitr_1.20 magrittr_1.5
#> [5] splines_3.5.1 lattice_0.20-38 stringr_1.3.1 plyr_1.8.4
#> [9] tools_3.5.1 grid_3.5.1 htmltools_0.3.6 yaml_2.2.0
#> [13] survival_2.43-1 rprojroot_1.3-2 digest_0.6.18 inum_1.0-0
#> [17] libcoin_1.0-1 Matrix_1.2-15 reshape2_1.4.3 codetools_0.2-15
#> [21] rpart_4.1-13 Cubist_0.2.2 evaluate_0.12 rmarkdown_1.10
#> [25] stringi_1.2.4 compiler_3.5.1 backports_1.1.2 partykit_1.2-2
#> [29] mvtnorm_1.0-8
Created on 2018-12-04 by the reprex package (v0.2.1)
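To put a number on "very small but real", here is a short sketch that takes the mem_used() values printed after gc() in the first ten iterations of the output above and computes the growth per iteration. The values are copied from that output; nothing else is measured here.

```r
# mem_used() after gc() for iterations 1 to 10, copied from the output above.
after_gc <- c(
  111369584, 111369952, 111370096, 111370264, 111370448,
  111370632, 111370816, 111371000, 111371184, 111371368
)

# Growth in bytes between consecutive iterations.
growth <- diff(after_gc)
print(growth)
#> [1] 368 144 168 184 184 184 184 184 184

# Steady-state growth per call, and the projection for 300 calls.
steady <- mean(tail(growth, 6))
print(steady * 300)
#> [1] 55200
```

At about 184 bytes per call, 300 calls would add roughly 55 kB to what mem_used() reports. As far as I understand, mem_used() only counts memory managed by R, so this figure may not capture allocations made by compiled code, which could be why it looks so much smaller than the growth described in the original report.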
I have not been able to track this down. Please add a PR if you can find the issue.
I've been using C5.0 trees to solve a pairwise ranking problem.
That requires creating several C5.0 models for each problem. The issue is that the memory allocated during a C5.0 call never gets released to the system. That means if I'm making 100 or more C5.0 calls, the R session eventually uses all available memory. Even after the script finishes, the memory is still not released, forcing a session restart.
Tested on both Windows and Linux with the same results.
I wonder if there's a quick fix, as C5.0 trees are what give me the best prediction results.
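Until the cause is found, one possible workaround is to run each fit in a short-lived worker process, so that whatever memory the fit holds is returned to the operating system when the worker exits. This is only a sketch under the assumption that the memory is tied to the process doing the fitting, which nobody in this thread has confirmed. run_in_worker() is a hypothetical helper name; it uses the parallel package that ships with R.

```r
library(parallel)

# Hypothetical helper: evaluate fun(...) in a fresh single-worker cluster
# and shut the worker down afterwards, releasing its memory to the OS.
run_in_worker <- function(fun, ...) {
  cl <- makeCluster(1)
  on.exit(stopCluster(cl), add = TRUE)
  clusterCall(cl, fun, ...)[[1]]
}

# Generic check that the work really happens in another process.
worker_pid <- run_in_worker(Sys.getpid)
print(worker_pid != Sys.getpid())

# Untested sketch of how the reproduction script might use it:
# preds <- run_in_worker(function(x, y) {
#   model <- C50::C5.0(x = x, y = y)
#   predict(model, x)
# }, x = churnTrain[, 1:3], y = churnTrain$churn)
```

Starting a worker for every model adds overhead, so batching several fits per worker and restarting it periodically may be a better trade-off for hundreds of calls.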