mlr-org / mlr3misc

Miscellaneous helper functions for mlr3
https://mlr3misc.mlr-org.com
GNU Lesser General Public License v3.0

fix: remove srcref after leanification #89

Closed: sebffischer closed this 11 months ago

sebffischer commented 11 months ago

When packages were installed with --with-keep.source, serialized objects were gigantic (and their size depended on the loaded packages); see this issue: https://github.com/mlr-org/mlr3misc/issues/88

I tested that the problem disappears when mlr3 is installed with --with-keep.source together with this version of mlr3misc. The same problem also caused the failed workflows in the mlr3book.
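
For readers unfamiliar with srcrefs, here is a minimal base-R sketch of what removing them means. It is not the mlr3misc implementation: the function text and the helper has_srcref() are made up for illustration, and utils::removeSource() merely stands in for whatever the fix does internally.

```r
# Illustrative only: build a function that carries a srcref, regardless of
# how R or any package was installed.
src <- "function(x) {\n  x + 1  # comment kept only in the source reference\n}"

# Evaluating in baseenv() keeps the closure's environment out of the
# serialized payload, so the sizes below reflect the srcref alone.
f <- eval(parse(text = src, keep.source = TRUE), envir = baseenv())

# Hypothetical helper: does the function carry a source reference?
has_srcref <- function(fn) !is.null(attr(fn, "srcref"))

# Strip the source references from the function and its body.
g <- utils::removeSource(f)

has_srcref(f)  # TRUE
has_srcref(g)  # FALSE

# The serialized form shrinks once the source reference is gone.
size_with <- length(serialize(f, NULL))
size_without <- length(serialize(g, NULL))
size_with > size_without  # TRUE
```

The stripped function behaves the same; only the attached source text (and whatever its srcfile environment references) is lost.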

@berndbischl @mllg @mb706

mb706 commented 11 months ago

good find, hadn't thought about the problem that srcref can take up lots of memory

sebffischer commented 11 months ago

I still don't understand why the measured object size depends on which packages are loaded. Do you have an idea why?

E.g. when saving a learner state (with mlr3 installed using --with-keep.source), the result returned by pryr::object_size() for the reloaded object depends on whether mlr3tuning (or other mlr3 packages) are loaded or not.

sebffischer commented 11 months ago

Consider:

library(mlr3verse)
#> Loading required package: mlr3
library(mlr3)

task = tsk("iris")
learner = lrn("classif.rpart")

learner$train(task)

pth = tempfile(fileext = ".rds")

saveRDS(learner$state, pth)

x = readRDS(pth)

pryr::object_size(x)
#> 19.49 MB

vs

library(mlr3)

task = tsk("iris")
learner = lrn("classif.rpart")

learner$train(task)

pth = tempfile(fileext = ".rds")

saveRDS(learner$state, pth)

x = readRDS(pth)

pryr::object_size(x)
#> 4.00 MB

mb706 commented 11 months ago

It gets worse:

library("mlr3")
task = tsk("iris")
learner = lrn("classif.rpart")

learner$train(task)

pth = tempfile(fileext = ".rds")

saveRDS(learner$state, pth)

x = readRDS(pth)

pryr::object_size(x)
#> 4.00 MB
x$train_task$help
#> function() {
#>       open_help(self$man)
#>     }
#> <environment: 0x563addc97780>
pryr::object_size(x)
#> 1.08 MB

probably some kind of promise being evaluated

mb706 commented 11 months ago

The srcfile attribute of the srcref attribute is an environment; its original member is in turn an environment that contains the field lines, which is a promise:

substitute(lines, attr(attr(x$train_task$help, "srcref"), "srcfile")$original)
#> lazyLoadDBfetch(c(344L, 114431L), datafile, compressed, envhook)

sebffischer commented 11 months ago

Thanks! So the unevaluated promise drags some object (whose size depends on the loaded packages) into the rds file, and once the promise is evaluated this data is freed and the object size changes. Is that correct?

mb706 commented 11 months ago

What is happening is that the srcref itself is a short vector of line numbers / file positions (or some other kind of index), together with an attribute that holds a representation of the source file's content. This representation is an environment with a $lines member, which is a promise: it is only evaluated once someone accesses it. This happens e.g. when printing a function, which uses the srcref together with the source file content to print the text of the function. Before printing, the $lines field is an unevaluated promise that holds its expression and the (large) environment in which that expression would be evaluated. After printing, $lines holds the actual result (I guess the individual lines of the source file) and the promise's reference to its evaluation environment is dropped.
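
The size change described above can be reproduced without mlr3 or pryr. The sketch below is a stand-in under stated assumptions: make_lazy_env(), holder and big are invented names, delayedAssign() replaces the lazy-load machinery, and the length of the serialize() output replaces pryr::object_size().

```r
# Hypothetical stand-in for a srcfile environment with a lazy `lines` field.
make_lazy_env <- function() {
  big <- numeric(1e5)  # roughly 800 kB living in the promise's evaluation environment
  holder <- new.env(parent = emptyenv())
  # `lines` becomes an unevaluated promise whose evaluation environment is
  # this function's frame, which also contains `big`.
  delayedAssign("lines", length(big),
                eval.env = environment(), assign.env = holder)
  holder
}

e <- make_lazy_env()
size_before <- length(serialize(e, NULL))  # includes the frame holding `big`
invisible(e$lines)                         # accessing the field forces the promise
size_after <- length(serialize(e, NULL))   # only the small result remains
size_before > size_after  # TRUE
```

Serializing an environment does not force the promises bound in it, so the unevaluated promise takes its whole evaluation environment along until something accesses the field.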

The offender here is the envenv entry of the environment of the $lines promise:

(Don't know how to inspect the promise's environment with base R, and even pryr seems to be a bit convoluted, since it can only inspect symbols, not members of environments?)

prominfo <- evalq(pi(lines), list(pi = pryr::promise_info), attr(attr(x$train_task$help, "srcref"), "srcfile")$original)

prom_env <- prominfo$env

names(prom_env$envenv)
#>  [1] "env::150" "env::151" "env::152" "env::10"  "env::157" "env::13"
#> .......

It appears to contain lots of environments. Maybe they are all environments that can be accessed from within mlr3, or maybe they are all the environments loaded in total? In the latter case, the influence of loading other packages would be obvious, in the former the influence would be indirect, since loading other packages makes the dictionaries, like mlr_learners, bigger.

Interestingly, printing a single function from within x makes the whole object smaller, since the "srcfile" attribute is an environment that is shared between all functions in the R6 object (and all other functions loaded from the package, or loaded from the same RDS-file). There is only a single $lines promise that can be triggered.

x = readRDS(pth)
y = readRDS(pth)
pryr::object_size(x)
#> 4.00 MB
pryr::object_size(y)
#> 4.00 MB
x$train_task$help
#> function() {
#>       open_help(self$man)
#>     }
#> <environment: 0x563addc97780>
pryr::object_size(x)
#> 1.08 MB
pryr::object_size(y)
#> 4.00 MB
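
That functions coming from the same source share a single srcfile environment can also be checked in base R. This is only a sketch: the two toy functions and the helper srcfile_of() are made up for illustration, and parse(text = ...) stands in for code loaded from an installed package.

```r
# Two functions parsed from the same source text, with source references kept.
src <- "f <- function(x) x + 1\ng <- function(y) y * 2"
env <- new.env()
eval(parse(text = src, keep.source = TRUE), envir = env)

# Hypothetical helper: the srcfile environment behind a function's srcref.
srcfile_of <- function(fn) attr(attr(fn, "srcref"), "srcfile")

is.environment(srcfile_of(env$f))                # TRUE
identical(srcfile_of(env$f), srcfile_of(env$g))  # TRUE: one shared environment
```

Because environments have reference semantics, anything that changes the shared srcfile (such as forcing a promise in it) is visible through every function that points to it.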

It is also possible to trigger the lines promise before saving, which removes the offending environment and makes all objects the same size again. The following gives 1.08 MB for me, with and without the library(mlr3verse) line.

library(mlr3verse)
#> Loading required package: mlr3
library(mlr3)

task = tsk("iris")
learner = lrn("classif.rpart")

learner$train(task)

pth = tempfile(fileext = ".rds")
learner$state$train_task$help
#> function() {
#>       open_help(self$man)
#>     }
#> <environment: 0x563addc97780>
saveRDS(learner$state, pth)

x = readRDS(pth)

pryr::object_size(x)
#> 1.08 MB