sebffischer closed 11 months ago
Good find, I hadn't thought about the problem that srcrefs can take up lots of memory.
I still don't understand why the measures object size depends on the packages that are loaded; do you have an idea why?
E.g. when saving a learner state (with mlr3 installed using --with-keep.source), the result returned by pryr::object_size() depends on whether mlr3tuning (or other mlr3 packages) are loaded or not.
Consider:
library(mlr3verse)
#> Loading required package: mlr3
library(mlr3)
task = tsk("iris")
learner = lrn("classif.rpart")
learner$train(task)
pth = tempfile(fileext = ".rds")
saveRDS(learner$state, pth)
x = readRDS(pth)
pryr::object_size(x)
#> 19.49 MB
vs
library(mlr3)
task = tsk("iris")
learner = lrn("classif.rpart")
learner$train(task)
pth = tempfile(fileext = ".rds")
saveRDS(learner$state, pth)
x = readRDS(pth)
pryr::object_size(x)
#> 4.00 MB
It gets worse:
library("mlr3")
task = tsk("iris")
learner = lrn("classif.rpart")
learner$train(task)
pth = tempfile(fileext = ".rds")
saveRDS(learner$state, pth)
x = readRDS(pth)
pryr::object_size(x)
#> 4.00 MB
x$train_task$help
#> function() {
#> open_help(self$man)
#> }
#> <environment: 0x563addc97780>
pryr::object_size(x)
#> 1.08 MB
Probably some kind of promise being evaluated.
The srcfile attribute of the srcref attribute is an environment that contains the field lines, which is a promise:
substitute(lines, attr(attr(x$train_task$help, "srcref"), "srcfile")$original)
#> lazyLoadDBfetch(c(344L, 114431L), datafile, compressed, envhook)
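For readers who want to try this without mlr3, here is a minimal base R sketch of the same inspection trick. The environment `env` and the expression assigned to `lines` are made up for illustration; only `delayedAssign()` and `substitute()` are taken from base R.

```r
# Hypothetical stand-in for the srcfile environment.
env <- new.env()

# Create a promise named `lines`; the expression is not evaluated yet.
delayedAssign("lines", toupper(c("a", "b")), assign.env = env)

# substitute() on a symbol bound to a promise returns the promise's expression
# without forcing it.
expr <- substitute(lines, env)
print(expr)

# Accessing the binding forces the promise and yields the value.
value <- env$lines
print(value)
```

Note that `substitute()` only exposes the expression of the promise, not the environment in which it will be evaluated.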
Thanks! So the promise ensures that some object (whose size depends on the loaded packages) is part of the RDS file, and once the promise is evaluated this data is freed and the object size changes, correct?
What is happening is that the srcref is itself a short vector of line numbers / file positions (or some other kind of index), together with an attribute that contains a representation of the content of the source file. This representation is an environment that has the $lines member, which is a promise: it is only evaluated once someone accesses it. This happens e.g. when printing a function, which uses the srcref together with the source file content to print the text of the function. Before printing, the $lines field is an unevaluated promise, containing the expression and the (large) environment in which it is evaluated. After printing, $lines contains the actual result (I guess individual lines of source files) and its eval.env is nulled.
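The effect can be sketched in base R without any packages. This is only an illustration under assumptions: `big_env`, `payload` and `holder` are hypothetical names standing in for the promise's evaluation environment and the srcfile environment, and `length(serialize(...))` is used as a rough proxy for the size of the saved RDS content.

```r
# Stand-in for the large environment in which the promise would be evaluated.
big_env <- new.env(parent = baseenv())
big_env$payload <- numeric(1e5)  # roughly 800 kB of doubles

# Stand-in for the srcfile environment; `lines` is a promise tied to big_env.
holder <- new.env(parent = baseenv())
delayedAssign("lines", paste("line", 1:3),
              eval.env = big_env, assign.env = holder)

# Serializing the unevaluated promise drags its evaluation environment along.
size_before <- length(serialize(holder, NULL))

# Force the promise, as printing a function would do for the real $lines.
invisible(holder$lines)

# Once forced, the promise no longer references big_env.
size_after <- length(serialize(holder, NULL))

print(c(before = size_before, after = size_after))
```

In this sketch the serialized size should drop once the promise has been forced, which mirrors the change in pryr::object_size() seen in the thread.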
The offender here is the envenv entry of the environment of the $lines promise:
(I don't know how to inspect the promise's environment with base R, and even pryr seems to be a bit convoluted, since it can only inspect symbols, not members of environments?)
prominfo <- evalq(pi(lines), list(pi = pryr::promise_info), attr(attr(x$train_task$help, "srcref"), "srcfile")$original)
prom_env <- prominfo$env
names(prom_env$envenv)
#> [1] "env::150" "env::151" "env::152" "env::10" "env::157" "env::13"
#> .......
It appears to contain lots of environments. Maybe they are all environments that can be accessed from within mlr3, or maybe they are all the environments loaded in total? In the latter case, the influence of loading other packages would be obvious; in the former, the influence would be indirect, since loading other packages makes the dictionaries, like mlr_learners, bigger.
Interestingly, printing a single function from within x makes the whole object smaller, since the "srcfile" attribute is an environment that is shared between all functions in the R6 object (and all other functions loaded from the package, or loaded from the same RDS file). There is only a single $lines promise that can be triggered.
x = readRDS(pth)
y = readRDS(pth)
pryr::object_size(x)
#> 4.00 MB
pryr::object_size(y)
#> 4.00 MB
x$train_task$help
#> function() {
#> open_help(self$man)
#> }
#> <environment: 0x563addc97780>
pryr::object_size(x)
#> 1.08 MB
pryr::object_size(y)
#> 4.00 MB
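The sharing of the "srcfile" environment can also be checked in base R. The following is a small sketch with made-up functions `f` and `g` parsed from the same text; it does not involve mlr3 or R6, and only assumes that both functions come from one `parse()` call with `keep.source = TRUE`.

```r
# Two functions defined in the same (hypothetical) source text.
code <- "f <- function(x) {\n  x + 1\n}\ng <- function(x) {\n  x * 2\n}"
exprs <- parse(text = code, keep.source = TRUE)

# Evaluate both definitions into a fresh environment.
env <- new.env()
for (e in exprs) eval(e, env)

# Each function carries its own srcref, but the srcfile behind it is shared.
sf_f <- attr(attr(env$f, "srcref"), "srcfile")
sf_g <- attr(attr(env$g, "srcref"), "srcfile")

same_srcfile <- identical(sf_f, sf_g)
print(same_srcfile)
```

Because environments are compared by reference, `identical()` here checks that both functions point to the very same srcfile object, so anything stored in it exists only once.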
It is also possible to trigger the lines promise before saving, which removes the offending environment and makes all objects the same size again. The following gives 1.08 MB for me, with and without the library(mlr3verse) line.
library(mlr3verse)
#> Loading required package: mlr3
library(mlr3)
task = tsk("iris")
learner = lrn("classif.rpart")
learner$train(task)
pth = tempfile(fileext = ".rds")
learner$state$train_task$help
#> function() {
#> open_help(self$man)
#> }
#> <environment: 0x563addc97780>
saveRDS(learner$state, pth)
x = readRDS(pth)
pryr::object_size(x)
#> 1.08 MB
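A different way to avoid carrying source references around is to drop them from a function altogether. The sketch below uses `utils::removeSource()` on a made-up function `f`; it is only an illustration of that base R facility and is not necessarily what mlr3misc or mlr3 do.

```r
# Hypothetical function parsed with source references kept.
f <- eval(parse(text = "function(x) {\n  x + 1\n}", keep.source = TRUE))
has_srcref_before <- !is.null(attr(f, "srcref"))

# removeSource() returns a copy of the function without srcref information.
g <- utils::removeSource(f)
has_srcref_after <- !is.null(attr(g, "srcref"))

print(c(before = has_srcref_before, after = has_srcref_after))

# The function itself still behaves the same.
result <- g(1)
print(result)
```

The trade-off is that printing `g` shows deparsed code instead of the original source text, including the loss of comments.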
When installing packages with --with-keep.source, serialized objects were gigantic (and dependent on the loaded packages), see this issue: https://github.com/mlr-org/mlr3misc/issues/88
I tested that when installing mlr3 with --with-keep.source and this version of mlr3misc, the problem disappears. This also caused the failed workflows in the mlr3book.
@berndbischl @mllg @mb706