Closed missuse closed 4 years ago
It looks like we're running into https://github.com/r-lib/R6/issues/157.
https://github.com/d-sharpe/pickleR aims to tackle the serialize/unserialize memory explosion. However, it fails on mlr3 R6 objects.
More refs:
library(mlr3)
library(mlr3tuning)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
# install_github("d-sharpe/pickleR")
lgr::get_logger("mlr3")$set_threshold("warn")
gc_tsk = tsk("german_credit")
rngr = lrn("classif.ranger")
ps = ParamSet$new(
  list(
    ParamInt$new("mtry", lower = 1L, upper = 10L),
    ParamDbl$new("sample.fraction", lower = 0.5, upper = 1),
    ParamInt$new("num.trees", lower = 5L, upper = 200L)
  ))
cv5 = rsmp("cv", folds = 5)
cv4 = rsmp("cv", folds = 4)
at = AutoTuner$new(
  learner = rngr,
  resampling = cv4,
  measure = msr("classif.mcc"),
  search_space = ps,
  terminator = trm("evals", n_evals = 100),
  tuner = tnr("random_search"))
# takes a couple of minutes
rr = resample(task = gc_tsk,
  learner = at,
  resampling = cv5,
  store_models = TRUE)
lobstr::obj_size(rr)
#> 112,679,896 B
lobstr::mem_used()
#> 365,984,856 B
rr2 = pickleR::unpickle(pickleR::pickle(rr))
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
#> Error in object_space_env[[address]] <- values: wrong args for environment subassignment
lobstr::obj_size(rr2)
#> Error in list2(...): object 'rr2' not found
Created on 2020-04-06 by the reprex package (v0.3.0)
The problem is the learners:
# Ratio of the object size after a saveRDS()/readRDS() round trip to the size before
roundtripdiff = function(x) {
  path = tempfile()
  saveRDS(x, file = path)
  y = readRDS(path)
  file.remove(path)
  as.numeric(lobstr::obj_size(y)) / as.numeric(lobstr::obj_size(x))
}
task = tsk("mtcars")
learner = lrn("regr.lm")
rr = resample(task, learner, rsmp("bootstrap", repeats = 1000))
roundtripdiff(rr$data$learner)
I guess we have 3 options:
1) Try to reproduce this problem using only environments, and kindly ask the right person from R core to look into it.
2) Wait for something like pickleR to solve this for us.
3) Write our own save_mlr_obj() to store only the relevant data and read_mlr_obj() to restore a saved object.
Maybe 3) until 1) might happen because 2) might never happen?
Here is an example for environments:
ee = new.env()
for (i in 1:1000) {
  ee[[sprintf("f%04i", i)]] = stats::lm
}
roundtripdiff(ee)
lobstr::obj_addr(ee$f0001); lobstr::obj_addr(ee$f0002) # same address
ee2 = unserialize(serialize(ee, NULL))
lobstr::obj_addr(ee2$f0001); lobstr::obj_addr(ee2$f0002) # different addresses
Ok, there is already some kind of mechanism to deal with reference objects during serialization via the argument refhook in serialize()/unserialize(). Quoting from the comment in the C source file:
A mechanism is provided to allow special handling of non-system reference objects (all weak references and external pointers, and all environments other than package environments, namespace environments, and the global environment). The hook function consists of a function pointer and a data value. The serialization function pointer is called with the reference object and the data value as arguments. It should return R_NilValue for standard handling and an STRSXP for special handling. If an STRSXP is returned, then a special handing mark is written followed by the strings in the STRSXP (attributes are ignored). On unserializing, any specially marked entry causes a call to the hook function with the reconstructed STRSXP and data value as arguments. This should return the value to use for the reference object. A reasonable convention on how to use this mechanism is neded, but again the format should be compatible with any reasonable convention.
Eventually it may be useful to use these hooks to allow objects with a class to have a class-specific serialization mechanism. The serialization format should support this. It is trickier than in Java and other reference based languages where creation and initialization can be separated--we don't really have that option at the R level.
After reading the docs and the C code, I'm still not sure how to use refhooks, and I'm also confused about how this argument is used without serialize being S3. Also, I'm quite surprised that some objects with reference semantics are properly handled, but others (like regular environments) are causing problems here. Maybe @kalibera can help out (or say that this won't be fixed in the foreseeable future so that we can start looking for a different solution).
I also wonder if it would be possible to serialize all R6 classes with the refhook argument in a generic way ... @wch ?
Note: There are 2 packages on CRAN which use refhooks: liquidSVM and rsdmx. Both packages provide custom functions to save and read objects.
The mechanism as implemented does not need use of S3, even though it was perhaps anticipated that S3 would work fine with that. The hooks just serialize objects in a custom way; the mechanism is used, for instance, in lazy loading (source code in base R), but perhaps some packages could have simpler examples of use. Environments are special because they have referential semantics (not value semantics), which we want to be able to preserve. See also ?serialize.
Got this. But I don't get what we are supposed to do now with our objects (which are basically nested environments). These get serialized either by the user calling save()/saveRDS() or during parallelization, and we would like to keep the reference semantics because otherwise everything blows up and becomes unusable. Am I missing something?
The hook is not mandatory for useful serialization of environments: R serializes environments with its default algorithm, and this algorithm preserves identity by reference within a single serialization stream (in the code you can see a reference table, REFSXP, OutRefIndex, etc.). But if you have multiple streams, say multiple rds files, and want to ensure referential semantics across them, you could achieve that via a hook. In principle, a hook would do some custom serialization into a store with a string key, would have a way to map its input to that key (e.g. environment address, or some element of the environment), and would return the key to R when serializing. The lazy loading database is formed by many serialization streams (inside the same file), where identity of environments across those streams is achieved this way using hooks (instead of environments, their unique string identifiers are serialized, and the environment contents are saved elsewhere).
Looking at earlier messages in this thread: if you save an environment and then load it in the same R session, where the original environment still exists, you will get a copy of the original environment. If you wanted to re-use the in-memory environment, you could also do it in hooks: you could have some kind of uuid inside each of your environments and an in-memory hashmap mapping uuids to environments. The hook would return the uuid on serialization and save the environment externally if it has not been saved yet. On deserialization it would check for the uuid in the hashmap and, if present, return the in-memory version. If not, it would load a version from the external store (which can again be created using R serialization). Of course, you run into consistency questions when the environment changes its content after it has been serialized, etc.: that needs to be taken care of by application-specific means (probably by not mutating them). Intuitively, by serialization you are serializing a copy of the environment, because referential semantics cannot be enforced on mutation (changing the in-memory object will not update the serialized data and vice versa).
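A minimal sketch of the in-memory variant described above, using only the documented refhook argument of serialize()/unserialize(). The registry environment, the id field and the two hook functions are hypothetical names chosen for illustration; the external store is left out, so the sketch only covers the case where the original environment still exists in the session.

```r
# Hypothetical in-memory map from string ids to environments
registry <- new.env()

shared <- new.env()
shared$id <- "shared-1"        # hypothetical id field stored inside the environment
shared$value <- 42
assign(shared$id, shared, envir = registry)

holder <- list(a = shared, b = shared)

# Called for every reference object; return a character vector for
# special handling, or NULL for R's standard handling.
out_hook <- function(ref) {
  if (is.environment(ref) && is.character(ref$id) &&
      exists(ref$id, envir = registry, inherits = FALSE)) {
    ref$id
  } else {
    NULL
  }
}

# Called with the stored character vector; returns the object to use.
in_hook <- function(key) get(key, envir = registry, inherits = FALSE)

bytes <- serialize(holder, NULL, refhook = out_hook)
restored <- unserialize(bytes, refhook = in_hook)

identical(restored$a, shared)                      # TRUE: the in-memory environment is re-used
identical(unserialize(serialize(holder, NULL))$a,  # FALSE: the default round trip
          shared)                                  # creates a copy
```

As noted above, nothing in this sketch keeps the serialized bytes consistent with later mutations of the environment; that remains the application's responsibility.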
Before we dive even deeper into details or potential solutions: could someone please summarize the general problem compactly?
In a nutshell: during serialization, the reference to in-memory objects is lost. This first results in large files, and then in large objects after unserialization. Unfortunately, this is also observable if you save the same object multiple times in the same RDS file (which we do a lot, e.g. all methods of R6 objects get duplicated).
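A small sketch of what this looks like in practice, assuming the lobstr package is installed. The helper n_distinct() is a hypothetical name. It contrasts a plain vector that is referenced twice with an environment that is referenced twice, all within a single serialization stream.

```r
library(lobstr)

# Hypothetical helper: number of distinct memory addresses among list elements
n_distinct <- function(l) length(unique(obj_addrs(l)))

v <- runif(1e5)     # ordinary vector (value semantics)
e <- new.env()      # environment (reference semantics)

x <- list(v, v, e, e)
y <- unserialize(serialize(x, NULL))

n_distinct(x[1:2])          # 1: both slots point at the same vector in memory
n_distinct(y[1:2])          # 2: after the round trip each slot has its own copy
identical(y[[3]], y[[4]])   # TRUE: the environment is still shared within y
identical(y[[3]], e)        # FALSE: but it is a copy of the original environment
```

So within one stream the size blow-up comes from shared non-environment objects such as vectors and functions, while environments referenced from several places are restored as a single (copied) environment.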
I still don't precisely understand this: "the reference to in-memory objects is lost." Can you please clarify and also include an example? Does this mean that if we store object A, and A references object B from multiple places, but B is a singleton in memory, B is copied multiple times? Note that if your answer to the above is yes, I think this results not only in a size problem, but it also breaks the validity of A completely. Because if B is mutable, the user can now create a situation where A references multiple variants of B with different states.
Can we please very quickly and very precisely and carefully clear this up here? This begins to sound very worrisome.
This is an upstream issue of R6 in general (or the way R6 handles environments) - there might not be a quick or easy solution to this. Let's see, maybe Michel can do some magic.
Bernd: Michel already posted examples in previous comments.
This is an upstream issue of R6 in general (or the way R6 handles environments)
I understood that. and I have read the complete thread, with most of the links recursively. my questions still (!) stand. should we do a call about that?
the point is: if we (and other R6 projects) cannot properly serialize (and it currently sounds like this? but see my non-full-understanding above!) we are screwed.
I don't know the details of how objects are structured in mlr3, but here's the crux of the issue when serializing multiple R6 objects (from https://github.com/r-lib/R6/issues/157#issuecomment-559240276):
x <- lapply(1:1000, function(i) {
  function() i
})
object_size(x)
#> 355 kB
x_copy <- unserialize(serialize(x, version = 3, connection = NULL))
object_size(x_copy)
#> 2.07 MB
Each of the functions in the list has the same body and formals, but a different environment. In the original list, each function has a reference to the same body and formals object in memory (though each will have its own unique environment), but in x_copy, each function will refer to a separate body and formals object in memory.
You can see this by comparing the output of these two:
.Internal(inspect(x[[2]]))
.Internal(inspect(x[[3]]))
and these two:
.Internal(inspect(x_copy[[2]]))
.Internal(inspect(x_copy[[3]]))
Notes:
I used [[2]] and [[3]] instead of [[1]] because R's byte code compiler does not compile the first one, but it does compile the later ones, and so the content of [[1]] is different.
If you are willing to write C/C++ code to deal with this, you could traverse the objects and look for duplicate references before serializing. When you see duplicates, you'll have to do something clever before writing, and after restoring, to de-duplicate the shared references. See https://github.com/r-lib/lobstr/blob/master/src/size.cpp for an example of C++ code that traverses objects and looks at memory addresses.
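For a flat list, the de-duplication idea can be sketched in plain R rather than C/C++, assuming lobstr is installed. The helpers dedup() and redup() are hypothetical names and not part of any package; they only look at top-level list elements, ignore names, and do not traverse nested objects, so this is an illustration of the principle rather than a general solution.

```r
library(lobstr)

# Hypothetical helper: keep each distinct top-level element once,
# plus an index that records which slot used which element.
dedup <- function(xs) {
  addrs <- obj_addrs(xs)
  keep <- !duplicated(addrs)
  list(unique = xs[keep], index = match(addrs, addrs[keep]))
}

# Hypothetical helper: rebuild the list so that repeated slots
# share one object in memory again.
redup <- function(d) d$unique[d$index]

x <- rep(list(stats::lm), 1000)   # 1000 references to the same function

naive   <- unserialize(serialize(x, NULL))
compact <- redup(unserialize(serialize(dedup(x), NULL)))

length(unique(obj_addrs(naive)))     # 1000: every slot got its own copy
length(unique(obj_addrs(compact)))   # 1: sharing is restored
length(serialize(dedup(x), NULL)) < length(serialize(x, NULL))   # TRUE: smaller stream
```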
I don't know the details of how objects are structured in mlr3
nearly every object in mlr3 is an R6 object, with the "usual OO" design. by that I mean that a class can, of course, use composition as a design pattern. which means that many of our classes have member variables which reference other R6 objects. in many cases (not always), these member vars point to objects which exist as a singleton in memory.
so a situation like this: Class B { contains ref to A. }
We construct object "a" from class A - once. Objects of class B are constructed 10 times. Often like this b1 = B$new(a); b2 = B$new(a).
(edited due to stupid typos)
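The situation described above can be sketched as follows, assuming the R6 package is installed. The classes A and B and the field state are hypothetical stand-ins, not mlr3 classes. Because R6 objects are environments, and R preserves environment identity within a single serialization stream, both restored B objects still share one A.

```r
library(R6)

# Hypothetical classes mirroring the description: B holds a reference to an A
A <- R6Class("A", public = list(state = 0))
B <- R6Class("B", public = list(
  a = NULL,
  initialize = function(a) {
    self$a <- a
  }
))

a  <- A$new()
b1 <- B$new(a)
b2 <- B$new(a)

# One stream containing both B objects
restored <- unserialize(serialize(list(b1, b2), NULL))

identical(restored[[1]]$a, restored[[2]]$a)   # TRUE: still one shared A
identical(restored[[1]]$a, a)                 # FALSE: a copy of the original A

restored[[1]]$a$state <- 1
restored[[2]]$a$state                         # 1: the change is visible through both
```

If b1 and b2 were instead written to two separate streams (for example two RDS files), each restored object would carry its own copy of A.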
This problem should be solved for ResampleResult and BenchmarkResult. Still need an optimization for mlr3tuning, but we are getting there soon.
Optimizations for tuning and feature selection now implemented.
I did some tests and it appears to work great. Thanks @mllg.
Thanks for the feedback, closing here.
Hi there (@mllg?), I'm still running into this issue (and it looks like I'm not the only one https://stackoverflow.com/questions/76366910/how-to-save-mlr3-resample-object-results-to-disk).
Was there some regression? Or did I not follow the conversation correctly and this was never fixed?
@hbsmith Do you run mlr3 in renv?
Sorry, I'm rather new to R--I'm running in a VSCode jupyter notebook. I'm not using renv as far as I know.
Can you run attributes(mlr3::benchmark_grid) and post the output here please?
also packageVersion("mlr3misc") would be useful
When larger resample results (in the tens of MB) are saved using saveRDS, the process is extremely slow:
In a clean session:
For comparison:
When a similar example is run with xgboost with more trees, the RAM usage is huge (~25 - 30 GB). Saving a resample result of 250 MB (on disk) takes about an hour.
The issue is not present when at$store_tuning_instance = FALSE is set; then the resample results are much smaller.