
saveRDS is extremely slow when saving resample results, reading these files is RAM intensive #482

Closed. missuse closed this issue 4 years ago.

missuse commented 4 years ago

When larger resample results (tens of MB) are saved using saveRDS, the process is extremely slow:

In a clean session:

memory.size()
[1] 81.85
library(mlr3)
library(mlr3tuning)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
gc_tsk = tsk("german_credit")

rngr = lrn("classif.ranger")

ps = ParamSet$new(
  list(
    ParamInt$new("mtry", lower = 1L, upper = 10L),
    ParamDbl$new("sample.fraction", lower = 0.5, upper = 1),
    ParamInt$new("num.trees", lower = 5L, upper = 200L)
  ))

cv5 = rsmp("cv", folds = 5)
cv4 = rsmp("cv", folds = 4)

at = AutoTuner$new(
  learner = rngr,
  resampling = cv4,
  measures = msr("classif.mcc"),
  tune_ps = ps,
  terminator = term("evals", n_evals = 100),
  tuner = tnr("random_search"))

#takes a couple of minutes
rr = resample(task = gc_tsk,
              learner = at,
              resampling = cv5,
              store_models = TRUE)
memory.size()
[1] 565.76
a = Sys.time()
saveRDS(rr, "rr.rds") #size on disk 35 mb
Sys.time() - a
Time difference of 1.612301 mins
a = Sys.time()
rr2 = readRDS("rr.rds")
Sys.time() - a
Time difference of 17.02873 secs 
memory.size()
[1] 4354.04
rr3 = readRDS("rr.rds")
memory.size()
[1] 8327.77

For comparison:

m = matrix(rnorm(1e7), 10000, 1000)

a = Sys.time()
saveRDS(m, "m.rds") #size on disk 75 mb
Sys.time() - a
Time difference of 2.732558 secs
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] paradox_0.1.0-9000       mlr3learners_0.1.6-9000  mlr3pipelines_0.1.2.9000 mlr3tuning_0.1.2-9000   
[5] mlr3_0.1.8              

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3              lgr_0.3.4               lattice_0.20-38         mlr3misc_0.1.8         
 [5] digest_0.6.25           crayon_1.3.4            withr_2.1.2             grid_3.6.1             
 [9] ranger_0.11.2           R6_2.4.1                mlr3measures_0.1.2-9000 backports_1.1.6        
[13] uuid_0.1-4              data.table_1.12.8       rstudioapi_0.10         Matrix_1.2-17          
[17] rpart_4.1-15            checkmate_2.0.0         tools_3.6.1             compiler_3.6.1

When a similar example is run with xgboost with more trees, the RAM usage is huge (~25-30 GB), and saving a 250 MB resample result (size on disk) takes about an hour.

The issue is not present when setting

at$store_tuning_instance = FALSE

In that case the resample results are much smaller.

pat-s commented 4 years ago

It looks like we're running into https://github.com/r-lib/R6/issues/157.

https://github.com/d-sharpe/pickleR aims to tackle the serialize/unserialize memory explosion. However, it fails on mlr3 R6 objects.

More refs:

library(mlr3)
library(mlr3tuning)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
# install_github("d-sharpe/pickleR")
lgr::get_logger("mlr3")$set_threshold("warn")

gc_tsk = tsk("german_credit")

rngr = lrn("classif.ranger")

ps = ParamSet$new(
  list(
    ParamInt$new("mtry", lower = 1L, upper = 10L),
    ParamDbl$new("sample.fraction", lower = 0.5, upper = 1),
    ParamInt$new("num.trees", lower = 5L, upper = 200L)
  ))

cv5 = rsmp("cv", folds = 5)
cv4 = rsmp("cv", folds = 4)

at = AutoTuner$new(
  learner = rngr,
  resampling = cv4,
  measure = msr("classif.mcc"),
  search_space = ps,
  terminator = trm("evals", n_evals = 100),
  tuner = tnr("random_search"))

#takes a couple of minutes
rr = resample(task = gc_tsk,
              learner = at,
              resampling = cv5,
              store_models = TRUE)

lobstr::obj_size(rr)
#> 112,679,896 B
lobstr::mem_used()
#> 365,984,856 B

rr2 = pickleR::unpickle(pickleR::pickle(rr))
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> Error in object_space_env[[address]] <- values: wrong args for environment subassignment
lobstr::obj_size(rr2)
#> Error in list2(...): object 'rr2' not found

Created on 2020-04-06 by the reprex package (v0.3.0)

mllg commented 4 years ago

The problem is the learners:

# size of an object after a saveRDS()/readRDS() roundtrip, relative to its
# original size; values well above 1 indicate duplication on deserialization
roundtripdiff = function(x) {
  path = tempfile()
  saveRDS(x, file = path)
  y = readRDS(path)
  file.remove(path)

  as.numeric(lobstr::obj_size(y)) / as.numeric(lobstr::obj_size(x))
}

task = tsk("mtcars")
learner = lrn("regr.lm")
rr = resample(task, learner, rsmp("bootstrap", repeats = 1000))
roundtripdiff(rr$data$learner)

I guess we have 3 options:

1) Try to reproduce this problem using only environments, and kindly ask the right person from R core to look into it.
2) Wait for something like pickleR to solve this for us.
3) Write our own save_mlr_obj() to store only the relevant data and read_mlr_obj() to restore a saved object (sketched below).
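
To make option 3) concrete, here is a rough sketch with a toy class; save_mlr_obj() and read_mlr_obj() are the hypothetical names from the list above, not existing functions:

library(R6)

# toy stand-in for an mlr3 object: plain data wrapped in an R6 object
Wrapper = R6Class("Wrapper", public = list(
  model = NULL,
  initialize = function(model = NULL) {
    self$model = model
  }
))

save_mlr_obj = function(obj, path) {
  saveRDS(obj$model, path)           # persist only the relevant plain data
}

read_mlr_obj = function(path) {
  Wrapper$new(model = readRDS(path)) # rebuild the R6 shell on load
}

w = Wrapper$new(model = lm(mpg ~ ., data = mtcars))
path = tempfile(fileext = ".rds")
save_mlr_obj(w, path)
w2 = read_mlr_obj(path)

The point of this pattern is that the R6 environments (and their duplicated methods) never hit the serializer; only the payload is written to disk.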

pat-s commented 4 years ago

Maybe 3) until 1) happens, because 2) might never happen?

mllg commented 4 years ago

Here is an example for environments:

ee = new.env()
for (i in 1:1000) {
  ee[[sprintf("f%04i", i)]] = stats::lm
}

roundtripdiff(ee)

lobstr::obj_addr(ee$f0001); lobstr::obj_addr(ee$f0002) # same address

ee2 = unserialize(serialize(ee, NULL))
lobstr::obj_addr(ee2$f0001); lobstr::obj_addr(ee2$f0002) # different addresses

mllg commented 4 years ago

Ok, there is already some kind of mechanism to deal with reference objects during serialization, via the refhook argument of serialize()/unserialize(). Quoting from the comment in the C source file:

A mechanism is provided to allow special handling of non-system reference objects (all weak references and external pointers, and all environments other than package environments, namespace environments, and the global environment). The hook function consists of a function pointer and a data value. The serialization function pointer is called with the reference object and the data value as arguments. It should return R_NilValue for standard handling and an STRSXP for special handling. If an STRSXP is returned, then a special handling mark is written followed by the strings in the STRSXP (attributes are ignored). On unserializing, any specially marked entry causes a call to the hook function with the reconstructed STRSXP and data value as arguments. This should return the value to use for the reference object. A reasonable convention on how to use this mechanism is needed, but again the format should be compatible with any reasonable convention.

Eventually it may be useful to use these hooks to allow objects with a class to have a class-specific serialization mechanism. The serialization format should support this. It is trickier than in Java and other reference based languages where creation and initialization can be separated--we don't really have that option at the R level.

After reading the docs and the C code, I'm still not sure how to use refhooks, and I'm also confused about how this argument is used without serialize being S3. Also, I'm quite surprised that some objects with reference semantics are handled properly while others (like regular environments) cause problems here. Maybe @kalibera can help out (or tell us that this won't be fixed in the foreseeable future, so that we can start looking for a different solution).

I also wonder if it would be possible to serialize all R6 classes with the refhook argument in a generic way ... @wch ?

mllg commented 4 years ago

Note: There are 2 packages on CRAN which use refhooks: liquidSVM and rsdmx. Both packages provide custom functions to save and read objects.

kalibera commented 4 years ago

The mechanism as implemented does not require the use of S3, even though it was perhaps anticipated that S3 would work fine with it. The hooks just serialize objects in a custom way; this is used, for instance, in lazy loading (source code in base R), but perhaps some packages have simpler examples of use. Environments are special because they have referential semantics (not value semantics), which we want to be able to preserve. See also ?serialize.

mllg commented 4 years ago

The mechanism as implemented does not require the use of S3, even though it was perhaps anticipated that S3 would work fine with it. The hooks just serialize objects in a custom way; this is used, for instance, in lazy loading (source code in base R), but perhaps some packages have simpler examples of use. Environments are special because they have referential semantics (not value semantics), which we want to be able to preserve. See also ?serialize.

Got this. But I don't get what we are supposed to do now with our objects (which are basically nested environments). These get serialized either by the user calling save()/saveRDS() or during parallelization, and we would like to keep the reference semantics because otherwise everything blows up and becomes unusable. Am I missing something?

kalibera commented 4 years ago

The hook is not mandatory for useful serialization of environments; R will serialize environments by its default algorithm, and this algorithm preserves identity by reference within a single serialization stream (in the code you can see a reference table, REFSXP, OutRefIndex, etc.). But if you have multiple streams, say multiple rds files, and wanted to ensure referential semantics across them, then you could achieve that via a hook. In principle a hook would do some custom serialization into a store with a string key, would have a way to map its input to that key (e.g. the environment address, or some element of the environment), and would return the key to R when serializing. The lazy loading database is formed by many serialization streams (inside the same file) where identity of environments across those streams is achieved this way using hooks (instead of environments, their unique string identifiers are serialized, and the environment contents are saved elsewhere).
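
For illustration (an example of mine, not from the thread): within a single stream, the default algorithm indeed keeps two occurrences of the same environment identical, although the restored environment is a copy of the original:

e = new.env()
e$x = 1

l = unserialize(serialize(list(e, e), NULL))
identical(l[[1]], l[[2]]) # TRUE: still a single environment within the stream
identical(l[[1]], e)      # FALSE: but it is a new copy of the original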

kalibera commented 4 years ago

Looking at earlier messages in this thread: if you save an environment and then load it in the same R session, where the original environment still exists, you will get a copy of the original environment. If you wanted to re-use the in-memory environment, you could also do that in hooks: you could have some kind of uuid inside each of your environments and an in-memory hashmap mapping uuids to environments. The hook would return the uuid on serialization and save the environment externally if not already saved. On deserialization it would check for the uuid in the hashmap and, if present, return the in-memory version. If not, it would load a version from the external store (which can again be created using R serialization). Of course, you run into consistency questions when the environment changes its content after it has been serialized, etc.; that needs to be taken care of by application-specific means (probably by not mutating them). Intuitively, by serialization you are serializing a copy of the environment, because referential semantics cannot be enforced on mutation (changing in memory will not update the serialized data and vice versa).
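
A minimal sketch of this uuid/hashmap idea (env_store, ser_hook and unser_hook are illustrative names; the environment address stands in for a uuid, and the external store is omitted):

env_store = new.env(parent = emptyenv()) # in-memory map: key -> live environment

ser_hook = function(x) {
  if (!is.environment(x)) return(NULL)   # NULL = fall back to default handling
  key = lobstr::obj_addr(x)              # the address serves as the key here
  assign(key, x, envir = env_store)      # remember the live environment
  key                                    # returning a string marks it "special"
}

unser_hook = function(key) get(key, envir = env_store) # resolve key in memory

ee = new.env()
ee$payload = rnorm(1e6)

raw = serialize(list(ee, ee), connection = NULL, refhook = ser_hook)
back = unserialize(raw, refhook = unser_hook)
identical(back[[1]], ee) # TRUE: the in-memory environment is reused
length(raw)              # tiny: only the key was written, not the contents

As described above, the keys only survive the session if env_store itself is persisted separately.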

berndbischl commented 4 years ago

Because we are diving ever deeper into details and potential solutions: could someone please summarize the general problem compactly?

mllg commented 4 years ago

Because we are diving ever deeper into details and potential solutions: could someone please summarize the general problem compactly?

In a nutshell: during serialization, the reference to in-memory objects is lost. This first results in large files, and then in large objects after un-serialization. Unfortunately, this is also observable if you save the same object multiple times within the same RDS file (which we do a lot, e.g. all methods of R6 objects get duplicated).
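
A minimal illustration of this duplication with a shared numeric vector instead of R6 methods (an example of mine; the mechanism is the same):

big = rnorm(1e6)
both = list(a = big, b = big)
lobstr::obj_size(both)  # ~8 MB: one shared vector in memory

both2 = unserialize(serialize(both, NULL))
lobstr::obj_size(both2) # ~16 MB: the roundtrip produced two independent copies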

berndbischl commented 4 years ago

In a nutshell: during serialization, the reference to in-memory objects is lost. This first results in large files, and then in large objects after un-serialization. Unfortunately, this is also observable if you save the same object multiple times within the same RDS file (which we do a lot, e.g. all methods of R6 objects get duplicated).

I still don't precisely understand this: "the reference to in-memory objects is lost." Can you please clarify and include an example? Does this mean that if we store object A, and A references object B from multiple places, but B is a singleton in memory, then B is copied multiple times? Note that if the answer is yes, I think this not only results in a size problem, but also breaks the validity of A completely. Because if B is mutable, the user can now create a situation where A references multiple variants of B with different states.

Can we please clear this up here very quickly, precisely and carefully? This is beginning to sound very worrisome.

pat-s commented 4 years ago

This is an upstream issue of R6 in general (or of the way R6 handles environments); there might not be a quick or easy solution to this. Let's see, maybe Michel can do some magic.

Bernd: Michel already posted examples in previous comments.

berndbischl commented 4 years ago

This is an upstream issue of R6 in general (or of the way R6 handles environments)

I understood that, and I have read the complete thread, with most of the links recursively. My questions still (!) stand. Should we do a call about this?

berndbischl commented 4 years ago

The point is: if we (and other R6 projects) cannot serialize properly (and it currently sounds like we can't? but see my incomplete understanding above!), we are screwed.

wch commented 4 years ago

I don't know the details of how objects are structured in mlr3, but here's the crux of the issue when serializing multiple R6 objects (from https://github.com/r-lib/R6/issues/157#issuecomment-559240276):

x <- lapply(1:1000, function(i) {
  function() i
})

object_size(x)
#> 355 kB

x_copy <- unserialize(serialize(x, version = 3, connection = NULL))
object_size(x_copy)
#> 2.07 MB

Each of the functions in the list has the same body and formals, but a different environment. In the original list, each function has a reference to the same body and formals object in memory (though each will have its own unique environment), but in x_copy, each function will refer to a separate body and formals object in memory.

You can see this by comparing the output of these two:

.Internal(inspect(x[[2]]))
.Internal(inspect(x[[3]]))

and these two:

.Internal(inspect(x_copy[[2]]))
.Internal(inspect(x_copy[[3]]))

Notes:

If you are willing to write C/C++ code to deal with this, you could traverse the objects and look for duplicate references before serializing. When you see duplicates, you'll have to do something clever before writing, and again after restoring, to de-duplicate the shared references. See https://github.com/r-lib/lobstr/blob/master/src/size.cpp for an example of C++ code that traverses objects and looks at memory addresses.
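
Short of C/C++, the sharing can also be made visible at the R level, because lobstr::obj_size() counts shared components only once (this reuses x and x_copy from the example above):

lobstr::obj_size(x[[1]])                   # one closure on its own
lobstr::obj_size(x[[1]], x[[2]])           # barely more: body/formals are shared
lobstr::obj_size(x_copy[[1]], x_copy[[2]]) # roughly double: nothing shared anymore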

berndbischl commented 4 years ago

I don't know the details of how objects are structured in mlr3

Nearly every object in mlr3 is an R6 object, with the "usual OO" design. By that I mean that a class can, of course, use composition as a design pattern, which means many of our classes have member variables which reference other R6 objects. In many cases (not always), these member variables point to objects which exist as singletons in memory.

So, a situation like this: Class B { contains ref to A. }

We construct object "a" of class A once. Objects of class B are constructed 10 times, often like this: b1 = B$new(a); b2 = B$new(a).

(edited due to stupid typos)
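
A minimal R6 sketch of this pattern (class and field names only mirror the description above):

library(R6)

A = R6Class("A", public = list(
  x = NULL,
  initialize = function() {
    self$x = rnorm(1e6)  # some heavy payload
  }
))

B = R6Class("B", public = list(
  a = NULL,
  initialize = function(a) {
    self$a = a           # composition: B holds a reference to A
  }
))

a = A$new()
b1 = B$new(a)
b2 = B$new(a)
lobstr::obj_addr(b1$a) # same address as the next line: a is a singleton in memory
lobstr::obj_addr(b2$a)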

mllg commented 4 years ago

This problem should now be solved for ResampleResult and BenchmarkResult. We still need an optimization for mlr3tuning, but we are getting there soon.

mllg commented 4 years ago

Optimizations for tuning and feature selection now implemented.

missuse commented 4 years ago

I did some tests and it appears to work great. Thanks @mllg.

mllg commented 4 years ago

I did some tests and it appears to work great. Thanks @mllg.

Thanks for the feedback, closing here.

hbsmith commented 1 year ago

Hi there (@mllg?), I'm still running into this issue (and it looks like I'm not the only one: https://stackoverflow.com/questions/76366910/how-to-save-mlr3-resample-object-results-to-disk).

Was there some regression? Or did I not follow the conversation correctly and this was never fixed?

be-marc commented 1 year ago

@hbsmith Do you run mlr3 in renv?

hbsmith commented 1 year ago

Sorry, I'm rather new to R; I'm running in a VSCode Jupyter notebook. I'm not using renv as far as I know.

sebffischer commented 1 year ago

Can you run attributes(mlr3::benchmark_grid) and post the output here please?

sebffischer commented 1 year ago

Also, the output of packageVersion("mlr3misc") would be useful.