grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/

Avoid making extra copies of the forest when serializing to and from R. #187

Open jtibshirani opened 6 years ago

jtibshirani commented 6 years ago

After C++ training completes, we serialize the contents of the forest to a byte stream, and pass this up to R. This serialized forest is then passed back to subsequent C++ functions for prediction, analysis, etc.

We currently make an extra copy of the forest during serialization. It's not straightforward to remove this extra copy because of a limitation in R/Rcpp's support for variable-sized byte streams. In particular, Rcpp raw vectors need to be presized before they are filled, so we first need to calculate the size of the serialized payload, then copy over the data (https://github.com/swager/grf/blob/master/r-package/grf/bindings/RcppUtilities.cpp#L29).
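For context, a rough sketch of the pattern described above (the commented-out serializer call stands in for grf's actual serialization code; the function name is made up for illustration): the payload is first materialized in a C++ buffer so its size is known, and only then copied into a presized Rcpp::RawVector.

```cpp
#include <Rcpp.h>
#include <algorithm>
#include <sstream>
#include <string>

// Sketch only: the serializer call is a stand-in for grf's real one.
// [[Rcpp::export]]
Rcpp::RawVector serialize_to_r_sketch() {
  std::stringstream stream;
  // serialize_forest(stream, forest);      // writes the forest into a C++-side buffer

  std::string contents = stream.str();      // materialize the payload to learn its size
  Rcpp::RawVector result(contents.size());  // Rcpp raw vectors must be presized
  std::copy(contents.begin(), contents.end(),
            result.begin());                // extra copy, into R-managed memory
  return result;
}
```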

Looking at performance profiles, these extra copies cause total memory usage to spike when training has completed, and results are being piped up into R. This is likely a big contributor to the memory issues observed in #179. We should take a closer look here to see if there are any optimizations or compromises we can make.

jtibshirani commented 5 years ago

We've already made several improvements around memory usage.

This means that for some common workflows like "train a causal forest, then calculate the average treatment effect", there will be no 2x memory spike at all. But for any operations that involve passing the R forest back to C++ (such as predicting on a new test set, calculating split frequencies, etc.), we still have memory issues.

There are two additional changes we could make to address this problem.

ginward commented 5 years ago

@jtibshirani Are there any ad hoc ways to reduce memory usage? For example, reducing the number of trees, restricting the tree depth, or reducing the size of the sub-sample used to train each tree? I currently have a very large dataset (5GB) without a lot of variables, but my MacBook Pro is having a difficult time training the forest with the default settings. Just wondering if there is any workaround before I transfer my data to the cloud and do my analysis there.

susanathey commented 5 years ago

@ginward see the discussion in this issue: https://github.com/grf-labs/grf/issues/272 and the union of forests feature.

ginward commented 5 years ago

@susanathey Thanks for your kind reply. It does look like merging forests makes a lot of sense. Just to see if I understand it correctly: if I don't have enough memory to train 10,000 forests, for example, I can split the task into 100 forests per machine, train them on 100 machines in an HPC cluster, and then merge them into one forest. Is my understanding correct?

Thanks,

Jinhua

swager commented 5 years ago

Making the sub-sample size small should help with the memory footprint (since the memory required to store each tree scales with the number of "inbag" observations).

ginward commented 5 years ago

@swager Is there a parameter in the causal_forest function that sets the sub-sample size? Is it the ci.group.size parameter that you are referring to?

ci.group.size: The forest will grow ci.group.size trees on each subsample. In order to provide confidence intervals, ci.group.size must be at least 2. Default is 2.

swager commented 5 years ago

It's sample.fraction.

ginward commented 5 years ago

@swager Thanks! If I reduce sample.fraction, will it cause larger standard errors (i.e., less confidence in the estimates)?

swager commented 5 years ago

In general, using a smaller sample.fraction should reduce the variance of the forest. This is because reducing the sample fraction increases the implicit "bandwidth" of the forest kernel. The cost of using a smaller sample.fraction is potentially higher bias (again, as one would expect from increasing a bandwidth).

ginward commented 5 years ago

@swager Thanks. I see that the default value for sample.fraction is 0.5 - is there a reason why it is set to 0.5? If higher values of sample.fraction can reduce bias, wouldn't it be optimal to set it to a higher value (such as 0.9)? If I set it to 1, is it equivalent to a single causal tree trained on all the data?

swager commented 5 years ago

Yes that's right: setting sample.fraction to 1 uses all the data to train a single (honest) causal tree. The reason we set sample.fraction to 0.5 is that it's the largest value for which our bootstrap-based CIs work. (If you set sample.fraction larger, you can still make predictions, but can't get variance estimates for them.)

susanathey commented 5 years ago

Also @swager reminded me that the union of forests approach helps with computational time, but after our latest upgrades, doesn't help as much with memory as it used to, since ultimately you need to be able to keep the whole forest in memory, and our memory footprint is now proportional to the size of the forest.

erikcs commented 4 years ago

On de-serialization (currently mainly used for predicting on new X):

The Tree constructor mainly takes a collection of vector<int/double> arguments; in the Rcpp wrappers, these calls construct new vectors by copying elements over from the Rcpp data structure wrappers.

Idea (?) to avoid this copy step: in the wrapper, construct these vectors with a custom allocator which tells std::vector to use the memory pointed to by the underlying Rcpp pointer. This would be a read-only buffer that is not destroyed once deserialization is done; all the underlying forest data just stays where it was.

The constructor Tree(root_node, child_node_withCustomMemoryBuffer, ...) would then construct a Tree as usual, with root_node, child_node, etc. moved into the member slots. But when the Tree is destroyed together with its members, nothing will happen to the memory buffer, because the allocator will know that this is external storage managed by someone else (R).
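For illustration, a hypothetical sketch of the shape such an allocator could take (this was never implemented in grf, and the approach is abandoned further down the thread): allocation hands back the externally owned buffer, and deallocation is a no-op so the vector never frees R's memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch (not grf code): an allocator that hands std::vector an
// externally owned buffer (e.g. the memory behind an Rcpp vector) and never
// frees it, since R manages that memory.
template <typename T>
struct ExternalBufferAllocator {
  using value_type = T;

  T* external_buffer;

  explicit ExternalBufferAllocator(T* buffer) : external_buffer(buffer) {}

  T* allocate(std::size_t /*n*/) { return external_buffer; }  // reuse the external buffer
  void deallocate(T*, std::size_t) {}                         // no-op: R owns this memory
};

// Usage shape (sketch): a vector whose storage would be the external buffer.
// std::vector<double, ExternalBufferAllocator<double>> values(
//     ExternalBufferAllocator<double>(external_ptr));
```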

erikcs commented 4 years ago

The logic above is regarding R to C++. If Rcpp supported move semantics into its containers, the pass-through from C++ to R could possibly be quicker as well (instead of copying from a Tree's std::vector into an Rcpp::NumericVector, it could be moved).
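To make that copy concrete, here is a simplified sketch (not grf's actual binding code) of what the current C++-to-R hand-off amounts to; since Rcpp vectors are backed by R-allocated memory, the elements have to be copied rather than moved.

```cpp
#include <Rcpp.h>
#include <vector>

// Simplified sketch: every element is copied from the C++ vector into a
// freshly R-allocated Rcpp vector; the buffer itself cannot be handed over.
Rcpp::NumericVector copy_to_r(const std::vector<double>& values) {
  return Rcpp::NumericVector(values.begin(), values.end());  // element-wise copy
}
```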

erikcs commented 4 years ago

A sketch, worked out together with @davidahirshberg, of one workaround for the R/C++ copy of all the Tree vector members:

From the C++ side: create vectors with a new subclass of std::vector which gives us access to the pointer to the raw memory p. When it is time to pass data back to R: retrieve p, and set the old reference to null, so that p remains intact.

When creating a new R object, for example the array with sample ids:

sample.ids = struct { size_t size; data = p; refcount = 1; } (etc.)

I.e., transferring the raw data pointer from the C++ container to a new R object.

Need to make sure that: a) the compilers grf supports (gcc/clang) implement std::vector as an array list, with memory stored contiguously starting from a0: [a0, ...]; b) all the types are the same size (R/C types).

erikcs commented 4 years ago

I think we can scrap this memory plug-and-play attempt: Dirk Eddelbuettel says it is probably not possible. Even if it were possible, it would mess with R internals that could change from version to version and be very iffy.

This leaves the existing XPtr option on the table, where for serialization one could provide a custom grf.save function or a 'refhook' argument to saveRDS.
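As a rough sketch of that XPtr route (the Forest class and function names below are placeholders, not grf's real API): the trained forest stays in C++ memory and R only holds an opaque handle, so prediction does not re-serialize anything; the trade-off, as noted, is that such a handle cannot survive a plain saveRDS, hence the custom grf.save/refhook idea.

```cpp
#include <Rcpp.h>

// Placeholder for grf's trained forest object.
class Forest { /* tree data lives here, on the C++ side */ };

// [[Rcpp::export]]
SEXP train_forest_xptr() {
  // 'true' registers a finalizer so the Forest is deleted when R's handle is GC'd.
  Rcpp::XPtr<Forest> handle(new Forest(), true);
  return handle;  // R receives an opaque external pointer; no serialization
}

// [[Rcpp::export]]
Rcpp::NumericVector predict_forest_xptr(SEXP forest_handle, Rcpp::NumericMatrix test_x) {
  Rcpp::XPtr<Forest> forest(forest_handle);   // recover the C++ object without a copy
  // ... run prediction with *forest on test_x ...
  return Rcpp::NumericVector(test_x.nrow());  // placeholder predictions
}
```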

carlyls commented 2 years ago

Hi all! I am currently having this issue when using predict() on a causal_forest that I trained, where the memory spikes and stops my iterations unless I request a very high amount of memory. For now, I will continue requesting a lot of memory and moving forward with the simulations. Does anyone have thoughts on how much memory is necessary without being way too high? Thanks to everyone who has worked on this package!

erikcs commented 2 years ago

Hi @carlyls, if you are predicting on a new test sample, GRF needs to re-serialize the entire forest to pass it back to C++, so it'll require twice the memory (one copy for R, a new temporary copy for C++), which can be prohibitively costly on massive datasets.

The best you could do is avoid predicting on a new test set in the first place: you get OOB predictions for "free" during training by default (accessible through predict(forest)$predictions). Another option is to set tuning parameters that reduce the forest size. For example, set the subsampling fraction to something low, like sample.fraction = 0.1 (i.e., each tree will only "store" and "use" 10% of the data), with min.node.size adjusted accordingly, or reduce the number of trees, e.g. num.trees = 500, if CIs are not needed.

carlyls commented 2 years ago

Thanks so much for the helpful thoughts @erikcs !

Unfortunately I do have to predict on a new test set for what I am trying to do. I will definitely play around with parameters to see if that helps. One follow-up question: I have multiple forests that I am using, where I fit a few forests, each based on a different training set, and then apply each of them to predict on the same test set. Is the memory issue only happening when I am predicting on a test set, or do you think I need to make sure to only have one forest saved at a time? Meaning, rather than fitting a few forests, saving them, and then using them one at a time to predict on test sets, should I instead fit one forest at a time, use it to predict on a test set, and then remove it from memory?