oelhammouchi closed this issue 10 months ago
We are looking into this and have some ideas. However, can you first tell me how many variables (attributes) you have in your models? Also, is this a regression, classification, or survival data setting?
It's a survival model with ~15 variables.
The issue is the option save.memory=TRUE used by rfsrc.fast. This (and other option settings) was selected so that rfsrc.fast can be used for rapid training; however, the function is not necessarily optimized for prediction. Given that this is a big data set, I highly recommend that you work directly with rfsrc, which is, after all, what rfsrc.fast ultimately calls.
I trained and then tested a survival model with 1 million cases and p=20 variables using the parameter settings below. Training on my machine took about 1 minute, and prediction on a single test data point took about 3 seconds. Note in particular the option perf.type="none" used for training, which turns off all performance values. With this option you will no longer be able to tell how well your forest performs (OOB error rate and so forth), but it is needed for fast training. If you want to tune parameters, try reducing your data set sample size (say, to 200K) and tune on that prior to the call below.
ntree <- 100
nsplit <- 10
mysampsize <- function(x) {min(x * 0.632, max(150, x^(3/4)))}
mynodesize <- max(1, nrow(mydta) * 0.001)
ntime <- 50
o <- rfsrc(myf, mydta,
           perf.type = "none",
           save.memory = FALSE,
           ntree = ntree,
           nsplit = nsplit,
           sampsize = mysampsize,
           nodesize = mynodesize,
           ntime = ntime)$forest
p <- predict(o, mytestdata)$predicted
It worked! Thank you so very much, you have no idea how much this helped us! The only minor problem we're still left with is that the same speedup in prediction time is not observed with na.action="na.impute"; it still takes ~30 s per prediction. For now we've 'resolved' this by manually imputing the mean ourselves before training, but it would obviously be quite nice to use your sophisticated imputation algorithm. Is there an easy way of resolving this, do you think? In any case, thanks again for your help!
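For reference, the manual workaround described above can be done in a few lines of base R. This is only a rough sketch of what was meant; the helper name impute_mean is hypothetical, not part of randomForestSRC:

```r
# Hypothetical helper: replace each NA in the numeric columns
# of a data frame with that column's mean (computed ignoring NAs).
impute_mean <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  })
  df
}

# Toy example
d <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 4))
d2 <- impute_mean(d)
# d2$a is now c(1, 2, 3); d2$b is now c(3, 2, 4)
```

In practice the same training means should also be applied to any incoming test rows, so that training and prediction see consistently imputed data.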
I want to mention something else regarding fast prediction times; after that, I will get to your question about prediction with missing data.
You should consider using the function rfsrc.anonymous, which is especially tuned for the purpose of prediction. We call it anonymous because it saves only basic information from the training data; it is designed for researchers who want to share their trained random forest object with others without giving up personal information. A relevant side effect is that it produces a trained forest that is highly efficient for prediction.
The following illustrates its use, which is basically the same as above but switches rfsrc for rfsrc.anonymous:
o <- rfsrc.anonymous(myf, mydta,
                     perf.type = "none",
                     save.memory = FALSE,
                     ntree = ntree,
                     nsplit = nsplit,
                     sampsize = mysampsize,
                     nodesize = mynodesize,
                     ntime = ntime)$forest
p <- predict(o, mytestdata)$predicted
Now prediction on a test data point should be roughly an order of magnitude faster than with the rfsrc forest above. On my machine it takes about 0.3 seconds.
Next, I want to mention that if your goal is ultimately prediction, then the time taken to train the forest might not be such a big issue, given that you are probably more interested in accuracy. If you remove the option sampsize=mysampsize, the sample size reverts to the usual bootstrap size of 63.2%; training will take longer, but the resulting forest will have better out-of-sample prediction performance. On my machine this increases the training time from about 60 seconds to about 900 seconds, which is not bad if you only need to train once and the primary goal is prediction.
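Concretely, the accuracy-oriented variant is the same training call with the sampsize argument dropped; this sketch reuses the placeholder names myf and mydta from the calls above and assumes the same data setting:

```r
# Same training call as before, but without sampsize:
# the default bootstrap (about 63.2% of n per tree) is used,
# which trains more slowly but generally predicts better
# out of sample.
o <- rfsrc(myf, mydta,
           perf.type = "none",
           save.memory = FALSE,
           ntree = ntree,
           nsplit = nsplit,
           nodesize = mynodesize,
           ntime = ntime)$forest
```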
Finally, returning to the issue of prediction with missing test data. Using na.action="na.impute" is generally slow, and therefore there is an option na.action="na.random" for this purpose that will speed things up. The call looks like
p <- predict(o, mytestdata, na.action="na.random")$predicted
Note that if the forest is trained using rfsrc.anonymous, then na.action="na.impute" is generally fast, because a quick, rough imputation is used instead; the true na.impute cannot be implemented, since the training data is not saved. In this setting, whatever you do, prediction on missing test data will be faster.
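Putting the two pieces together, training with rfsrc.anonymous and then predicting on incomplete test data might look like the sketch below; it reuses the placeholder objects (myf, mydta, mytestdata) and parameters from the calls above:

```r
# Anonymous forest: only basic training information is stored,
# so prediction-time na.impute falls back to a fast rough impute.
o <- rfsrc.anonymous(myf, mydta,
                     perf.type = "none",
                     ntree = ntree,
                     ntime = ntime)$forest

# Either na.action works on missing test data here;
# na.random is the faster choice for a non-anonymous forest.
p <- predict(o, mytestdata, na.action = "na.impute")$predicted
```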
Thank you so much for your explanation, these suggestions work very well! Indeed, it wasn't really clear to me from the documentation how rfsrc.anonymous could be combined with the imputation mechanism. Many thanks for the package and your help!
Hi! First, thank you so much for your work on this package! I'm using it in a dashboard I'm building to analyse failures of industrial equipment, but I'm faced with the problem that prediction is too slow. Even after removing the low-importance features, reducing ntree from 1000 to 100, and increasing nodesize to about 1% of the data, it still takes ~30 seconds for a single prediction using a model constructed with rfsrc.fast, which is not feasible for a dashboard. My training data consists of ~1.8 million rows and I'm working on a cluster with 128 GB of RAM and 32 cores. Do you have any advice on how to improve this?