kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

`predict` too slow #389

Closed oelhammouchi closed 10 months ago

oelhammouchi commented 10 months ago

Hi! First, thank you so much for your work on this package! I'm using it in a dashboard I'm building to analyse failures of industrial equipment, but prediction is too slow. Even after removing the low-importance features, reducing ntree from 1000 to 100, and increasing nodesize to about 1% of the data, a single prediction with a model constructed with rfsrc.fast still takes ~30 seconds, which is not feasible for a dashboard. My training data consists of ~1.8 million rows and I'm working on a cluster with 128GB of RAM and 32 cores. Do you have any advice on how to improve this?
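For context, a minimal sketch of the kind of call described above (myf, mydta, and mytestdata are hypothetical placeholders, and forest=TRUE is assumed here so the forest is retained for prediction):

library(randomForestSRC)

## sketch of the slow setup; all object names are placeholders
o.fast <- rfsrc.fast(myf, mydta,
                     ntree = 100,                    # reduced from 1000
                     nodesize = 0.01 * nrow(mydta),  # ~1% of the data
                     forest = TRUE)                  # retain the forest for prediction
p <- predict(o.fast, mytestdata)$predicted           # ~30 s per prediction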

ishwaran commented 10 months ago

We are looking into this and have some ideas. However, can you first tell me how many variables (attributes) you have in your model? Also, is this a regression, classification, or survival setting?

oelhammouchi commented 10 months ago

It's a survival model with ~15 variables.

ishwaran commented 10 months ago

The issue is the option save.memory=TRUE used by rfsrc.fast. This (and the function's other defaults) was chosen so that rfsrc.fast can be used for rapid training. However, the function is not necessarily optimized for prediction.

Given that this is a big data set, I highly recommend that you work directly with rfsrc, which is, after all, what rfsrc.fast ultimately calls.

I trained and then tested a survival model with 1 million cases and p=20 variables using the parameter settings below. Training on my machine took about 1 minute and prediction on a single test data point took about 3 seconds. Note in particular the option perf.type="none" used for training, which turns off all performance values. With this option you will no longer be able to tell how well your forest performs (OOB error rate and so forth), but it is needed for fast training. If you want to tune parameters, try reducing your data set sample size (say to 200K) and tune on that prior to the call below (a tuning sketch follows the code).

ntree <- 100                                                   # number of trees
nsplit <- 10                                                   # random split points per candidate variable
mysampsize <- function(x) {min(x * 0.632, max(150, x^(3/4)))}  # subsample size per tree
mynodesize <- max(1, nrow(mydta) * .001)                       # minimum terminal node size (~0.1% of cases)
ntime <- 50                                                    # time points for the survival ensemble

o <- rfsrc(myf, mydta,
           perf.type = "none",       # turn off performance values for fast training
           save.memory = FALSE,
           ntree = ntree,
           nsplit = nsplit,
           sampsize = mysampsize,
           nodesize = mynodesize,
           ntime = ntime)$forest

p <- predict(o, mytestdata)$predicted
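For the tuning step mentioned above, one possible sketch uses the package's tune() helper on a subsample (the 200K size, seed, and object names are illustrative):

## tune mtry and nodesize on a ~200K subsample before the full training call
set.seed(1)
mysub <- mydta[sample(nrow(mydta), 2e5), ]
o.tune <- tune(myf, mysub, ntreeTry = 100)  # OOB-based tuning of mtry and nodesize
o.tune$optimal                              # suggested nodesize and mtry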

oelhammouchi commented 10 months ago

It worked! Thank you so very much, you have no idea how much this helped us! The only minor problem we're still left with is that the same speedup in prediction time is not observed with na.action = "na.impute"; it still takes ~30 s per prediction. For now we've 'resolved' this by manually imputing the mean ourselves before training, but it would obviously be quite nice to use your sophisticated imputation algorithm. Is there any easy way of resolving this, do you think? In any case, thanks again for your help!
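For reference, the manual mean imputation mentioned above might look like this sketch over the numeric columns of a placeholder frame mydta:

num <- sapply(mydta, is.numeric)              # locate numeric columns
mydta[num] <- lapply(mydta[num], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)        # replace NAs with the column mean
  x
})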

ishwaran commented 10 months ago

I want to mention something else regarding fast prediction times, after which I will get to your question about prediction with missing data.

You should consider using the function rfsrc.anonymous, which is especially tuned for prediction. We call it anonymous because it saves only basic information from the training data; it is designed for researchers who want to share their trained random forest object with others without giving up personal information. A relevant side effect is that it produces a trained forest that is highly efficient for prediction.

The following illustrates its use, which is basically the same as above but swaps rfsrc for rfsrc.anonymous:

o <- rfsrc.anonymous(myf, mydta,
                     perf.type = "none",
                     save.memory = FALSE,
                     ntree = ntree,
                     nsplit = nsplit,
                     sampsize = mysampsize,
                     nodesize = mynodesize,
                     ntime = ntime)$forest

p <- predict(o, mytestdata)$predicted

Now prediction on a test data point should be about 10 times faster than the rfsrc call above, and roughly 100 times faster than your original rfsrc.fast setup: on my machine it takes about 0.3 seconds.

Next, I want to mention that if your goal is ultimately prediction, then the time taken to train the forest may not be such a big issue, since you are probably more interested in accuracy. If you remove the option sampsize=mysampsize, the sample size reverts to the usual bootstrap (which draws about 63.2% distinct cases per tree) and training takes longer, but the resulting forest will have better out-of-sample prediction performance. On my machine this increases the training time from about 60 seconds to about 900 seconds. That is not bad if you only need to train once and the primary goal is prediction.
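Concretely, that is the same rfsrc.anonymous call as above with the sampsize argument removed (a sketch, reusing the placeholder objects from before):

o <- rfsrc.anonymous(myf, mydta,
                     perf.type = "none",
                     save.memory = FALSE,
                     ntree = ntree,          # sampsize omitted: full bootstrap per tree
                     nsplit = nsplit,
                     nodesize = mynodesize,
                     ntime = ntime)$forest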

Finally, returning to the issue of prediction with missing test data: using na.action="na.impute" is generally slow, so there is a faster option, na.action="na.random", for this purpose. The call looks like:

p <- predict(o, mytestdata, na.action="na.random")$predicted

Note that if the forest is trained using rfsrc.anonymous, then na.action="na.impute" is generally fast: because the training data is not saved, the true na.action="na.impute" cannot be implemented, so a fast rough impute is used instead. Therefore in this setting, whatever you do, prediction on missing test data will be faster.
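To check the difference on your own data, a quick illustrative timing on a single test row:

## compare per-prediction times for the two missing-data strategies
system.time(predict(o, mytestdata[1, ], na.action = "na.random"))
system.time(predict(o, mytestdata[1, ], na.action = "na.impute"))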

oelhammouchi commented 10 months ago

Thank you so much for your explanation, these suggestions work very well! Indeed, it wasn't really clear to me from the documentation how rfsrc.anonymous could be combined with the imputation mechanism. Many thanks for the package and your help!