kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

Faster/parallel varimp #103

Closed ghost closed 3 years ago

ghost commented 3 years ago

Hi,

Is it possible to make the calculation of variable importance faster? I have a dataset with nearly 200,000 rows and 110 columns. The variable importance calculation seems to take forever; after waiting for 2 days I stopped it. I gave it 12 cores and can see that the tree growing uses all 12 cores, but the variable importance part uses only a single core. I have set the nodesize to 15 as I am fitting competing risk models. Is there a way to make this faster? Or would it be better if I just permuted each variable myself and used predictions on the OOB samples? I could run that in parallel, since each variable can be processed independently of the others.

Thanks.

ishwaran commented 3 years ago

VIMP will be slow for large data sets, especially for survival analysis. We have been working on ways to speed up the code. In the meantime, the following should help quite a bit:

1) Do not request VIMP in grow mode.
2) Acquire VIMP in prediction mode, using get.tree set to a small number of trees.
3) Set "ntime" to a reasonable number, e.g. ntime = 200.

So the idea is rather than trying to get VIMP for all thousands of trees, try to get it for a few trees which should be much faster.

data(pbc)
o <- rfsrc(Surv(days, status) ~ ., pbc, ntree = 25000)
predict(o, get.tree = 1:100, importance = TRUE)$importance
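A minimal sketch of the same pattern adapted to a competing-risk fit like the one described above (the data name, variables, and settings here are placeholders, not code from this thread):

## sketch only: skip VIMP in grow mode, restrict the event-time grid,
## then pull VIMP from a small subset of trees in prediction mode
o <- rfsrc(Surv(time, status) ~ ., dat, ntree = 1000, ntime = 200, importance = FALSE)
vimp.sub <- predict(o, get.tree = 1:100, importance = TRUE)$importance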

ishwaran commented 3 years ago

Another technique for big data is to run a pilot forest using shallow trees. Filter variables by keeping only those variables that split a tree. When doing this the value for "mtry" should be set to the number of variables.

Here's an example with the Iowa housing data.

## use the housing data set
data(housing)

## the original data contains lots of missing data, use fast imputation
housing2 <- impute(data = housing, splitrule = "random", fast = TRUE)

## run shallow trees to find variables that split any tree
xvar.used <- rfsrc(SalePrice ~ ., housing2, ntree = 250, nodedepth = 4,
                   var.used = "all.trees", mtry = Inf, nsplit = 100)$var.used

## now fit the forest using the filtered variables
xvar.keep <- names(xvar.used)[xvar.used >= 1]
o <- rfsrc(SalePrice ~ ., housing2[, c("SalePrice", xvar.keep)])
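The same filtering idea carries over to a survival or competing-risk outcome; a minimal sketch under placeholder names (time, status, dat), not code from this thread:

## sketch only: shallow pilot forest to filter variables on survival data
xvar.used <- rfsrc(Surv(time, status) ~ ., dat, ntree = 250, nodedepth = 4,
                   var.used = "all.trees", mtry = Inf, nsplit = 100)$var.used
xvar.keep <- names(xvar.used)[xvar.used >= 1]
o <- rfsrc(Surv(time, status) ~ ., dat[, c("time", "status", xvar.keep)])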

kogalur commented 3 years ago

This thread also intersects with issue #95 and issue #96. We did discover some OpenMP threading issues with requests for variable importance that manifest with bigger data. We have fixed these and will be posting a build to CRAN that addresses this issue, hopefully in a few days.

ghost commented 3 years ago

Thank you both! I will try both methods. I was following one of the methods you had used for variable selection, where you select variables based on minimal depth relative to the average minimal depth and positive variable importance. When I tried it with a subset of the data, this method selects too many variables, around 80 out of 100. I am working on a model for clinical practice, where in practice it is better to have a smallish number of variables, 10-20. I am thinking of using the Boruta method from https://academic.oup.com/bib/article/20/2/492/4554516. I just wanted to see if you have any strong opinion on the thresholds for variables.

Apologies for changing the subject on the thread.

Ah, but re-reading Prof. Ishwaran's second post, that could just be the way to do the whole variable selection part. I might be pushing my luck here, but would you have a reference for that, i.e. for using only variables that split any tree?

ishwaran commented 3 years ago

We are working on some new methods for variable selection and hope to have those up at some point. Regarding Boruta, well there is also the method of "knockoff variable selection" (Barber and Candes) and I'm surprised this wasn't mentioned in the paper you sent.  The knockoff method is very well known and there is some theory that has been developed for it.


ghost commented 3 years ago

Thank you Prof. Ishwaran!

ghost commented 3 years ago

For my dataset, predict(o, get.tree=1:3, importance=TRUE)$importance took just over 4 hours, for 3 trees, with ntime = 100. I wonder whether it would be better to use the elbow method for the minimal depth? Is there a way to get minimal depth for each variable from the fully fitted model?
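(For reference, a hedged sketch of how minimal depth per variable can typically be pulled from an already fitted randomForestSRC forest; this is not code from the thread and the object name is a placeholder:)

## sketch only: minimal depth for each variable from an existing fit
md <- max.subtree(o, conservative = FALSE)
md$order[, 1]    ## first-order (minimal) depth per variable
md$threshold     ## minimal depth threshold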

ishwaran commented 3 years ago

I have a fairly large CR data set, n=33,000 and p=30.  Not as big as your data set - but still useful. 

First, I used tune.nodesize() to get an approximation for a good node size. I recommend not using the default settings of the function, which start at a node size of 1 and work their way up; instead, start at a larger value. The function uses rfsrc.fast(), so it should be pretty quick nevertheless.
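A minimal sketch of that tuning call with a coarse, non-default grid (data and variable names are placeholders, not from this thread):

## sketch only: start the node-size grid well above the default of 1
opt <- tune.nodesize(Surv(time, status) ~ ., dat,
                     nodesizeTry = c(50, 100, 150, 200, 300, 400, 500),
                     sampsize = 10000, trace = TRUE)
opt$err    ## error rate by node size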

Not surprisingly, I found node size should be fairly large which is typical for large n survival data sets. I set nodesize=100. I set ntime=150 but ntime=100 (your setting) is also very good.

Here's the output of my forest:

                         Sample size: 33067
                    Number of events: 5401, 21076
                     Number of trees: 50
           Forest terminal node size: 100
       Average no. of terminal nodes: 227.88
No. of variables tried at each split: 6
              Total no. of variables: 29
       Resampling used to grow trees: swor
    Resample size used to grow trees: 20898
                            Analysis: RSF
                              Family: surv-CR
                      Splitting rule: logrankCR random
       Number of random split points: 10
                    (OOB) Error rate: 43.11082876%, 31.64244452%

Calculating importance was relatively fast. I am using the latest build, which resolves the locking issue mentioned earlier. All my CPUs light up and it works much better than before. For 3 trees, it took 18 seconds:

system.time(predict(o, get.tree = 1:3, importance = T)$importance)
   user  system elapsed
169.674   0.308  17.908

For 25 trees, it finished in under 1 minute of elapsed time:

system.time(predict(o, get.tree = 1:25, importance = T)$importance)
   user  system elapsed
622.285   1.011  57.611


ghost commented 3 years ago

Thank you for your reply, Prof. Ishwaran. Setting the nodesize to a higher number would definitely help; I had set it to 15, the default. I will optimise for it. The dataset is highly imbalanced; the primary event rate is only 2.7%.

ishwaran commented 3 years ago

I won't be surprised if you need a very large node size. Even 200 might be too small. Run the tuning function and let me know as I'm curious.

ghost commented 3 years ago

I've run this code on a 178,000 by 101 data.table:

  opt <- tune.nodesize(Surv(Survival_time, Status) ~ .,
                       data = train[, -"ID"],
                       nodesizeTry = ns.grid,
                       sampsize = 10000,
                       nsplit = 2,
                       splitrule = "logrankCR",
                       ntime = 100,
                       trace = TRUE)

$err
   nodesize       err
1        20 0.1286318
2        30 0.1305858
3        40 0.1324669
4        50 0.1342076
5        60 0.1355017
6        70 0.1368552
7        80 0.1380444
8        90 0.1387118
9       100 0.1393220
10      150 0.1425560
11      200 0.1448375
12      250 0.1464231
13      300 0.1472330
14      350 0.1487477
15      400 0.1496820
16      450 0.1495343
17      500 0.1507030

I think I probably need smaller node sizes, maybe because of the very low rate of the primary event. I've also tried a full model fit with nodesize = 100; it resulted in an average of around 1,200 terminal nodes. The predict function with importance is still taking a long time. I am still using v2.11.0. How can I get the latest build?

ishwaran commented 3 years ago

Interesting. OK, so it looks like smaller node sizes will be needed for prediction.

However, variable selection is a totally different ball game. For this purpose (and this purpose alone) you can run a separate forest with a largish nodesize, say 250 or 350, and you will be fine. Variable importance is the difference in prediction performance, so you are not measuring prediction performance itself but rather the change in it when a variable is noised up. Also, variable importance is a tree-based concept; it doesn't actually relate to the forest's performance at all.

In other words, don't let the variable selection step convince you that you need one forest to do everything. It's going to slow you down for no appreciable gain. Run as fast a forest as you can manage while keeping prediction accuracy reasonable.
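A minimal sketch of that split, with placeholder names and settings (not code from this thread): one coarse forest used only for importance, and a separate forest, tuned for prediction, fit on the selected variables.

## sketch only: coarse forest just for variable selection
sel <- rfsrc(Surv(time, status) ~ ., dat, ntree = 500, nodesize = 250,
             ntime = 100, importance = FALSE)
vi <- predict(sel, get.tree = 1:25, importance = TRUE)$importance

## sketch only: keep, say, the top 20 variables for the first event type,
## then fit the prediction forest with a smaller nodesize
keep <- names(sort(vi[, 1], decreasing = TRUE))[1:20]
o <- rfsrc(Surv(time, status) ~ ., dat[, c("time", "status", keep)], nodesize = 20)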

We hope to have the new build out next week after the holiday weekend.

ghost commented 3 years ago

Does variable selection need to be done with the same sampsize as the final model? That would be 63.2% of the full dataset. I suppose a smaller sampsize will do, since we are only looking at the difference in the C-index. I am thinking of setting it to 15K.

ishwaran commented 3 years ago

Ironically that will make variable importance slower! Using a smaller sample size for the training data means a larger sample size for the out-of-sample (OOB) data, thereby making more work for the calculation of VIMP.

So actually you would want to increase the sampsize. That would speed things up for sure.

Also, you can try different types of importance besides permutation, which is the default. For example, "anti" and "random" can be much faster for big data.
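A minimal sketch of those options in prediction mode (the forest object and tree counts are placeholders):

## sketch only: non-default importance types on a subset of trees
predict(o, get.tree = 1:25, importance = "random")$importance
predict(o, get.tree = 1:25, importance = "anti")$importance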

ghost commented 3 years ago

OK. Thank you very much Prof. Ishwaran for your comment and your time!

ishwaran commented 3 years ago

We just placed version 2.12 on CRAN, which has the improvements we discussed. Please let me know if this works satisfactorily.

ghost commented 3 years ago

Hi Prof. Ishwaran,

The parallel estimation of importance works now; I can see it using all the cores given to it. For the following forest:

                        Sample size: 177838
                    Number of events: 4933, 48994
                     Number of trees: 1000
           Forest terminal node size: 15
       Average no. of terminal nodes: 7141.033
No. of variables tried at each split: 10
              Total no. of variables: 98
       Resampling used to grow trees: swor
    Resample size used to grow trees: 112394
                            Analysis: RSF
                              Family: surv-CR
                      Splitting rule: logrankCR *random*
       Number of random split points: 2
                    (OOB) Error rate: 14.37891555%, 13.66146899%

the predict(mod, get.tree = 1:100, importance = "random")$importance took around 8 hours on 16 cores.

Thank you very much for fixing this issue!

ishwaran commented 3 years ago

Glad to hear it's working out. I still think for importance you can use a much larger nodesize. We will be working on fast VIMP software in the near future.