VIMP will be slow for large data sets, especially for survival analysis. We have been working on ways to speed up the code. In the meantime, the following should help quite a bit:
1) Do not request VIMP in grow mode.
2) Acquire VIMP in prediction using get.tree set to a small number of trees.
3) Set "ntime" to a reasonable number, e.g. ntime = 200.
So the idea is that rather than trying to get VIMP over all of the thousands of trees, you get it for a few trees, which should be much faster.
data(pbc)
o <- rfsrc(Surv(days, status) ~ ., pbc, ntree = 25000)
predict(o, get.tree = 1:100, importance = TRUE)$importance
Another technique for big data is to run a pilot forest using shallow trees. Filter variables by keeping only those variables that split a tree. When doing this the value for "mtry" should be set to the number of variables.
Here's an example with the Iowa housing data.
data(housing)
housing2 <- impute(data = housing, splitrule = "random", fast = TRUE)
xvar.used <- rfsrc(SalePrice ~., housing2, ntree = 250, nodedepth = 4, var.used="all.trees", mtry = Inf, nsplit = 100)$var.used
xvar.keep <- names(xvar.used)[xvar.used >= 1]
o <- rfsrc(SalePrice ~ ., housing2[, c("SalePrice", xvar.keep)])
This thread also intersects with issue #95 and issue #96. We did discover some OpenMP threading issues with requests for variable importance that manifest with bigger data. We have fixed these and will be posting to CRAN, hopefully in a few days, with a build that addresses this issue.
Thank you both! I will try both methods. I was following one of the methods you had used for variable selection, where you select variables whose minimal depth falls below the average minimal depth and whose variable importance is positive. When I tried this with a subset of the data, it selected too many variables, around 80 out of 100. I am working on a model for clinical practice, where it is better to have a smallish number of variables (10-20). I am thinking of using the Boruta method from https://academic.oup.com/bib/article/20/2/492/4554516. I just wanted to see if you have any strong opinion on the thresholds for selecting variables.
Apologies for changing the subject on the thread.
Ah, but the approach in Prof. Ishwaran's second post could just be the way to do the whole variable selection part. I might be pushing my luck here, but would you have a reference for that, i.e. for using only the variables that split any tree?
We are working on some new methods for variable selection and hope to have those up at some point. Regarding Boruta, well there is also the method of "knockoff variable selection" (Barber and Candes) and I'm surprised this wasn't mentioned in the paper you sent. The knockoff method is very well known and there is some theory that has been developed for it.
Thank you Prof. Ishwaran!
For my dataset, predict(o, get.tree=1:3, importance=TRUE)$importance took just over 4 hours for 3 trees, with ntime = 100. I wonder whether it would be better to use the elbow method on the minimal depth? Is there a way to get the minimal depth for each variable from the fully fitted model?
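Something like this is roughly what I have in mind for the minimal depth part, assuming max.subtree() is the right tool here and that its order and threshold components can be read off the grow object:

md <- max.subtree(o)
md.depth <- md$order[, 1]   # first-order minimal depth, one value per variable
sort(md.depth)              # smaller depth = variable tends to split closer to the root
md$threshold                # mean minimal depth, sometimes used as a selection cutoff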
I have a fairly large CR data set, n=33,000 and p=30. Not as big as your data set - but still useful.
First, I used tune.nodesize() to get an approximation for a good node size. I recommend not using the default settings of the function, which starts at a node size of 1 and works its way up; start at a larger value instead. The function uses rfsrc.fast() and should be pretty quick, nevertheless.
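For example, a minimal sketch using the pbc data from earlier in the thread (the grid values are only illustrative; the point is simply to start the search well above a node size of 1):

data(pbc)
ns <- tune.nodesize(Surv(days, status) ~ ., pbc,
                    nodesizeTry = c(25, 50, 100, 150, 200),
                    trace = TRUE)
ns$err   # error for each candidate node size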
Not surprisingly, I found node size should be fairly large which is typical for large n survival data sets. I set nodesize=100. I set ntime=150 but ntime=100 (your setting) is also very good.
Here's the output of my forest:
Sample size: 33067
Number of events: 5401, 21076
Number of trees: 50
Forest terminal node size: 100
Average no. of terminal nodes: 227.88
No. of variables tried at each split: 6
Total no. of variables: 29
Resampling used to grow trees: swor
Resample size used to grow trees: 20898
Analysis: RSF
Family: surv-CR
Splitting rule: logrankCR *random*
Number of random split points: 10
(OOB) Error rate: 43.11082876%, 31.64244452%
Calculating importance was relatively fast. I am using the latest build, which resolves the locking issue mentioned earlier. All my CPUs light up and it works much better than before. For 3 trees, it took 18 seconds:
system.time(predict(o, get.tree = 1:3, importance = T)$importance)
   user  system elapsed
169.674   0.308  17.908
For 25 trees, it finished in under 1 minute of elapsed time:
system.time(predict(o, get.tree = 1:25, importance = T)$importance)
   user  system elapsed
622.285   1.011  57.611
Thank you for your reply, Prof. Ishwaran. Setting the nodesize to a higher number would definitely help; I had set it to 15 -- the default. I will tune it. The dataset is highly imbalanced: the primary event rate is only 2.7%.
I won't be surprised if you need a very large node size. Even 200 might be too small. Run the tuning function and let me know as I'm curious.
I've run this code on a 178,000 by 101 data.table:
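# ns.grid is a user-defined grid of candidate node sizes (its definition is not shown here)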
opt <- tune.nodesize(Surv(Survival_time, Status) ~ .,
data = train[, -"ID"],
nodesizeTry = ns.grid,
sampsize = 10000,
nsplit = 2,
splitrule = "logrankCR",
ntime = 100,
trace = TRUE)
$err
nodesize err
1 20 0.1286318
2 30 0.1305858
3 40 0.1324669
4 50 0.1342076
5 60 0.1355017
6 70 0.1368552
7 80 0.1380444
8 90 0.1387118
9 100 0.1393220
10 150 0.1425560
11 200 0.1448375
12 250 0.1464231
13 300 0.1472330
14 350 0.1487477
15 400 0.1496820
16 450 0.1495343
17 500 0.1507030
I think I probably need smaller node sizes, maybe because of the very low rate of the primary event. I've also tried a full model fit with nodesize = 100; it resulted in around 1,200 terminal nodes on average. The predict function with importance is still taking a long time. I am still using v2.11.0. How can I get the latest build?
Interesting. OK, so it looks like smaller node sizes will be needed for prediction.
However, variable selection is a totally different ball game. For this purpose (and this purpose alone) you can run a separate forest with a largish nodesize, say 250 or 350, and you will be fine. Variable importance is the difference in prediction performance, so you are not measuring prediction performance itself, but rather the change in it. Also, variable importance is a tree-based concept; it doesn't actually relate to forest performance at all.
In other words, don't let the variable selection convince you that you need one forest to do everything. It's going to slow you down for no appreciable gain. Run as fast a forest as you can manage while keeping prediction accuracy reasonable.
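For example, something along these lines (a rough sketch only; the formula and splitting settings are taken from your earlier post, while the ntree and sampsize values are placeholders you would tune):

o.sel <- rfsrc(Surv(Survival_time, Status) ~ ., data = train[, -"ID"],
               nodesize = 250, splitrule = "logrankCR", nsplit = 2,
               ntime = 100, ntree = 500, sampsize = 20000)
v.sel <- predict(o.sel, get.tree = 1:25, importance = TRUE)$importance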
We hope to have the new build out next week after the holiday weekend.
Does variable selection need to be done with the same sampsize as the final model? So it would be 63.2% of the full dataset. I suppose a smaller sampsize will do, since we are only looking at the difference in the C-index. I am thinking of setting it at 15K.
Ironically that will make variable importance slower! Using a smaller sample size for the training data means a larger sample size for the out-of-sample (OOB) data, thereby making more work for the calculation of VIMP.
So actually you would want to increase the sampsize. That would speed things up for sure.
Also, you can try different types of importance besides permutation, which is the default. For example, "anti" and "random" could be much faster for big data.
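For example (a quick sketch, reusing the forest o from above):

system.time(predict(o, get.tree = 1:25, importance = "anti")$importance)
system.time(predict(o, get.tree = 1:25, importance = "random")$importance)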
OK. Thank you very much Prof. Ishwaran for your comment and your time!
We just placed version 2.12 on CRAN that has the improvements we discussed. Please let me know if this works satisfactorily.
Hi Prof. Ishwaran,
The parallel estimation of importance works now, I can see it using all cores given to it. For the following forest:
Sample size: 177838
Number of events: 4933, 48994
Number of trees: 1000
Forest terminal node size: 15
Average no. of terminal nodes: 7141.033
No. of variables tried at each split: 10
Total no. of variables: 98
Resampling used to grow trees: swor
Resample size used to grow trees: 112394
Analysis: RSF
Family: surv-CR
Splitting rule: logrankCR *random*
Number of random split points: 2
(OOB) Error rate: 14.37891555%, 13.66146899%
the predict(mod, get.tree = 1:100, importance = "random")$importance call took around 8 hours on 16 cores.
Thank you very much for fixing this issue!
Glad to hear it's working out. I still think for importance you can use a much larger nodesize. We will be working on fast VIMP software in the near future.
Hi,
Is it possible to make the calculation of variable importance faster? I have a dataset with nearly 200,000 rows and 110 columns. The variable importance computation seems to take forever; after waiting for 2 days I stopped it. I gave it 12 cores and can see that the tree growing uses all 12 cores, but the variable importance part uses only a single core. I have set the nodesize to 15 as I am fitting competing risk models. Is there a way to make this faster? Or would it be better if I just permute each variable and use predictions on OOB samples? I could run that in parallel, since each variable can be processed independently of the others.
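Rather than hand-rolling the permutation, something like this rough sketch is what I have in mind, using the package's vimp() helper one variable at a time (o being the fitted forest; this assumes vimp() accepts xvar.names for per-variable importance, and that setting rf.cores to 1 inside each worker avoids oversubscribing threads):

library(parallel)
xvars <- o$xvar.names
v.list <- mclapply(xvars, function(v) {
  options(rf.cores = 1)                # let mclapply provide the parallelism
  vimp(o, xvar.names = v)$importance   # importance for a single variable
}, mc.cores = 12)
names(v.list) <- xvars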
Thanks.