kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

How can the var.select.rfsrc function give a depth threshold higher than any minimal depth? #439

Open NickvGennip opened 3 months ago

NickvGennip commented 3 months ago

I used the rfsrc function to fit a random survival forest for competing risks. The output shows the minimal depth of each variable in my model, along with the threshold used for minimal depth variable selection. As I understand it, the threshold is calculated as the mean of the distribution of the mean minimal depths of the individual variables. However, in my output the threshold is higher than the minimal depth of every individual variable, so all variables are selected. How is this possible, given that a mean can never exceed the maximum? Below is the code I used and its output. As you can see, the depth threshold is 8.1338, while the highest minimal depth of any individual variable is 4.535.

[Screenshot: RSF code and output]

ishwaran commented 3 months ago

The minimal depth threshold is valid only under certain conditions, which may not be met in your example. It was originally motivated by settings with high dimension and small sample size (p > n), whereas your example is the opposite (n > p).

In competing risks, minimal depth is usually used when the goal is to determine which variables are informative for the long-term probabilities of events.

I would suggest using the subsample function to calculate confidence intervals for the cause-specific VIMP - it looks like you are interested in cause = 1 in your example. You can then use minimal depth as a secondary analysis to confirm your findings.

NickvGennip commented 3 months ago

Thank you for the elaboration. If I use confidence intervals for VIMP, how do I then determine which variables to remove from the analysis? Or do I simply say, for example, that a variable is removed from the data if its VIMP is negative?

ishwaran commented 3 months ago

You would include only variables whose confidence interval excludes zero.
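
The rule above can be sketched in a few lines. This is only an illustration in Python, not randomForestSRC code: the function name `select_by_ci`, the variable names, and the interval values are all hypothetical, and the confidence intervals are assumed to have already been computed elsewhere (for example from the subsampling step suggested earlier in the thread).

```python
from typing import Dict, List, Tuple


def select_by_ci(intervals: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the variables whose (lower, upper) interval does not contain zero.

    `intervals` maps a variable name to its VIMP confidence interval.
    The intervals are assumed to be precomputed; nothing here estimates them.
    """
    selected = []
    for name, (lower, upper) in intervals.items():
        if lower > upper:
            raise ValueError(f"invalid interval for {name!r}: {lower} > {upper}")
        # An interval excludes zero when it lies strictly on one side of it.
        if lower > 0 or upper < 0:
            selected.append(name)
    return selected


# Hypothetical intervals, made up purely for illustration.
example = {
    "age": (0.012, 0.048),
    "sex": (-0.004, 0.009),
    "stage": (0.020, 0.071),
}

selected = select_by_ci(example)
print(selected)
```

With these made-up numbers, "sex" is dropped because its interval spans zero. Note that the sketch applies the rule literally, so an interval lying entirely below zero would also be kept; whether such a variable should be retained is a judgment call the thread does not address.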