kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

All Samples Have Identical Predicted Risk Score #384

Closed DarioS closed 1 year ago

DarioS commented 1 year ago

Is this a bug? No warnings about problems with the input data are displayed. testData.txt is attached for reproducing the output.

> library(randomForestSRC)
> testData <- read.delim("testData.txt")
> model <- rfsrc(Surv(time, status) ~ ., data = testData)
> predict(model, testData)$predicted
  16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461
  16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461
  16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461 16.33461

ishwaran commented 1 year ago

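The sample size here is 33 and the default nodesize for survival families is 15. Each tree is grown on only 21 resampled cases (see the output below), so a nodesize of 15 leaves no room for even one split. The result is a forest of stumps, i.e. trees with exactly 1 terminal node. Every sample lands in that single node, which is why all predicted values are identical. The OOB error rate is about 0.90, which is terrible.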

> o <- rfsrc(Surv(time, status) ~ ., data = testData)
> o
                         Sample size: 33
                    Number of deaths: 20
                     Number of trees: 500
           Forest terminal node size: 15
       Average no. of terminal nodes: 1
No. of variables tried at each split: 5
              Total no. of variables: 17
       Resampling used to grow trees: swor
    Resample size used to grow trees: 21
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          (OOB) CRPS: 0.11722355
   (OOB) Requested performance error: 0.90131579
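
The stump diagnosis can also be checked programmatically rather than read off the printout. The following is a minimal sketch, not the exact reproduction: testData.txt is not shown here, so it assumes a stand-in of 33 rows sampled from the veteran data set bundled with randomForestSRC, and the names dat, fit, leaves and pred are illustrative. It counts the terminal nodes of each tree and the number of distinct predicted values.

```r
library(randomForestSRC)

# Stand-in data (assumption): 33 rows drawn from the veteran data set
# that ships with randomForestSRC, because testData.txt is not available here.
data(veteran, package = "randomForestSRC")
set.seed(1)  # fixes the row sample only, not the forest itself
dat <- veteran[sample(nrow(veteran), 33), ]

# Survival forest with the default nodesize (15 for survival families).
fit <- rfsrc(Surv(time, status) ~ ., data = dat, ntree = 100)

# Terminal nodes per tree; a count of 1 means the tree never split.
leaves <- fit$leaf.count
print(table(leaves))

# Number of distinct predicted values across the 33 rows.
pred <- predict(fit, dat)$predicted
print(length(unique(round(pred, 6))))
```

If every entry of leaves is 1, the forest consists only of stumps and a single distinct predicted value is the expected consequence.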

Decreasing nodesize improves results:

> o <- rfsrc(Surv(time, status) ~ ., data = testData, nodesize = 3)
> o
                         Sample size: 33
                    Number of deaths: 20
                     Number of trees: 500
           Forest terminal node size: 3
       Average no. of terminal nodes: 7.102
No. of variables tried at each split: 5
              Total no. of variables: 17
       Resampling used to grow trees: swor
    Resample size used to grow trees: 21
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          (OOB) CRPS: 0.11465175
   (OOB) Requested performance error: 0.49342105

Even better, at least for error rate (the CRPS gets slightly worse), are pure trees, i.e. nodesize = 1:

> o <- rfsrc(Surv(time, status) ~ ., data = testData, nodesize = 1)
> o
                         Sample size: 33
                    Number of deaths: 20
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 19.71
No. of variables tried at each split: 5
              Total no. of variables: 17
       Resampling used to grow trees: swor
    Resample size used to grow trees: 21
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          (OOB) CRPS: 0.13144386
   (OOB) Requested performance error: 0.44078947

Here are the OOB predicted values from the last call:

> o$predicted.oob

 [1]  9.738636  5.620321  7.673575  8.183246  7.879581  7.967742  4.948571  6.889535 10.055556  9.955801  4.675824  7.458333  8.093407  2.446927
[15]  8.770950  5.163743  4.407821  5.149425  2.970238  3.965714  4.641026  6.116022  8.812155  4.194444  5.283237  3.438776  7.622222  6.945000
[29] 11.106145 10.287234  6.166667  7.675676 10.930481
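
The nodesize comparison above can be scripted instead of being run one call at a time. This is a rough sketch under the same assumption as before: a 33-row sample of the bundled veteran data stands in for testData.txt, and the names sizes and res are illustrative. It records the average number of terminal nodes and the final OOB error for a few nodesize values; the numbers will differ from those in this thread because the data differ.

```r
library(randomForestSRC)

# Stand-in data (assumption): 33 rows from the bundled veteran data set.
data(veteran, package = "randomForestSRC")
set.seed(1)  # fixes the row sample only
dat <- veteran[sample(nrow(veteran), 33), ]

sizes <- c(15, 5, 3, 1)

# Fit one forest per nodesize and collect summary statistics.
res <- do.call(rbind, lapply(sizes, function(ns) {
  fit <- rfsrc(Surv(time, status) ~ ., data = dat, ntree = 100, nodesize = ns)
  err <- fit$err.rate
  data.frame(
    nodesize    = ns,
    mean_leaves = mean(fit$leaf.count),        # average terminal nodes per tree
    oob_error   = as.numeric(err[length(err)]) # last entry: error of the full forest
  )
}))

print(res)
```

With so few samples the OOB error is noisy, so it is worth comparing several values of nodesize rather than trusting a single run.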