DarioS closed this issue 1 year ago.
The sample size here is 33, and the default nodesize is 15 for survival families. With each tree grown on only 21 resampled cases, this results in a forest of stumps (i.e. trees with a single terminal node). The OOB error rate is 0.9, which is terrible.
> o=rfsrc(Surv(time, status) ~ ., data = testData)
> o
Sample size: 33
Number of deaths: 20
Number of trees: 500
Forest terminal node size: 15
Average no. of terminal nodes: 1
No. of variables tried at each split: 5
Total no. of variables: 17
Resampling used to grow trees: swor
Resample size used to grow trees: 21
Analysis: RSF
Family: surv
Splitting rule: logrank *random*
Number of random split points: 10
(OOB) CRPS: 0.11722355
(OOB) Requested performance error: 0.90131579
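The stump behaviour in this first run can be checked with simple arithmetic. The sketch below is only an illustration: it assumes that a node is split only when both daughters can hold at least nodesize cases (a simplification of the package's actual stopping rule) and that the resample size under swor is 0.632 times the sample size, rounded, which matches the 21 reported above. The variable names are hypothetical.

```r
# Values taken from the printed forest summary above.
n        <- 33   # sample size
nodesize <- 15   # default terminal node size for survival families

# Assumption: swor resampling draws about 63.2% of the cases per tree.
resample_size <- round(0.632 * n)

# Assumption: the root can only be split if both daughters
# can each hold at least `nodesize` cases.
can_split_root <- resample_size >= 2 * nodesize

cat("Resample size per tree:", resample_size, "\n")
cat("Root node can be split:", can_split_root, "\n")
```

On a fitted forest, the same thing can be inspected directly with `table(o$leaf.count)`, which tabulates the number of terminal nodes per tree.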
Decreasing nodesize improves results:
> o=rfsrc(Surv(time, status) ~ ., data = testData, nodesize=3)
> o
Sample size: 33
Number of deaths: 20
Number of trees: 500
Forest terminal node size: 3
Average no. of terminal nodes: 7.102
No. of variables tried at each split: 5
Total no. of variables: 17
Resampling used to grow trees: swor
Resample size used to grow trees: 21
Analysis: RSF
Family: surv
Splitting rule: logrank *random*
Number of random split points: 10
(OOB) CRPS: 0.11465175
(OOB) Requested performance error: 0.49342105
Even better (at least for error rate) are pure trees:
> o=rfsrc(Surv(time, status) ~ ., data = testData, nodesize=1)
> o
Sample size: 33
Number of deaths: 20
Number of trees: 500
Forest terminal node size: 1
Average no. of terminal nodes: 19.71
No. of variables tried at each split: 5
Total no. of variables: 17
Resampling used to grow trees: swor
Resample size used to grow trees: 21
Analysis: RSF
Family: surv
Splitting rule: logrank *random*
Number of random split points: 10
(OOB) CRPS: 0.13144386
(OOB) Requested performance error: 0.44078947
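The three runs above can be folded into one loop so that tree size and OOB error are compared side by side. This is a rough sketch, not a tuning recipe: because testData is not reproduced here, it uses the veteran data shipped with randomForestSRC as a stand-in, and the choices of ntree = 200 and seed = -1 are arbitrary. Substitute testData to repeat the comparison on the data in question.

```r
library(randomForestSRC)

# Stand-in data (assumption): the veteran lung cancer set from the package.
data(veteran, package = "randomForestSRC")

nodesizes <- c(15, 3, 1)

results <- do.call(rbind, lapply(nodesizes, function(ns) {
  o <- rfsrc(Surv(time, status) ~ ., data = veteran,
             nodesize = ns, ntree = 200, seed = -1)
  data.frame(
    nodesize   = ns,
    # Mean number of terminal nodes per tree; 1 means every tree is a stump.
    avg_leaves = mean(o$leaf.count),
    # err.rate may contain NA for intermediate trees; keep the final value.
    oob_error  = as.numeric(tail(na.omit(o$err.rate), 1))
  )
}))

print(results)
```

With a small sample, the `avg_leaves` column is the first thing to check: a value of 1 means no tree in the forest was able to split.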
Here are the OOB predicted values from the last call:
> o$predicted.oob
[1] 9.738636 5.620321 7.673575 8.183246 7.879581 7.967742 4.948571 6.889535 10.055556 9.955801 4.675824 7.458333 8.093407 2.446927
[15] 8.770950 5.163743 4.407821 5.149425 2.970238 3.965714 4.641026 6.116022 8.812155 4.194444 5.283237 3.438776 7.622222 6.945000
[29] 11.106145 10.287234 6.166667 7.675676 10.930481
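For survival forests, the reported error rate is 1 minus Harrell's concordance index computed from predicted values like these (higher predicted mortality should go with shorter survival), so 0.5 corresponds to random guessing and values near 1 mean the ranking is reversed. The function below is a simplified base R sketch of that idea, not the package's implementation: `concordance_error` is a hypothetical helper, it skips pairs with tied times, and the toy data are made up.

```r
# Simplified concordance error: 1 - (concordant pairs / usable pairs).
concordance_error <- function(time, status, risk) {
  usable <- 0
  score  <- 0
  n <- length(time)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (time[i] == time[j]) next            # simplification: skip tied times
      first  <- if (time[i] < time[j]) i else j
      second <- if (first == i) j else i
      if (status[first] != 1) next            # earlier case censored: order unknown
      usable <- usable + 1
      if (risk[first] > risk[second]) {
        score <- score + 1                    # earlier death has higher risk
      } else if (risk[first] == risk[second]) {
        score <- score + 0.5                  # tied predictions count as half
      }
    }
  }
  1 - score / usable
}

# Toy data (made up): four subjects, the third is censored.
toy_time   <- c(1, 2, 3, 4)
toy_status <- c(1, 1, 0, 1)

err_good     <- concordance_error(toy_time, toy_status, c(4, 3, 2, 1))  # perfect ranking
err_reversed <- concordance_error(toy_time, toy_status, c(1, 2, 3, 4))  # reversed ranking
err_constant <- concordance_error(toy_time, toy_status, rep(1, 4))      # no information
print(c(good = err_good, reversed = err_reversed, constant = err_constant))
```

Read this way, the 0.90 from the stump forest means the predictions are ranked almost backwards, while the 0.49 and 0.44 from the smaller node sizes are close to, or only slightly better than, chance.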
Is it a bug? It doesn't display any warnings about input data problems. testData.txt is attached for reproducing the output.