sebp / scikit-survival

Survival analysis built on top of scikit-learn
GNU General Public License v3.0

Terminal Node Constraint for Random Survival Forests #159

Open mastervii opened 3 years ago

mastervii commented 3 years ago

Is there a parameter that sets the minimum number of uncensored samples required at a leaf node? I think the original paper calls this "a minimum of d_0 > 0 unique deaths".

sebp commented 3 years ago

Currently, d_0 is essentially fixed at 1 (see this part of the code).

I think it would make sense to make it configurable by adding it as a hyper-parameter to SurvivalTree.

mastervii commented 3 years ago

I have tested it and found a few nodes with only censored samples. I think denom = 0 implies that the current node contains only censored samples (i.e., rs_total.n_events is 0 at every time point), so no further split takes place. The node then becomes a leaf containing only censored samples, which violates the constraint d_0 > 0.
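To illustrate why a node without events cannot be split, here is a minimal plain-Python sketch of the textbook two-sample log-rank statistic. It is not scikit-survival's Cython code; the function name logrank_parts and its arguments are hypothetical, and it assumes denom corresponds to the variance term of the statistic.

```python
def logrank_parts(times, events, in_left):
    """Return (numerator, variance) of the two-sample log-rank statistic.

    times:   observed time per sample
    events:  True if the sample is uncensored (an event was observed)
    in_left: True if the sample would go to the left child
    """
    numer = 0.0
    var = 0.0
    for t in sorted(set(times)):
        # Samples still at risk at time t, in total and in the left child.
        n = sum(1 for x in times if x >= t)
        n_left = sum(1 for x, g in zip(times, in_left) if x >= t and g)
        # Events observed exactly at time t, in total and in the left child.
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        d_left = sum(
            1 for x, e, g in zip(times, events, in_left) if x == t and e and g
        )
        numer += d_left - n_left * d / n
        if n > 1:
            p = n_left / n
            # Every term is multiplied by d, so no events means zero variance.
            var += p * (1.0 - p) * (n - d) / (n - 1) * d
    return numer, var


times = [1, 2, 3, 4]
in_left = [True, True, False, False]

# Node with only censored samples: both parts are zero for any split.
all_censored = logrank_parts(times, [False] * 4, in_left)

# Node with two events: the variance is positive, so the split can be scored.
with_events = logrank_parts(times, [True, False, True, False], in_left)
```

Because each variance term carries the factor d, a node whose samples are all censored yields a zero denominator regardless of where the split point falls.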

sebp commented 3 years ago

That's a very good point. This case seems to be more complicated, because in sklearn's code it is the TreeBuilder and Splitter, not the criterion, that determine whether a split is valid. One could misuse criterion.weighted_n_left, criterion.weighted_n_right and criterion.weighted_n_node_samples to refer only to uncensored samples, which would make the min_weight_fraction_leaf parameter behave similarly to d_0.
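As a rough sketch of that idea: sklearn's splitter rejects a candidate split when the weighted count of either child falls below a minimum leaf weight. If the weighted counts covered only uncensored samples, the same check would act like d_0. The function split_allowed and the parameter min_uncensored_leaf below are hypothetical names for illustration, not part of sklearn or scikit-survival.

```python
def split_allowed(events_left, events_right, min_uncensored_leaf):
    """Mimic the splitter's minimum-weight check, counting only events.

    events_left / events_right: event indicators (True = uncensored) of the
    samples that would fall into each child.
    min_uncensored_leaf: hypothetical threshold playing the role of d_0.
    """
    # "Misused" weighted counts: only uncensored samples contribute.
    weighted_n_left = float(sum(events_left))
    weighted_n_right = float(sum(events_right))
    return (
        weighted_n_left >= min_uncensored_leaf
        and weighted_n_right >= min_uncensored_leaf
    )


# The right child would hold only censored samples, so the split is rejected.
rejected = split_allowed([True, False], [False, False], min_uncensored_leaf=1)

# Both children keep at least one event, so the split is allowed.
accepted = split_allowed([True, False], [False, True], min_uncensored_leaf=1)
```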

mastervii commented 3 years ago

Good idea. So I could create additional variables, say weighted_n_node_uncensored_samples together with weighted_uncensored_left and weighted_uncensored_right, to keep track of the uncensored samples, and update them in update(). Then, if weighted_uncensored_left == 0.0 or weighted_uncensored_right == 0.0, proxy_impurity_improvement() would return -INFINITY.
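The bookkeeping described above could look roughly like this in plain Python. This is only a sketch of the proposal: the real criterion is a Cython class, the class name UncensoredSplitTracker is hypothetical, and raw_improvement stands in for whatever score the criterion would otherwise compute.

```python
import math


class UncensoredSplitTracker:
    """Track uncensored samples on each side of a moving split position."""

    def __init__(self, events):
        # events: event indicators of the node's samples, ordered by the
        # feature being split on (1 = uncensored, 0 = censored).
        self.events = [float(e) for e in events]
        self.weighted_n_node_uncensored_samples = sum(self.events)
        self.reset()

    def reset(self):
        # Start with every sample in the right child.
        self.pos = 0
        self.weighted_uncensored_left = 0.0
        self.weighted_uncensored_right = self.weighted_n_node_uncensored_samples

    def update(self, new_pos):
        # Move samples in [pos, new_pos) from the right child to the left.
        moved = sum(self.events[self.pos:new_pos])
        self.weighted_uncensored_left += moved
        self.weighted_uncensored_right -= moved
        self.pos = new_pos

    def proxy_impurity_improvement(self, raw_improvement):
        # Veto any split that leaves a child without uncensored samples.
        if (
            self.weighted_uncensored_left == 0.0
            or self.weighted_uncensored_right == 0.0
        ):
            return -math.inf
        return raw_improvement


tracker = UncensoredSplitTracker([0, 0, 1, 0, 1])
tracker.update(2)  # left child: two censored samples only
score_no_events_left = tracker.proxy_impurity_improvement(3.5)
tracker.update(3)  # left child now contains one event
score_valid = tracker.proxy_impurity_improvement(3.5)
```

Returning negative infinity means such a split can never be selected as the best one, although it does not by itself stop a node with no events at all from becoming a leaf.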