mlr-org / mlr3proba

Probabilistic Learning for mlr3
https://mlr3proba.mlr-org.com/
GNU Lesser General Public License v3.0

`p_max` parameter for C-Index #383

Closed. vlegoff closed this issue 6 months ago

vlegoff commented 6 months ago

Hello mlr3proba team,

In Uno's article about the C-Index, he mentions truncating the C-Index with a prespecified τ:

> where τ is a prespecified time point such that pr(D > τ) > 0

the following being the justification for this:

> the tail part of the estimated survival function of T is rather unstable

This is possible in the current implementation of the C-index through the `cutoff` parameter, but when working with multiple datasets (e.g. in a benchmark), it would be useful to specify a censoring proportion `p_max` instead, in the same way as with the Graf score.

Reference: Uno, H., Cai, T., Pencina, M. J., D'Agostino, R. B., & Wei, L. J. (2011). On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine, 30(10), 1105–1117. https://doi.org/10.1002/sim.4154
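To illustrate why a proportion is more convenient than a fixed time across datasets, here is a minimal Python sketch that turns a `p_max` into a dataset-specific cutoff time. It assumes one plausible reading of "censoring proportion": the earliest observed time at which the cumulative fraction of censored observations exceeds `p_max`. The function name `t_max_from_p_max` is hypothetical, and this is not the mlr3proba implementation.

```python
def t_max_from_p_max(times, events, p_max):
    """Map a censoring proportion to a cutoff time (illustrative assumption).

    times  : observed times
    events : 1 = event observed, 0 = censored
    p_max  : proportion of censored observations (relative to all
             observations) after which the data are truncated
    """
    n = len(times)
    censored_so_far = 0
    # Walk through the observations in time order and count censorings.
    for t, e in sorted(zip(times, events)):
        if e == 0:
            censored_so_far += 1
        if censored_so_far / n > p_max:
            return t
    # The proportion is never exceeded: keep the whole follow-up.
    return max(times)


# The same p_max gives a different cutoff time on each dataset.
dataset_a = ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
dataset_b = ([5, 10, 15, 20], [0, 0, 1, 1])
cutoff_a = t_max_from_p_max(*dataset_a, p_max=0.25)
cutoff_b = t_max_from_p_max(*dataset_b, p_max=0.25)
```

With a fixed time cutoff, each dataset in a benchmark would need its own hand-picked value; a proportion adapts to each dataset's time scale.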

bblodfon commented 6 months ago

Thanks, I will have a look at the PR

bblodfon commented 6 months ago

@vlegoff I refined the PR and merged it to the main branch, let me know if anything goes super wrong! The `cutoff` arg is now `t_max`.

> I think it would be interesting to use all data (train and test) to estimate the censoring distribution used for weighting, in the same way as in the Graf score & with the same justification.

We never use both train and test data for G(t) (not even in graf). But we certainly use all of the training data first, before applying the `t_max` cutoff. See these lines where the estimation happens; later we pass `t_max` to the C function, which essentially filters the observations.
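The order of operations described here (estimate G(t) on the full training data, then apply the cutoff) can be sketched in Python as below. This is only an illustration under stated assumptions, not the package's code: `km_censoring` and `uno_c` are hypothetical names, `eps` is a hypothetical guard against division by a near-zero G, ties in predicted risk count as half, and G is evaluated at the event time itself.

```python
def km_censoring(train_times, train_events):
    """Kaplan-Meier estimate of the censoring distribution G(t),
    fitted on ALL training observations (no cutoff applied here)."""
    steps = []  # (time, value of G from that time onwards)
    g = 1.0
    for u in sorted(set(train_times)):
        at_risk = sum(1 for t in train_times if t >= u)
        n_cens = sum(1 for t, e in zip(train_times, train_events)
                     if t == u and e == 0)
        g *= 1.0 - n_cens / at_risk
        steps.append((u, g))

    def G(t):
        value = 1.0
        for u, g_u in steps:
            if u > t:
                break
            value = g_u
        return value

    return G


def uno_c(times, events, risks, G, t_max, eps=1e-3):
    """Truncated, inverse-probability-of-censoring weighted C-index
    on the test data; only events before t_max anchor a pair."""
    num = den = 0.0
    n = len(times)
    for i in range(n):
        if events[i] != 1 or times[i] >= t_max:
            continue  # the cutoff filters observations here
        w = 1.0 / max(G(times[i]), eps) ** 2
        for j in range(n):
            if times[i] < times[j]:
                den += w
                if risks[i] > risks[j]:
                    num += w        # concordant pair
                elif risks[i] == risks[j]:
                    num += 0.5 * w  # tie in predicted risk
    return num / den if den > 0 else float("nan")


# G comes from the training data only; the cutoff applies at scoring time.
G = km_censoring([2, 4, 6, 8], [1, 0, 1, 0])
c_short = uno_c([1, 3, 5, 7], [1, 1, 1, 0], [4, 2, 3, 1], G, t_max=4)
c_long = uno_c([1, 3, 5, 7], [1, 1, 1, 0], [4, 2, 3, 1], G, t_max=6)
```

In this toy example the later event at time 5 gets a larger weight because G has already dropped there, which is the instability in the tail that truncation is meant to limit.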

vlegoff commented 6 months ago

> We never use both train and test data for G(t) (not even in graf)

Yes, I misread the doc for the Graf score, thanks for pointing it out!