Add Brier Score metrics

Vincent-Maladiere commented 1 year ago

What does this PR implement?

This is the implementation of the Brier score metrics module:

BrierScoreComputer
~~BrierScoreSampler~~ to be moved to the GBIncidence source code
brier_score
brier_score_incidence
integrated_brier_score
integrated_brier_score_incidence

This also implements the IPCWEstimator used by the BrierScoreComputer. We have a dependency on lifelines but not on scikit-survival.

Remarks

~~- This fixes a subtle bug in y_binary, making our previous implementation of brier_score() incorrect. It now matches scikit-survival results.~~ ~~Edit: y_true_binary is 1 when we observed an event, and 0 otherwise, which is correct for our Gradient Boosting classification or regression target. However, when computing the Brier Score metric, we actually need np.logical_not(y_true_binary). Indeed, when we observed an event, the survival probability drops to 0. Otherwise, it stays at 1. The Brier Score uses the survival probability, not the incidence probability.~~

brier_score computes is the regular Brier Score like scikit-survival does, where the y_pred input is the survival probability. On the contrary, brier_score_incidence follows the Kretowska formula for competing events and expects y_pred to be the incidence probability for the kth cause of failure instead. Under the hood though, brier_score relies on brier_score_incidence, by simply inputting 1 - y_pred.

This is needed to disambiguate our Brier Score implementation. We could also only expose brier_score_incidence without the regular brier_score, but I thought it was handy to have it, instead of telling the user to perform 1 - y_pred for binary events, all the time.
When running
```
brier_score(y_train, y_test, y_pred, times)
```
times must be the times used for computing y_pred —of shape (n_samples, n_times)— otherwise, the Brier Score will be incorrect. This is tricky to check when y_pred is not a dataframe, so we only check for shapes.

ogrisel commented 1 year ago

Let's add some tests to check some mathematical properties:

IPCWEstimator should constantly predict 1.0 when the training data has no censored values,
check that assert_allclose(ipcw.predict([0]), [1.0]) when ipcw is fit on random data, even when the shorter time in the training set is censored.
check that ipcw.predict([t + eps]) is strictly larger than 1.0 when ipcw is fit on random censored data and t is the smallest censored duration in the training set.
check that IPCWEstimator predicts monotonically increasing values on a np.linspace(0, t_max, 100) when fit on random censored event data.
check that when a training set has deterministic censoring beyond a given t value (100% of the largest training events are censored), predicting with large t values is clipped to 1 / ipcw.min_censoring_survival_prob for different values of min_censoring_survival_prob.

ogrisel commented 1 year ago

Ok @Vincent-Maladiere I think this is ready for final pass of review on your end.

ogrisel commented 1 year ago

I rendered the HTML of the doc locally but I realize that we do not have a CI job to do it for the PR. Only for publishing the result in the merge of the PR.

I think the easiest way to achieve this would be to configure a circle CI job for this project. Let's do that later in a separate PR.

soda-inria / hazardous

Add Brier Score metrics #10