sktime / skpro

A unified framework for tabular probabilistic regression, time-to-event prediction, and probability distributions in Python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
238 stars 45 forks source link

[ENH] design - dealing with incomplete distributions such as predictive survival function estimates #249

Open fkiraly opened 6 months ago

fkiraly commented 6 months ago

Design and discussion issue on how to deal with the following:

Some algorithms and packages produce distributional predictions that are incomplete, in the sense that they almost, but not entirely, specify a full predictive distribution.

This is in tension with the predict_proba interface, which states that it returns a full distribution (full as in fully specified).

Examples of such returns are Kaplan-Meier or conditional survival function (= one minus cdf) estimates, where function evaluations are available only at some points of the prediction range, rather than over the entire range.

A concrete example output, given by both the scikit-survival and lifelines packages, is a 2D numpy array, with one index corresponding to instances in the test/inference set, and the other index corresponding to time points at which the survival function is evaluated. Entries are the predicted survival probabilities for the given instance at the given time point.
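To make the layout concrete, here is a minimal toy sketch of such an array. The time grid and all values are made up for illustration; nothing here calls scikit-survival or lifelines, and the names `times` and `surv` are hypothetical.

```python
import numpy as np

# Hypothetical time grid, e.g. distinct event times seen in training.
times = np.array([1.0, 2.0, 5.0, 8.0])

# Toy survival predictions: rows = test instances, columns = time points.
# surv[i, j] is read as the predicted P(T_i > times[j]); values are invented.
surv = np.array([
    [0.9, 0.7, 0.4, 0.1],  # "typical" instance
    [1.0, 1.0, 1.0, 1.0],  # predicted to survive the whole grid
    [0.0, 0.0, 0.0, 0.0],  # predicted to die before the first grid point
])

n_instances, n_times = surv.shape

# The cdf is only known at the grid points, as one minus survival.
cdf = 1.0 - surv

# Sanity check: each survival curve is non-increasing along the time axis.
is_monotone = bool(np.all(np.diff(surv, axis=1) <= 0))
```

Note that the array says nothing about the distribution between, before, or after the grid points, which is the sense in which the prediction is incomplete.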

Even if we make the approximate assumption that the predicted distribution is supported only at the time points observed in the training data (i.e., is a weighted sum of delta distributions), there are boundary effects which prevent a bijective mapping onto fully specified probability distributions.

For instance, consider predictions where survival is estimated as constant zero or constant one. Here, the survival model makes a reasonable prediction: that the instance dies before the first time point in the training data, or survives until after the last. Similar boundary effects occur when attempting to map onto an empirical distribution.

These effects are not severe if the first and last survival probabilities are close to one and zero, respectively, but they become more impactful the further the predictions are from this.
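The boundary effect can be illustrated with a small numpy sketch. It assumes the weighted-delta reading above, with `surv[i, j]` interpreted as P(T_i > t_j) on a fixed grid; the helper name `survival_to_masses` and the toy values are hypothetical and not part of skpro.

```python
import numpy as np


def survival_to_masses(surv):
    """Split each survival curve into head, interior, and tail mass.

    Assumes surv[i, j] = P(T_i > t_j) on an increasing time grid.
    """
    surv = np.asarray(surv, dtype=float)
    # Mass at or before the first grid point: its exact location is unknown.
    head = 1.0 - surv[:, 0]
    # Mass assigned to grid points t_2, ..., t_K as drops in survival.
    interior = surv[:, :-1] - surv[:, 1:]
    # Mass beyond the last grid point: not placed at any grid point.
    tail = surv[:, -1]
    return head, interior, tail


# Toy curves (invented values): typical, constant one, constant zero.
surv = np.array([
    [0.9, 0.7, 0.4, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
])

head, interior, tail = survival_to_masses(surv)

# Head, interior, and tail together always account for the full mass.
total = head + interior.sum(axis=1) + tail
```

For the constant-one curve all mass ends up in `tail`, and for the constant-zero curve all mass ends up in `head`, so neither can be represented by deltas on the grid without an additional convention for where the boundary mass goes.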

There are multiple questions in this:

fkiraly commented 6 months ago

@VascoSch92, this may be of interest to you because of: