Design and discussion issue on how to deal with the following:
Some algorithms and packages produce distributional predictions that are incomplete, in the sense that they specify the predictive distribution almost fully, but not entirely.
This is in tension with the predict_proba interface which states that it returns a full distribution (full as in, fully specified).
Examples of such returns are Kaplan-Meier or conditional survival function (= one minus cdf) estimates, where function evaluations are available only at some points of the prediction range, rather than over the entire range.
A concrete example output, given by both the scikit-survival and lifelines packages, is a 2D numpy array, with one index corresponding to instances in the test/inference set, and the other index corresponding to time points at which the survival function is evaluated. Entries are the predicted survival probabilities for the given instance at the given time point.
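To make the layout concrete, here is a minimal sketch of such an array. The values, the time grid, and the choice of rows for instances and columns for time points are made up for illustration and are not output of either package.

```python
import numpy as np

# Hypothetical time grid at which the survival function is evaluated.
times = np.array([1.0, 2.0, 5.0])

# Hypothetical predictions: rows = instances, columns = time points (assumed orientation).
surv = np.array([
    [0.9, 0.6, 0.2],  # instance 0: an ordinary decreasing survival curve
    [1.0, 1.0, 1.0],  # instance 1: survival estimated as constant one
    [0.0, 0.0, 0.0],  # instance 2: survival estimated as constant zero
])

n_instances, n_times = surv.shape

# Predicted survival of instance 0 at time 2.0, looked up via the grid.
s_0_at_2 = surv[0, np.searchsorted(times, 2.0)]
```

Nothing in the array says what happens before the first or after the last grid point, which is where the incompleteness comes from.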
Even if we make the approximating assumption that the predicted distribution is supported only at the time points observed in the training data (i.e., it is a weighted sum of delta distributions), there are boundary effects which prevent a bijective mapping onto fully specified probability distributions.
For instance, consider predictions where survival is estimated as constant zero, or constant one. Here, the survival model makes the reasonable prediction that the instance dies before the first, or survives until after the last, time point in the training data.
Similar boundary effects occur when attempting to map onto an empirical distribution.
These are not severe if the first and last survival probabilities are close to one and zero, respectively, but become more impactful the further they are from these values.
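The size of the boundary effect can be quantified with a rough sketch like the one below. It assumes the convention S(t) = P(T > t), rows as instances, and columns as an increasing time grid; the function name `boundary_masses` and the example values are hypothetical. Under these assumptions, the mass that cannot be placed strictly inside the grid is one minus the first survival value plus the last survival value.

```python
import numpy as np

def boundary_masses(surv):
    """Split each row's probability mass into left, interior, and right parts.

    Assumes surv[i, j] = P(T_i > t_j) for an increasing time grid t_1 < ... < t_k.
    """
    surv = np.asarray(surv, dtype=float)
    left = 1.0 - surv[:, 0]            # mass at or before the first grid point
    interior = -np.diff(surv, axis=1)  # mass in (t_{j-1}, t_j] for j = 2, ..., k
    right = surv[:, -1]                # mass after the last grid point
    return left, interior, right

# Hypothetical predictions, including the constant-one and constant-zero cases.
surv = np.array([
    [0.9, 0.6, 0.2],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
])

left, interior, right = boundary_masses(surv)
total = left + interior.sum(axis=1) + right  # each row sums to one
```

For the constant-one row all mass lands in `right`, and for the constant-zero row all mass lands in `left`, so neither can be represented by weights on the interior grid points alone. Any mapping to an Empirical distribution has to decide where to put `left` and `right`, for example by assigning them to the first and last grid point, which is exactly the non-bijective step.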
There are multiple questions in this:
If we return Empirical distributions, what is the best choice?
Or are there better choices of returned distributions?
Taking even more steps back, should there be a separate interface point, or even a separate object type, for incomplete distributions? Or for improper distributions?