sebp / scikit-survival

Survival analysis built on top of scikit-learn
GNU General Public License v3.0
1.13k stars 215 forks source link

Add predict helpers such as: predict_expectation, predict_percentile (similar to lifelines) #190

Open konradsemsch opened 3 years ago

konradsemsch commented 3 years ago

Hi! I really like your package implementation and its compatibility with sklearn, but one thing that is refraining us from using it in our solution is the lack of the above predict function implementations (particularly predict_expectation). In our case, we need to know the expected number of days a given client is expected to remain a client. Ranking observations by the obtained "arbitrary" score will not work.

Given that the above function is based on the estimated survival functions it should be possible to implement (as long as I read the lifelines code correctly):

    def predict_expectation(self, X: DataFrame, conditional_after: Optional[ndarray] = None) -> pd.Series:
        r"""
        Compute the expected lifetime, :math:`E[T]`, using covariates X. This algorithm to compute the expectation is
        to use the fact that :math:`E[T] = \int_0^\inf P(T > t) dt = \int_0^\inf S(t) dt`. To compute the integral, we use the trapezoidal rule to approximate the integral.
        Caution
        --------
        If the survival function doesn't converge to 0, then the expectation is really infinity and the returned
        values are meaningless/too large. In that case, using ``predict_median`` or ``predict_percentile`` would be better.
        Parameters
        ----------
        X: numpy array or DataFrame
            a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns
            can be in any order. If a numpy array, columns must be in the
            same order as the training data.
        conditional_after: iterable, optional
            Must be equal is size to X.shape[0] (denoted `n` above).  An iterable (array, list, series) of possibly non-zero values that represent how long the
            subject has already lived for. Ex: if :math:`T` is the unknown event time, then this represents :math:`s` in
            :math:`T | T > s`. This is useful for knowing the *remaining* hazard/survival of censored subjects.
            The new timeline is the remaining duration of the subject, i.e. normalized back to starting at 0.
        Notes
        -----
        If X is a DataFrame, the order of the columns do not matter. But
        if X is an array, then the column ordering is assumed to be the
        same as the training dataset.
        See Also
        --------
        predict_median
        predict_percentile
        """
        subjects = utils._get_index(X)
        v = self.predict_survival_function(X, conditional_after=conditional_after)[subjects]
        return pd.Series(trapz(v.values.T, v.index), index=subjects)

Reference to the lifelines implementation: 1) https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html#lifelines.fitters.coxph_fitter.SemiParametricPHFitter.predict_expectation 2) https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/fitters/coxph_fitter.py

sebp commented 3 years ago

It is correct that the expected survival time can be estimated with the integral under the survival function. However, this only leads to a reasonable estimate if the last observed time point corresponds to an event such that the survival function is zero thereafter. If that's not the case, the expected survival time is undefined without making any additional assumptions.

Typical assumptions are (i) assuming the remaining instances experience an event directly after the last observed time point (S(t) becomes zero), (ii) assuming the remaining instances never experience an event (S(t) remains constant), (iii) assuming a parametric form, e.g. exponential curve, at the tail. Finally, estimating the restricted mean survival time, i.e., an expectation over a pre-defined interval rather than all of time, is from a practical perspective the most reasonable.

I don't know how lifelines handles this, but I think, when implementing this, it is important to make it explicit that estimating the mean survival time can be tricky and avoid giving an answer for which the underlying assumptions are not clear.

konradsemsch commented 3 years ago

Hi @sebp and thanks for coming back on this! Yes, there's a note in the lifelines documentation referring to what you have mentioned:

image

In general, I would consider this a really desirable feature in one of the future releases. I find your package much better integrated with the sklearn ecosystem, but the lack of these prediction helpers prevents us from using it. An additional point in favour of implementing this is that when I check other package implementations, such as the quite famous survival package in R, it also provides an option to predict the expected survival time.

I'd be happy to contribute here with a PR 😉

sebp commented 3 years ago

If you can contribute with a PR, that would be even better.

PragyanSubedi commented 3 years ago

Any updates on this issue for implementation? @sebp @konradsemsch

sebp commented 3 years ago

@PragyanSubedi I did not have time to look into it, I'm afraid.

konradsemsch commented 3 years ago

@PragyanSubedi - I've started looking into that. I should have a first PR draft in a week.

konradsemsch commented 3 years ago

@sebp - I added a basic first PR: https://github.com/sebp/scikit-survival/pull/196. Please take a look at it and let me know if you agree. Once we have that, I'd move further with implementing tests and adding this helper to other models as well.

bobbyx27 commented 2 years ago

Hi, @sebp : I wonder why not considering a predict_hazard_function in addition to predict_cumulative_hazard_function ? Or is there an obvious and simple way to get it, that I'm missing... Thank you very much for all the great work!

sebp commented 2 years ago

By definition, it would be the derivative of the cumulative hazard function.

In general, directly estimating the hazard function is tricky.

I'm curious, what would you like to use the hazard function for?

tomikrogerus commented 2 years ago

Hi, @konradsemsch and @sebp are you still working with this predict_expectation function? Would be really interested about it.

sebp commented 2 years ago

@tomikrogerus I'm currently not working on it. PR #196 is inactive too.