havakv / pycox

Survival analysis with PyTorch
BSD 2-Clause "Simplified" License
803 stars 188 forks source link

About the events input for kaplan meier estimator #91

Closed musikisomorphie closed 3 years ago

musikisomorphie commented 3 years ago

Hi @havakv, thanks for your great work. It really helps my research a lot. I just have one minor issue regarding the input for kaplan meier function: in your implementation of add_km_censor() function, you feed 1 - self.events to the kaplan_meier:

def add_km_censor(self, steps='post'):
        """Add censoring estimates obtained by Kaplan-Meier on the test set
        (durations, 1-events).
        """
        km = utils.kaplan_meier(self.durations, 1-self.events)
        surv = pd.DataFrame(np.repeat(km.values.reshape(-1, 1), len(self.durations), axis=1),
                            index=km.index)
        return self.add_censor_est(surv, steps)

When I check the kaplan function, the survive is computed as follow:

def kaplan_meier(durations, events, start_duration=0):
  ...
    survive = 1 - di / ni
    zero_survive = survive == 0
    if zero_survive.any():
        i = np.argmax(zero_survive)
        surv = np.zeros_like(survive)
        surv[:i] = np.exp(np.log(survive[:i]).cumsum())
        # surv[i:] = surv[i-1]
        surv[i:] = 0.
    else:
        surv = np.exp(np.log(1 - di / ni).cumsum())
    if start_duration < surv_idx.min():
        tmp = np.ones(len(surv)+ 1, dtype=surv.dtype)
        tmp[1:] = surv
        surv = tmp
        tmp = np.zeros(len(surv_idx)+ 1, dtype=surv_idx.dtype)
        tmp[1:] = surv_idx
        surv_idx = tmp
    surv = pd.Series(surv, surv_idx)
    return surv

where di in survive = 1 - di / ni is basically the sum of events computed in the _group_loop(...). According to the definition of kaplan meier estimator (https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator), it seems to me that we should feed self.eventsinstead of 1 - self.events to utils.kaplan_meier(...)above. Maybe I am missing some other details. Could you help me with it.? Thank you very much.

havakv commented 3 years ago

Hi @musikisomorphie! Thank you for the kind words.

I might completely misunderstand your question, but the reason for passing 1 - self.events in add_km_censor() is that it estimates the survival function of the censoring times (NOT the event times). We need these estimates of the censoring distribution to compute some scores like the IPCW Brier score.

If we want to get Kaplan-Meier estimate for the survival distribution (event times), you are correct that we should consider self.events instead.

Does this answer your question?

musikisomorphie commented 3 years ago

@havak, thank you for your reply. You answered my question, I was confused about the utility of add_km_censor() function. I thought it merely computes Kaplan-Meier estimator and didn't realize that it was an intermediate step for computing other scores. Thank you again.

havakv commented 3 years ago

Happy to help!

yuvrajiro commented 10 months ago

Hi @havakv,

I am writing a paper on survival analysis, in which I am using the evaluation module, to calculate Concordance and IBS, I have just a simple question ,The censoring distribution is calculated using the train dataset at many places but I think evaluation module use the test data to create censoring distribution, I am bit confused, please help me out understand the it.