Confusion on calculation of C-index

rk2900 / DRSA

Deep Recurrent Survival Analysis, an auto-regressive deep model for time-to-event data analysis with censorship handling. An implementation of our AAAI 2019 paper and a benchmark for several (Python) implemented survival analysis methods.


Confusion on calculation of C-index #3

weijtang closed this issue 5 years ago

weijtang commented 5 years ago

The definition of the C-index is $$P( S(t_i|x_j) > S(t_i | x_i) \mid t_i < t_j)$$ where $$t_i = \min(z_i, b_i)$$. It is usually approximated by the empirical estimate $$\sum_{y_i = 1}\sum_{t_j > t_i} 1(S(t_i|x_j) > S(t_i | x_i))$$. In your code, at line 575 in BASE_MODEL.py and line 171 in deephit.py, you use "roc_auc_score(y_batch, wb)", which empirically approximates $$P( W(b_j|x_j) > W(b_i|x_i) \mid y_j = 1, y_i = 0)$$, where $$y_i = 1$$ if $$z_i < b_i$$ and $$y_i = 0$$ otherwise. This calculation confuses me: is it equivalent to the C-index? I am using the notation from your paper here.
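To make the question concrete, here is a minimal sketch that computes both quantities on made-up data. Everything in it is an assumption for illustration: the exponential form of `surv`, the toy arrays, the use of the observed time as a stand-in for $$b_i$$, and the hand-rolled `pairwise_auc`, which counts ordered (positive, negative) pairs in the usual rank-based way instead of calling `roc_auc_score`. The pairwise count is divided by the number of comparable pairs so that it lands in [0, 1].

```python
import math

def surv(t, x):
    """Toy survival function S(t | x) = exp(-x * t); x acts as a hazard rate (assumption)."""
    return math.exp(-x * t)

def empirical_c_index(t, y, x):
    """Share of comparable pairs (y_i = 1, t_j > t_i) with S(t_i|x_j) > S(t_i|x_i)."""
    concordant = 0
    comparable = 0
    for i in range(len(t)):
        if y[i] != 1:  # only an observed event can anchor a comparable pair
            continue
        for j in range(len(t)):
            if t[j] > t[i]:
                comparable += 1
                if surv(t[i], x[j]) > surv(t[i], x[i]):
                    concordant += 1
    return concordant / comparable

def pairwise_auc(labels, scores):
    """P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up data: observed time t, event indicator y (1 = event observed), covariate x.
t = [2.0, 3.0, 5.0, 6.0, 8.0]
y = [1, 0, 1, 0, 1]
x = [0.9, 0.1, 0.5, 0.6, 0.2]

c_index = empirical_c_index(t, y, x)

# W(b_i | x_i) = 1 - S(b_i | x_i), with the observed time standing in for b_i (assumption).
w = [1.0 - surv(ti, xi) for ti, xi in zip(t, x)]
auc_on_status = pairwise_auc(y, w)

print("pairwise C-index:", c_index)
print("AUC against event status:", auc_on_status)
```

On this toy data the two numbers come out different, which only shows that they are not the same quantity in general; it says nothing about how close they are on the datasets in the paper.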

rk2900 commented 5 years ago

There are several ways to calculate C-index. As is stated in the related works, e.g., [1] and [2], if y_i is binary, then the C-index is the AUC, i.e., the area under the Receiver Operating Characteristic (ROC) curve.

[1] Wang et al. Machine Learning for Survival Analysis: A Survey.

[2] Li et al. A Multi-Task Learning Formulation for Survival Analysis.

weijtang commented 5 years ago

> There are several ways to calculate C-index. As is stated in the related works, e.g., [1] and [2], if y_i is binary, then the C-index is the AUC, i.e., the area under the Receiver Operating Characteristic (ROC) curve.
>
> [1] Wang et al. Machine Learning for Survival Analysis: A Survey.
>
> [2] Li et al. A Multi-Task Learning Formulation for Survival Analysis.

I have read these two papers. The C-index can be viewed as a weighted sum of time-dependent AUROCs. In the sentence "If y_i is binary, then ...", what they mean is that FOR A GIVEN TIME s, every t_i = min(b_i, z_i) has a binary label: t_i > s or t_i < s. But this "binary label" is NOT the censoring status (1 if z_i < b_i; 0 otherwise). In your code, roc_auc_score(y_batch, wb), y_batch is the censoring status, if I understand correctly. That is why I am confused.
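A small sketch of what a label "for a given time s" looks like, and how it differs from the censoring status. The data, the `risk` scores, and the helper names `labels_at` and `auc_at` are all hypothetical. As a simplification (an assumption, not something taken from [1] or [2]), samples censored before s are simply dropped because their status at s is unknown; no censoring weights are applied.

```python
def labels_at(s, t, y):
    """Label at horizon s: 1 if the event was observed by s, 0 if still at risk
    after s, None if censored before s (status at s unknown)."""
    out = []
    for ti, yi in zip(t, y):
        if ti > s:
            out.append(0)
        elif yi == 1:
            out.append(1)
        else:
            out.append(None)
    return out

def pairwise_auc(labels, scores):
    """P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [sc for l, sc in zip(labels, scores) if l == 1]
    neg = [sc for l, sc in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_at(s, t, y, risk):
    """AUC of the risk score against the horizon-s label, dropping unknown labels."""
    kept = [(l, r) for l, r in zip(labels_at(s, t, y), risk) if l is not None]
    return pairwise_auc([l for l, _ in kept], [r for _, r in kept])

# Made-up data: observed time t, event indicator y (1 = event observed), risk score.
t = [2.0, 3.0, 5.0, 6.0, 8.0]
y = [1, 0, 1, 0, 1]
risk = [0.9, 0.1, 0.5, 0.6, 0.2]

labels_s4 = labels_at(4.0, t, y)  # depends on s, unlike the fixed status vector y
auc_s4 = auc_at(4.0, t, y, risk)
auc_s55 = auc_at(5.5, t, y, risk)

print("labels at s=4.0:", labels_s4, "vs. status:", y)
print("AUC at s=4.0:", auc_s4, " AUC at s=5.5:", auc_s55)
```

The label vector changes with s, so there is one AUC per horizon, whereas the status vector `y` is fixed for the whole dataset.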

rk2900 commented 5 years ago

In our experiment code, y means t > z_i, i.e., "not censored". The two labelings are interchangeable: flipping the label goes together with flipping the prediction to p' = 1 - p, and AUC_{y is censored} = 1 - AUC_{y is uncensored}. Our label is correct; otherwise the AUC would be less than 0.5 :)
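The label-flip identities can be checked numerically with a minimal sketch. The labels and scores below are made up, and `pairwise_auc` is a hand-rolled pairwise count used here as a stand-in for `roc_auc_score` (an assumption for self-containment).

```python
def pairwise_auc(labels, scores):
    """P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = "not censored") and predicted probabilities.
y = [1, 0, 1, 0, 1]
p = [0.8, 0.3, 0.6, 0.7, 0.9]

y_flipped = [1 - yi for yi in y]  # 1 = "censored"
p_flipped = [1 - pi for pi in p]  # p' = 1 - p

auc = pairwise_auc(y, p)
auc_label_flipped = pairwise_auc(y_flipped, p)         # equals 1 - auc
auc_both_flipped = pairwise_auc(y_flipped, p_flipped)  # equals auc

print(auc, auc_label_flipped, auc_both_flipped)
```

Flipping only the label mirrors the AUC around 0.5, and flipping both label and prediction leaves it unchanged. This holds for any binary label, so it does not by itself settle which binary label the C-index literature has in mind.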

weijtang commented 5 years ago

My point is that in the literature [1], the binary label is not the censoring status (censored or not censored) at all, neither as p nor as 1-p. And the C-index is the weighted sum of these AUROCs over all possible survival times.

rk2900 commented 5 years ago

They are exactly the same thing; both are based on the cumulative distribution function. The roc_auc_score function is used for the AUC calculation.