liupei101 opened 6 months ago
Hello, Jaume! Thanks for your impressive work.
You mentioned that the results with early stopping would not be fair. I want to ask if there is evidence for this statement. I am looking for fair and universal ways to evaluate the survival models, since I find that the final observed (or reported) performance is sensitive to how we evaluate.
Concretely, when doing 5-fold cross-validation with a fixed number of epochs (T), just one more training epoch can lead to a significantly different value of the final observed performance. In other words, the result of a 5-fold cross-validation evaluation is often sensitive to T. To avoid fixing T, one could adopt early stopping when training each fold, but, as you mentioned, this would still not be fair; from my humble understanding, one possible reason is that the validation set is too small to support reasonable early stopping. So, given the limited samples in computational pathology, what would be a fairer way to evaluate predictive models?
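For concreteness, below is a minimal sketch (not SurvPath's actual code) of the two per-fold training schedules I am comparing; `train_one_epoch` and `cindex_on` are hypothetical callbacks standing in for the real training and evaluation routines, and the model is assumed to expose a PyTorch-style `state_dict`:

```python
import copy
import numpy as np

def run_fold(model, train_idx, test_idx, train_one_epoch, cindex_on,
             T=30, early_stop=False, patience=5, max_epochs=200):
    """Train on one CV fold and return the test C-Index (hypothetical sketch)."""
    if not early_stop:
        # (a) fixed schedule: train exactly T epochs on the whole training fold
        for _ in range(T):
            train_one_epoch(model, train_idx)
    else:
        # (b) early stopping: carve a small validation split out of the training fold
        perm = np.random.default_rng(0).permutation(train_idx)
        val_idx, fit_idx = perm[: len(perm) // 5], perm[len(perm) // 5:]
        best_val, best_state, wait = -np.inf, None, 0
        for _ in range(max_epochs):
            train_one_epoch(model, fit_idx)
            val_c = cindex_on(model, val_idx)  # noisy when val_idx holds only a few dozen cases
            if val_c > best_val:
                best_val, best_state, wait = val_c, copy.deepcopy(model.state_dict()), 0
            else:
                wait += 1
                if wait >= patience:
                    break
        model.load_state_dict(best_state)
    # with a small test fold, one extra or missing epoch can shift this score noticeably
    return cindex_on(model, test_idx)
```

My concern is that in (a) the reported score depends on the choice of T, while in (b) the stopping point itself is decided on a tiny validation split, so neither feels fully reliable.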
Looking forward to hearing from your side.
I'm also very interested in this issue. What would be the fairest way to handle it? It seems that neither a fixed number of epochs nor an early-stopping mechanism with a validation set yields the most reliable results.
For someone who is also seeking fair means to configure and evaluate survival analysis models, there are some facts that could be helpful.
- Early stopping seems to be frequently adopted in the survival analysis community, e.g., in the representative MTLR [1] and DeepHit [2]. In these works, the scale of some datasets is similar to that of WSI-based survival analysis datasets (~1,000).
- Discrete survival models are common, like those used in SurvPath, Patch-GCN, and MCAT. In the survival analysis community, setting the number of discrete times to the square root of the number of uncensored patients is often suggested, as stated in the JMLR paper [3] and the ICML paper [4] (see the sketch after the references below). In addition, although the prediction is discrete, the survival time label is kept continuous in performance evaluation, e.g., in C-Index calculation.
- SurvPath, Patch-GCN, and MCAT set the number of discrete times to 4 by default. Moreover, their performance metric, C-Index, is calculated using the survival time label after quantile discretization.

[1] Yu, C.-N., Greiner, R., Lin, H.-C., and Baracos, V. Learning patient-specific cancer survival distributions as a sequence of dependent regressors. Advances in Neural Information Processing Systems, 24:1845–1853, 2011.
[2] Lee, C., Zame, W. R., Yoon, J., and van der Schaar, M. DeepHit: A deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[3] Haider, H., Hoehn, B., Davis, S., and Greiner, R. Effective ways to build and evaluate individual survival distributions. Journal of Machine Learning Research, 21(85):1–63, 2020.
[4] Qi, S. A., Kumar, N., Farrokh, M., Sun, W., Kuan, L. H., Ranganath, R., ... and Greiner, R. An effective meaningful way to evaluate survival models. In International Conference on Machine Learning, pp. 28244–28276. PMLR, 2023.
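To make the square-root rule and the quantile discretization above concrete, here is a rough sketch under assumed conventions (plain NumPy; the `times`/`events` arrays and function names are placeholders, not the actual code of SurvPath, Patch-GCN, or MCAT):

```python
import numpy as np

def make_time_bins(times, events):
    """Quantile bin edges with n_bins ~ sqrt(#uncensored), as suggested in [3, 4]."""
    uncensored_times = times[events == 1]
    n_bins = max(2, int(round(np.sqrt(len(uncensored_times)))))
    edges = np.quantile(uncensored_times, np.linspace(0.0, 1.0, n_bins + 1))
    return np.unique(edges)  # drop duplicate edges caused by tied times

def discretize(times, edges):
    """Map continuous times to interval labels 0 .. len(edges) - 2."""
    return np.clip(np.searchsorted(edges, times, side="right") - 1, 0, len(edges) - 2)

# Example: with ~400 uncensored patients the rule gives ~20 intervals,
# rather than the default of 4 used by SurvPath / Patch-GCN / MCAT.
```

Whether the C-Index is then computed against the continuous `times` or against the discretized labels is exactly the evaluation choice discussed above.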
Hi, Pei! I have a question. If I am not mistaken, it seems that in SurvPath, during 5-fold cross-validation, the validation set is the same as the test set. This differs from DeepHit, which splits the dataset into training and test sets in an 8:2 ratio and then performs 5-fold cross-validation on the training set. DeepHit's approach explicitly separates the test set from the validation set, which makes the use of early stopping understandable. However, with SurvPath's data splitting method, how can we ensure that the test set remains unseen and is not leaked when early stopping is used?
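For clarity, this is roughly how I understand the DeepHit-style protocol (a hypothetical sketch, not DeepHit's released code): hold out a test set first, then run 5-fold cross-validation on the remaining data, using each fold's held-out part as the validation set for early stopping.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def deephit_style_splits(n_patients, seed=0):
    """Return (fit_idx, val_idx, test_idx) triplets; test_idx never drives early stopping."""
    idx = np.arange(n_patients)
    dev_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=seed)
    splits = []
    for fit, val in KFold(n_splits=5, shuffle=True, random_state=seed).split(dev_idx):
        splits.append((dev_idx[fit], dev_idx[val], test_idx))
    return splits
```

Under SurvPath's splitting, by contrast, the fold used to decide when to stop is the same fold whose score is reported, which is the leakage I am worried about.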
I believe the current data splitting method has issues. Another drawback is that some datasets have very small sample sizes, for which a few-shot approach might be more appropriate.
Originally posted by @guillaumejaume in https://github.com/mahmoodlab/SurvPath/issues/4#issuecomment-2004749355