szc19990412 / TransMIL

TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification

Unfaithful accuracy numbers on the TCGA-NSCLC dataset #19

Closed binli123 closed 1 year ago

binli123 commented 1 year ago

According to your paper, the total number of TCGA-NSCLC slides is 993 and you used 4-fold cross-validation, so each slide is classified exactly once, in a hold-out set, over the course of the cross-validation. Checking the numbers in Table 1 of your paper, none of these accuracy values can be obtained by dividing an integer by 993 (i.e., there is no integer number of correctly classified slides N for which N/993 equals the reported accuracy). Could you explain this?

binli123 commented 1 year ago

You also copied the accuracy and AUC of the MIL-RNN model (0.8619, 0.9107) on the TCGA-NSCLC dataset from Table 2 (bottom) of this paper.

The accuracy value of 0.8619 was obtained on the 210-slide test split (181 correct out of 210, 181/210 = 0.8619). However, cross-validation on 993 slides cannot yield this value: if 855 of the 993 slides were correctly classified, the accuracy would be 0.8610, and if 856 of the 993 slides were correctly classified, it would be 0.8620.
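The arithmetic above can be checked with a short script. This is only a sketch: it assumes the reported accuracy is a single pooled ratio (correct slides divided by total slides) rounded to four decimal places, and the helper name `matching_counts` is made up for illustration.

```python
def matching_counts(reported_acc, n_slides, decimals=4):
    """Return every integer N in [0, n_slides] for which N / n_slides
    agrees with reported_acc once rounded to `decimals` places."""
    # Half a unit in the last reported decimal place, plus a tiny float margin.
    tol = 0.5 * 10 ** (-decimals) + 1e-12
    return [n for n in range(n_slides + 1)
            if abs(n / n_slides - reported_acc) <= tol]


# 210-slide test split: exactly one count reproduces 0.8619.
print(matching_counts(0.8619, 210))   # [181]

# 993 slides: no count reproduces 0.8619.
print(matching_counts(0.8619, 993))   # []

# The two nearest candidates on 993 slides.
print(f"{855 / 993:.4f}")             # 0.8610
print(f"{856 / 993:.4f}")             # 0.8620
```

With 993 slides, each additional correct slide moves the accuracy by about 0.001, so most four-decimal values have no matching integer count under this assumption.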

What was your intention in copying this accuracy-AUC pair? Have you really re-evaluated others' methods, or just written down some good-looking numbers?

binli123 commented 1 year ago

Please also comment on #8. This can also be offensive to other groups in the field.

szc19990412 commented 1 year ago

Thanks for pointing this out. There are some errors in the dataset statistics part of the paper. For the MIL-RNN scheme, we reproduced the results on CAMELYON16 following the scheme proposed in the original paper. CAMELYON16 is a binary classification problem of cancer versus non-cancer, so the non-cancer WSIs can help the network find the cancer regions in the cancer WSIs. The NSCLC dataset, by contrast, is a binary classification problem between cancer subtypes, and it contains non-cancer patches, cancer subtype 1 patches, and cancer subtype 2 patches. Due to limited time, we did not find a good reproduction scheme at that time, so we reported the reproduction results from DSMIL. In addition, for the reproduction of all comparison methods, we used the same experimental conditions as for our own method, which may make it impossible to reproduce the results originally reported for those methods. There are some imprecise places in the paper, and we will pay attention to these problems in future work.

Lafite-Yu commented 1 year ago

According to your paper, the total number of TCGA-NSCLC slides is 993 and you used 4-fold cross-validation, so each slide is classified exactly once, in a hold-out set, over the course of the cross-validation. Checking the numbers in Table 1 of your paper, none of these accuracy values can be obtained by dividing an integer by 993 (i.e., there is no integer number of correctly classified slides N for which N/993 equals the reported accuracy). Could you explain this?

I am working with the TCGA-Lung dataset provided by the DSMIL code. There are 1053 slides, which is consistent with the GDC Data Portal, except for 11 slides that are corrupted and have thus been excluded. I can't find how the 993 files were selected, and I hope the author can provide a data split file for this dataset.