pretraining distribution

tranh215980 commented 2 months ago

Dear authors,

What is pretraining distribution of CTransPath and CHIEF? Is new CHIEF CTransPath model the same as previous one in Medical Image Analysis?

tranh215980 commented 2 months ago

Dear authors,

I did not examine all details of the and found pretraining slides in CHIEF in Supplementary Table 13. If discussion not yet closed I have more ddquestions.

In Table 13 does “pos. slide number” mean tumor and “neg. slide number” mean normal? Is CHIEF pretraining based on this positive-negative tumor prediction for 19 organs?
Total slides in Table 13 is 60,530 and this also is number of slides reported in main paper for pretraining. I am not familiar with slide counts for other datasets but I look in TCGA that number of brain tumor slides is over 1700 and number of lung tumor slides is over 1000. These numbers are same as number pos slides for these organs. If all slides in TCGA are used is their potential data leakage issue when first pretrain CHIEF with positive-negative tumor prediction task and second evaluate on TCGA mutation prediction / tumor origin prediction?

I am interested in making my own CHIEF model and compare in same setting as CHIEF.

Dadatata-JZ commented 2 months ago

Hi Tran,

As you may know, the pre-training was done using a weakly-supervised approach, utilizing these 60,530 slides. The aim of this pre-training was to build CHIEF with a "comprehensive" understanding of histology images.

tranh215980 commented 2 months ago

Hi @Dadatata-JZ,

What is this weakly-supervised approach? SCL-WC?

Eshay14 commented 2 months ago

Hi @Dadatata-JZ,

What is this weakly-supervised approach? SCL-WC?

Yes, I would like to know this too.

Eshay14 commented 2 months ago

@tranh215980 perhaps we can help each other. As far as I can understand, the authors pre-trained on a lot of public data not in a self-supervised manner but in a weakly-supervised manner USING a label for each slide. And then a lot of the SAME DATA was used for downstream evaluation tasks using fine tuning. I hope this is not the case, as it would unequivocally be data contamination, but from the information we have available in the paper and here this is what it seems like.

Dadatata-JZ commented 2 months ago

@Eshay14 thanks for your attempt to help explain @tranh215980 's questions.

To clarify, our image encoder utilizes the SELF-SUPERVISED CTransPath backbone to extract histopathology image feature representations at the tile level. For answers to the remaining questions, please refer to our response in the other thread.

https://github.com/hms-dbmi/CHIEF/issues/21#issuecomment-2354153604

Regarding data contamination, the first downstream task was validated using multi-center external datasets, none of which were involved in the pretraining phase. For the other downstream tasks (e.g., genomic profiling and prognostic predictions), we also validated with different external datasets. In cases where TCGA data may have been involved, the tuning labels were entirely different. As always, happy to chat. Opinions are welcome.

Eshay14 commented 2 months ago

@Dadatata-JZ I think its quite clear from my question that I am not referring to your ROI image encoder which is CTransPath, I am referring to your slide level encoder which was indeed trained in a weakly SUPERVISED manner with class labels. You explicitly state that you used TCGA slides with a supervised label for training your slide level encoder which is then again used for a lot of downstream tasks.

Would you care to explain how the class labels were different for the origin prediction task? You say that you use the 'anatomic site' during the training of your slide level encoder (https://github.com/hms-dbmi/CHIEF/issues/18#issuecomment-2347929387), lets say for for lung tissue this would be LUNG when pre-training the slide level encoder. Downstream, for the origin prediction task the same lung slide would be labeled as LUNG since you have no metastatic cases in your dataset. In my humble view, this is vanilla data contamination/leakage.

Dadatata-JZ commented 2 months ago

@Eshay14 Absolutely! Your understanding of the encoder is mostly correct. However, please note that the texts are inputs during the pre-training phase, not ground truth labels. The supervised labels are binary: 'cancer' and 'non-cancer.'

During inference (tumor origin prediction, EXTENDED FIGURE 1), these texts are no longer passed into CHIEF. Following evaluation standard in computational pathology, we reported the held-out (internal) set from TCGA for reference. CPTAC serves as the external, independent validation set, which was never used in the pre-training process.

Cheers,

tranh215980 commented 2 months ago

To clarify, our image encoder utilizes the SELF-SUPERVISED CTransPath backbone to extract histopathology image feature representations at the tile level. For answers to the remaining questions, please refer to our response in the other thread.

@Dadatata-JZ answer on ROI encoder was not related to your (@Eshay14) and my core issues but I have small problems here because CTransPath also trained on TCGA and PAIP which should be made clear. There is no label leakage but does not mean CTransPath is free of issue just because self-supervised. UNI paper released in 2023 studied issue of SSL trained and test on TCGA. It is not label leakage but after UNI paper it is clear TCGA SSL models should not evaluate TCGA prediction tasks. @Dadatata-JZ do you agree?

@Eshay14 I believe this is problematic but not severe. From my literature searches I find it was hard to find diverse data to pretrain and @Xiyue-Wang deserve achievement for releasing good model in 2022 and I do not mean to bring this fact up to create critique. However @Dadatata-JZ implies that SELF-SUPERVISED CTransPath is "contamination-free" and I cannot abide. Without TCGA there are still 50K slides that could be used for pretraining and authors should have taken chance to pretrain without TCGA. @Xiyue-Wang @Dadatata-JZ do you agree?

tranh215980 commented 2 months ago

@tranh215980 perhaps we can help each other. As far as I can understand, the authors pre-trained on a lot of public data not in a self-supervised manner but in a weakly-supervised manner USING a label for each slide. And then a lot of the SAME DATA was used for downstream evaluation tasks using fine tuning. I hope this is not the case, as it would unequivocally be data contamination, but from the information we have available in the paper and here this is what it seems like.

Would you care to explain how the class labels were different for the origin prediction task? You say that you use the 'anatomic site' during the training of your slide level encoder (https://github.com/hms-dbmi/CHIEF/issues/18#issuecomment-2347929387), lets say for for lung tissue this would be LUNG when pre-training the slide level encoder. Downstream, for the origin prediction task the same lung slide would be labeled as LUNG since you have no metastatic cases in your dataset. In my humble view, this is vanilla data contamination/leakage.

Dear @Eshay14 thank you for helping clarify this issue. On core issue of CHIEF pretraining distribution I agree there is maybe label leakage in "tumor origin" task but ask @Xiyue-Wang and @Dadatata-JZ to clear our understanding.

In recent response @Dadatata-JZ say they pass "cancer" and "non-cancer" for pretraining. I imagine this is like a CLIP loss that is added on top of SCL-WC. @Dadatata-JZ , is my understanding right?
In #21 "...During pretraining, text features (anatomical sites, e.g., lung, breast, brain) and image features are fused to facilitate slide-level classification...we performed slide-level positive (cancerous) and negative (non-cancerous) classifications". This mean pretraining and model optimization is done on TCGA data using SCL-WC+CLIP setup with both tumor presence AND anatomic site label? Is my understanding right?
True that text is no longer passed to CHIEF? In the codes, I see that text is being fused with last layer of MIL.
It is maybe true that text no longer passed to CHIEF during inference but anatomic site was still used during pretraining. Because "tumor origin" prediction task has only has tumor primary then this task is just site prediction task. In my screenshot you see that labels are also just anatomic site names. Is our understanding right?

amy-galon commented 2 months ago

@tranh215980 I am also following this closely, here is what I think: A. CTransPath features are used as the patch level encoder (SSL trained). B. CLIP is use as is to extract text feature from anatomic site C. The supervised classification problem posed is Tumor vs Normal (supervised training). This issue is that we still don't fully know. This was never really clear from the methods in the paper or there would not be this many questions. Despite the authors claiming in the paper that the code will be fully released the training code is still not public for the community to introspect. There have been so many questions about what exactly the authors did the only clear way to probe is by looking at the code and reproducing their results. Perhaps a note to the editor would compel them to release the code, since most data is public reproducing results should clarify a lot.

I completely agree with you @tranh215980 that @Xiyue-Wang needs to be fully praised for releasing CTransPath (Medical Image Analysis, 2022) before any other model was public, but I do have some serious concerns about this study which is very different in terms of design and rigor from the CTransPath paper. While I think @Eshay14 is perhaps a bit assertive in their long post (#23) they do have some very valid points, and the authors don't have clear answers to the most important aspects of their comments. I agree that there is data leakage. I am not sure if people are also following comments posted on PubPeer (https://pubpeer.com/publications/C8CFF9DB8F11A586CBF9BD53402001#5) but someone indicated that the same scenario used here was previously studied by @Xiyue-Wang in an ICLR 2023 paper (https://openreview.net/pdf?id=01KmhBsEPFO) where they concluded that data leakage across pre-training and downstream tasks contributes to enhanced model performance. In their own words: "One should pay attention not to exposing test datasets for the development of a feature embedding model, although no labels are used. For example, if the test set of CAMELYON16 is used by MoCov3 for feature embedding pretraining, we can achieve 0.9885 AUC classification performance on the test set with CLAM-MB. This exceptionally high performance is caused by data leakage." @Xiyue-Wang isin't the same scenario used in CHIEF?

hms-dbmi / CHIEF

pretraining distribution #20