hms-dbmi / CHIEF

Clinical Histopathology Imaging Evaluation Foundation Model
GNU Affero General Public License v3.0

CHIEF pretrained features are worse than random initialization or simple mean pooling #24

Open amy-galon opened 1 month ago

amy-galon commented 1 month ago

We are following the concerns being raised about this study both publicly on this forum (#23, #20, #21), on PubPeer (https://pubpeer.com/publications/C8CFF9DB8F11A586CBF9BD53402001), and privately. Most concerns are about how the authors trained their model (#20, #21, #23) and about the downstream evaluation (#23). While many of these are serious and valid concerns, we wanted to assess the overall representation quality of CHIEF embeddings and their downstream performance. In our quantitative analysis on two separate datasets and classification tasks (binary and multiclass), we find that CHIEF slide embeddings are largely meaningless: even the mean of all patch features or random initialization performs better. In this critique we will remain as objective as possible, back everything with explicit analysis, and explain our experimental process in detail. We openly invite others in the community to conduct similar analyses and join forces in scrutinizing this study.

CHIEF embeddings underperform mean pooling and random pooling

To assess the representation quality of pretrained CHIEF embeddings, we compared against the following baselines:

- Mean Pooling: simply take the average of all patch features in the bag and use it as the slide feature.
- Random Pooling: pass the patch features through a randomly initialized ABMIL model and use its output as the slide feature.

While the authors repeatedly argue that at the time of submission there were no other models available for comparison (https://github.com/hms-dbmi/CHIEF/issues/19#issuecomment-2345081445), we believe everyone would agree that mean pooling and random pooling are appropriate baselines. If CHIEF can't outperform the mean of all patch embeddings or random initialization, what is the purpose of the complex method proposed in the paper?
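For clarity, here is a minimal sketch of these two baselines, assuming each slide is a bag of pre-extracted CTransPath patch features of shape (N, 768). The gated-attention module below is a generic ABMIL (Ilse et al., 2018), not the exact aggregator from the CHIEF repository.

```python
import torch
import torch.nn as nn

def mean_pooling(patch_feats: torch.Tensor) -> torch.Tensor:
    """Slide feature = average of all patch features in the bag: (N, D) -> (D,)."""
    return patch_feats.mean(dim=0)

class RandomABMILPooling(nn.Module):
    """Gated-attention MIL pooling (Ilse et al., 2018) kept at random initialization.

    A generic ABMIL, not the exact CHIEF aggregator; it only produces an untrained
    attention-weighted slide embedding to serve as the "Random Pooling" baseline.
    """
    def __init__(self, in_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.attn_V = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())
        self.attn_U = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)

    @torch.no_grad()
    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (N, D) bag of patch embeddings for one slide
        scores = self.attn_w(self.attn_V(patch_feats) * self.attn_U(patch_feats))  # (N, 1)
        attn = torch.softmax(scores, dim=0)
        return (attn * patch_feats).sum(dim=0)  # attention-weighted slide feature (D,)

# Example bag: 5,000 CTransPath patch features (768-dim)
bag = torch.randn(5000, 768)
slide_mean = mean_pooling(bag)
slide_random = RandomABMILPooling()(bag)
```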

All pre-extracted slide features from CHIEF (Pretrain), mean pooling, and random pooling were evaluated via linear probe (logistic regression), following current practice in the self-supervised learning community (https://arxiv.org/abs/2304.07193). We evaluated two tasks: (1) TCGA-BRCA (5-fold site-stratified splits), and (2) EBRAINS (30-class brain tumor subtyping, using the same train-validation-test splits as UNI; these splits have been commonly used by many studies since the UNI authors released them). All comparisons used the same bags of pre-extracted CTransPath features (extracted at 10X via the CLAM pipeline, which was also used by the authors). We used the CTransPath model weights from this GitHub repository, not the old checkpoint. Overall, we find that slide pretraining in CHIEF is worse than mean pooling and random pooling.
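A minimal sketch of the linear-probe protocol, assuming slide features and integer labels are already stacked into arrays; the standardization step and regularization strength are illustrative choices, not a prescription of our exact sweep.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

def linear_probe(train_X, train_y, test_X, test_y, C=1.0):
    """Logistic-regression probe on frozen slide features (one row per slide)."""
    scaler = StandardScaler().fit(train_X)
    clf = LogisticRegression(C=C, max_iter=10000)
    clf.fit(scaler.transform(train_X), train_y)
    prob = clf.predict_proba(scaler.transform(test_X))
    auroc = (roc_auc_score(test_y, prob[:, 1]) if prob.shape[1] == 2  # binary
             else roc_auc_score(test_y, prob, multi_class="ovr", average="macro"))
    pred = clf.predict(scaler.transform(test_X))
    return {"AUROC": auroc, "BalAcc": balanced_accuracy_score(test_y, pred)}

# Same probe, same splits, different slide features:
# for name in ("CHIEF", "Mean Pooling", "Random Pooling"):
#     print(name, linear_probe(X_train[name], y_train, X_test[name], y_test))
```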

| ROI Encoder | MIL Architecture | Pretrain Method | Evaluation | TCGA-BRCA (binary) AUROC | EBRAINS (30-class) Bal. Accuracy |
| --- | --- | --- | --- | --- | --- |
| CTransPath | ABMIL | CHIEF | Linear Probe | 0.905 | 0.542 |
| CTransPath | ABMIL | Mean Pooling | Linear Probe | 0.933 | 0.592 |
| CTransPath | ABMIL | Random Pooling | Linear Probe | 0.928 | 0.647 |

CHIEF finetuning underperforms random initialization

To assess the effectiveness of finetuning CHIEF, we also compared it with a randomly initialized ABMIL. We used the same hyper-parameters and code implementation as the repository, and the same tasks and splits reported above. We find that CHIEF finetuning also underperforms random initialization.
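For context, the only intended difference between the two rows below is the initialization of the aggregator; a rough sketch under that assumption (the checkpoint path, key names, and loading details are hypothetical placeholders, while the actual finetuning loop and hyper-parameters are taken unchanged from the CHIEF repository).

```python
import copy
import torch

def make_finetune_variants(abmil_classifier: torch.nn.Module,
                           ckpt_path: str = "chief_pretrained.pth"):
    """Two copies of the same ABMIL classifier: one warm-started, one random.

    `ckpt_path`, the checkpoint format, and the key matching are hypothetical
    placeholders, not the CHIEF repository's API.
    """
    random_init = copy.deepcopy(abmil_classifier)     # row "Random Init."
    chief_init = copy.deepcopy(abmil_classifier)      # row "CHIEF"
    state = torch.load(ckpt_path, map_location="cpu")
    chief_init.load_state_dict(state, strict=False)   # load matching aggregator keys only
    return {"CHIEF": chief_init, "Random Init.": random_init}

# Both variants are then finetuned with identical optimizer, schedule, and splits,
# so any gap in the table is attributable to the pretrained initialization alone.
```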

| ROI Encoder | MIL Architecture | Pretrain Method | Evaluation | TCGA-BRCA (binary) AUROC | EBRAINS (30-class) Bal. Accuracy |
| --- | --- | --- | --- | --- | --- |
| CTransPath | ABMIL | CHIEF | Finetune | 0.884 | 0.549 |
| CTransPath | ABMIL | Random Init. | Finetune | 0.913 | 0.567 |

Comment

Given that the authors already published CTransPath (Medical Image Analysis) as an ROI-level encoder, if the CHIEF slide encoder can't outperform the mean of all patch embeddings or random initialization, it heavily undermines and negates the value of the pretraining paradigm proposed in the study, which required 60k slides with corresponding labels. We invite the authors @Xiyue-Wang @Dadatata-JZ @khyu and anyone else to verify our results; feel free to ask questions. In particular, it would be great to hear from the senior author @khyu whether they still stand by their approach and study in light of this analysis and the concerns previously raised by the community.

Comment to the community

We are aware of multiple efforts to conduct similar analyses. In our opinion, publicly posting concerns and results allows the authors to respond and the community to vet our analysis, rather than direct and immediate communication with the Nature editors; rest assured that at this point they are already aware and monitoring the situation. If anything, we propose joining forces to compile concerns and conduct a thorough and fair analysis while giving the authors the opportunity to respond. In the coming days we plan to continue this analysis, compile all concerns, and post publicly before going back to the EIC.

Dadatata-JZ commented 1 month ago

@amy-galon

Hi Amy, thanks for sharing this. I quickly went over your write-up posted last night. The results are quite interesting. Here are some of my initial thoughts, and we can continue the discussion as needed.

First and foremost, while these two tasks may have been evaluated elsewhere, neither is included in CHIEF's presentation.

Some implementation details also don’t fully align, and since we don’t use logistic regression for classification tasks in CHIEF, I’m afraid that I can't comment on that aspect. We are quite familiar with EBRAINS and its sample distributions; implementation variations or perturbations may need to be noted. Personally, I feel CNS tumors are particularly complex and require substantial effort from all of us to investigate further ;)

That said, this shouldn’t discourage you or us from potential joint exploration. We’d be happy to discuss your experimental setups for these extended validations if you’d like to post your pipeline somewhere.

As always, we’re open to discussing in depth and providing support, even during off-hours for maintenance. Drop us an email.

Cheers

amy-galon commented 1 month ago

Random Initialization Outperforms CHIEF Features - part 2

@Dadatata-JZ here are results on your tumor origin prediction* task, which you report in your paper using finetuning, clearly showing that random initialization OUTPERFORMS CHIEF features. We will continue to run this for all of your experiments; in fact, feel free to tell us which experiment to run that would convince you that CHIEF slide features are meaningless and that your pre-training strategy makes no sense, given that the average of all features or random initialization does better.

| ROI Encoder | MIL Architecture | Pretrain Method | Evaluation | Origin prediction* AUROC | Origin prediction* Bal. Accuracy** |
| --- | --- | --- | --- | --- | --- |
| CTransPath | ABMIL | CHIEF | Fine-tune | 0.9951 | 0.8846 |
| CTransPath | ABMIL | Random Init. | Fine-tune | 0.9953 | 0.9001 |

The CHIEF results are from our reproduction; for fairness, we used your codebase to do the finetuning. More results are incoming soon, stay tuned. Again, we invite anyone to verify. We would love for the authors to refute our findings with evidence.

*The 'tumor origin' task reported by the authors is incorrectly posed in the CHIEF paper; this was explained well in #23 and previous issues. Since the authors don't use any metastatic cases, the model is just learning the tissue site. As #23 points out, we would love for the TOAD authors @fedshyvana @richarizardd to comment on this. **A multi-class classification problem should always report balanced accuracy if AUC is reported; for an 18-class problem, one-vs-all AUC will always be very high. Balanced accuracy is something the authors don't report in their paper.
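To make footnote ** concrete, here is a small synthetic illustration (made-up numbers, not CHIEF data) of how macro one-vs-rest AUROC can look excellent on a many-class problem while balanced accuracy exposes a classifier that collapses onto a single class:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Synthetic 18-class example: the true class always gets a solid score, so every
# one-vs-rest ranking looks great, but class 0 is boosted on every sample, so the
# argmax prediction collapses onto class 0.
n_classes, n_per_class = 18, 50
y_true = np.repeat(np.arange(n_classes), n_per_class)

y_score = np.zeros((y_true.size, n_classes))
y_score[np.arange(y_true.size), y_true] = 0.45   # every true class gets a solid score
y_score[:, 0] += 0.55                            # but class 0 is boosted on every sample
# rows already sum to 1, as sklearn requires for multiclass AUROC
y_pred = y_score.argmax(axis=1)                  # = class 0 for every sample

print(roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))  # 1.0
print(balanced_accuracy_score(y_true, y_pred))                             # ~1/18 ≈ 0.056
```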

amy-galon commented 1 month ago

Random Initialization Outperforms CHIEF Features - part 3

@Dadatata-JZ @Xiyue-Wang @khyu here is an additional example of a task 'included in CHIEF's presentation', where we clearly show that random initialization outperforms CHIEF features using fine-tuning, exactly as you present results in your article.

Task: IDH mutation prediction (included in the CHIEF study). A comparison between using CHIEF features and random initialization shows that random initialization outperforms CHIEF. We train on TCGA and test on a TCGA held-out set and on EBRAINS as an independent set.

| ROI Encoder | MIL Architecture | Pretrain Method | Evaluation | TCGA GBM+LGG IDH AUROC | EBRAINS IDH AUROC |
| --- | --- | --- | --- | --- | --- |
| CTransPath | ABMIL | CHIEF | Fine-tune | 0.9118 | 0.9264 |
| CTransPath | ABMIL | Random Init. | Fine-tune | 0.9128 | 0.9321 |

Perhaps now the authors could give a clear answer: what do you believe is the value of CHIEF slide features and your complex pre-training approach, given that one could just randomly initialize and get better results on actual experiments presented in your Nature article?

Dadatata-JZ commented 1 month ago

Amy, @amy-galon

Thank you for your enthusiasm. Due to time differences we may not be able to catch up promptly, but we will surely do our best!

These numbers look interesting; for instance, your numbers from simple tests already seem higher than the published knowledge. IMO, you may consider publishing this somewhere (e.g., for IDH prediction you easily surpassed 0.91 compared to other existing publications)!

However, since the numbers are not aligned with our experiments, to navigate this evaluation better, feel free to post your code, experimental setup, data processing, and partitions online. We can discuss from there with more context.

Have a good night.
Cheers,

amy-galon commented 1 month ago

@Dadatata-JZ I am available and working in the PT time zone. Please post your arguments publicly so others can also see how the CHIEF authors are responding to these critiques. We are running additional experiments from your Nature article and comparing with random initialization; so far, every experiment indicates that there is no value to CHIEF pre-training or slide features and that the pre-training architecture was incorrectly designed. We intend to make all of our findings, code, and models public.

As to your point, we disagree; unfortunately, your statement is not correct. The numbers for IDH mutation are not unusual at all. Please see below a screenshot from the UNI (patch encoder) supplement: their result using CTransPath (Medical Image Analysis, patch encoder) is comparable. So our results are reasonable, and they just prove our point that CHIEF pre-training (slide level) and the corresponding features are of no value in light of your own claims in the paper and the corresponding experiments.

[Screenshot: IDH prediction result with CTransPath features from the UNI supplement]

CTransPath (patch encoder) from @Xiyue-Wang (Medical Image Analysis, 2022), which was already published, clearly has value; as you can see, just like your paper, we use CTransPath features in all our experiments (https://github.com/hms-dbmi/CHIEF/issues/24#issuecomment-2363405532, https://github.com/hms-dbmi/CHIEF/issues/24#issuecomment-2362544341). Growing evidence suggests that the CHIEF slide pre-training from your current Nature article is largely useless.

@Dadatata-JZ we would love an explicit answer to this question: why would CHIEF pre-training be useful for any of the experiments you did in the paper if one could just initialize randomly and get a better or comparable result?

@Dadatata-JZ I see that, consistent with other posts, you skipped commenting on the tumor origin prediction task. Do you now acknowledge it was incorrectly posed in your article?

Sami-Ravi commented 1 month ago

@amy-galon Actually, that is not necessarily true. Hi @Dadatata-JZ, I am also working on some classification tasks using CTransPath. I used the CHIEF model weights and code, then extracted the frozen WSI-level features. I found that CHIEF pre-training outperforms the mean pooling method (see my Tables 1 and 2).

Experimental setup: first, I extracted the CTransPath features for each dataset. In Tables 1 and 2, I report results for two methods: using CHIEF features and mean pooling. For the CHIEF approach, I obtained the frozen slide features using Get_CHIEF_WSI_level_feature_batch.py and then used classifiers such as random forest, logistic regression, and SVM to obtain classification results. For the mean pooling approach, I used mean pooling to generate slide-level features and applied the same classifiers.

I tested these methods on three datasets: BCNB (2-class), PANDA (6-class), and Camelyon16 (2-class). For BCNB and Camelyon16, I used the official data splits from their websites (Camelyon16: https://camelyon16.grand-challenge.org/Data/, BCNB: https://bcnb.grand-challenge.org/). For PANDA, I used a 7:1:2 split for the training, validation, and test sets. My results can be seen below. Feel free to validate them, since these are all public datasets.
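For reference, a minimal sketch of this protocol, assuming the frozen CHIEF slide features and the mean-pooled features have already been exported to arrays; the classifier hyper-parameters shown are illustrative defaults, not necessarily the exact settings I used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative defaults, not necessarily the exact hyper-parameters used above.
CLASSIFIERS = {
    "LogReg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000)),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True)),
    "RandomForest": RandomForestClassifier(n_estimators=500, random_state=0),
}

def evaluate_frozen_features(X_train, y_train, X_test, y_test):
    """Fit each classifier on frozen slide-level features and report test metrics."""
    results = {}
    for name, clf in CLASSIFIERS.items():
        clf.fit(X_train, y_train)
        prob = clf.predict_proba(X_test)
        auc = (roc_auc_score(y_test, prob[:, 1]) if prob.shape[1] == 2
               else roc_auc_score(y_test, prob, multi_class="ovr", average="macro"))
        results[name] = {"AUC": auc,
                         "BalAcc": balanced_accuracy_score(y_test, clf.predict(X_test))}
    return results

# Called once per feature type (frozen CHIEF slide features vs. mean-pooled features)
# on the official BCNB / Camelyon16 splits and the 7:1:2 PANDA split described above.
```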

@Dadatata-JZ How do you interpret these differences between the different results?

[fig1: screenshot of Tables 1 and 2]

tranh215980 commented 1 month ago

Dear @amy-galon @Sami-Ravi ,

Thank you for analyzing these problems. It is great to see more people studying them. So far I am in the middle and find performance to be variable. My progress:

@Sami-Ravi CHIEF includes PANDA and BCNB in training and suffers from data contamination; we discuss this in #20. @amy-galon's mean pooling comparison is worrisome. @Sami-Ravi shows everyone needs to test further. Also important is showing random-init ABMIL as a baseline for the same cross comparison:

Dear @Dadatata-JZ,

There are many issues, but thank you for spending time to reply and for supporting joint exploration. I raise issues and criticisms, but I am also fair. As we share findings, can you also support us by making the splits and clinical metadata for TCGA+CPTAC public so we can test biomarker prediction and survival prediction? This would allow us to align with your setting, and I hope you will be fair too.

You also said something in a previous message when asked about CHIEF performance with a linear probe:

Some implementation details also don’t fully align, and since we don’t use logistic regression for classification tasks in CHIEF, I’m afraid that I can't comment on that aspect.

However, you make a comment in the README that says:

👑👑👑 Encoding one WSI as one feature representation.

Many downstream clinical applications (e.g., survival analysis, drug discovery, and the identification of unknown subtypes via unsupervised clustering), rely on encoding a single feature that effectively represents an entire slide. Therefore, in addition to patch-level (region-of-interest) encoding, CHIEF also focuses on whole slide image (WSI)-level embedding without fine-tuning the tile aggregator. Docker images (model weights) are available at https://hub.docker.com/r/chiefcontainer/chief.

@amy-galon @Sami-Ravi do we agree? CHIEF is advertised as a slide foundation model, and linear probing is the standard protocol for image encoders. In pathology, all ROI foundation models are evaluated with a linear probe on frozen features. A slide foundation model should do the same, and the minimum bar for CHIEF needs to be higher than outperforming mean pooling @amy-galon. Not only CHIEF finetuning but also the CHIEF linear probe needs to be better than ABMIL trained from scratch.

Eshay14 commented 1 month ago

@Sami-Ravi @amy-galon @tranh215980 I have just finished running my set of experiments. My sense is that CHIEF does worse than random and mean features a majority of the time and barely works for a FEW of the datasets that the pre-training was conducted on. As the authors made clear, it is pre-trained on PANDA (https://github.com/hms-dbmi/CHIEF/issues/18#issuecomment-2345062265), so it does reasonably well on that dataset. @Xiyue-Wang never responded to @amy-galon's very valid and important question here: why don't the authors think there is data leakage in CHIEF if the same datasets are used for pre-training, similar to what @Xiyue-Wang described in her ICLR 2023 article?

@amy-galon I disagree with you. Look at the responses from the authors; do you believe we will ever get a straight answer? We are contacting the editors because there are too many issues with this article:

a) Data leakage (pre-training and assessment on the same data).
b) Incorrectly posed experiments (tumor origin without mets?).
c) Carefully concealed comparative analysis. At the very least we can ask the editor to check the peer review file: did the authors ever tell the reviewers that their comparison with REMEDIS is not fair and was just taken from that paper? They likely used this to convince the reviewers without telling them the details.
d) Mean and random features are better than CHIEF in a majority of cases. In fact, @Dadatata-JZ himself acknowledges that there is no need to use CHIEF beyond what it was trained on (see above). Also see the now-deleted comments from @Xiyue-Wang asking people not to use CHIEF, saying it is not state of the art, yet claiming in the paper that it is better than REMEDIS? Again, to game the review process.
e) The results reported for PORPOISE are wrong. After reading the comment on PubPeer and trying to reproduce them (I will create a separate issue), I always found PORPOISE with ResNet and REMEDIS features to do better than CHIEF.

tranh215980 commented 1 month ago

Dear @Eshay14 ,

What benchmarks did you test? I would like to know. I am using CLAM to extract features with tcga.csv and then testing with scikit-learn.

Dear @amy-galon @Sami-Ravi @Eshay14 ,

I have been testing more. I point out that the authors' CTransPath patch features for the test sets in the TCGA origin prediction task, the MUV IDH1 task, the DROID task, and the "Dataset_PT" task can be taken from the Docker container. I do not find that CHIEF always does worse than mean pooling; results can be close. Because my original patch features are different from CHIEF's, I share some results using leave-one-out on their features so we can all do the same comparison.

1. My settings:
2. My results:

| Dataset | Model | BACC | AUC |
| --- | --- | --- | --- |
| MUV | CHIEF & LogReg | 0.865383 | 0.930296 |
| MUV | Mean Pool & LogReg | 0.864102 | 0.917949 |
| DROID | CHIEF & LogReg | 1 | 1 |
| DROID | Mean Pool & LogReg | 0.990619 | 0.999948 |
| Dataset-PT | CHIEF & LogReg | 0.996350 | 0.999798 |
| Dataset-PT | Mean Pool & LogReg | 0.984016 | 0.999616 |

3. Compared to original:

(a) IDH1 performance on MUV is close to Supplementary Table 25; CHIEF finetuning is 0.870 BACC and 0.944 AUROC. (b) For DROID, maybe I did something wrong and used the wrong subject id column to leave out. My AUC is close to their 0.981 AUROC in Figure 1, which shows CHIEF finetuning performance. (c) For Dataset-PT, maybe I did something wrong and used the wrong subject id column to leave out. My AUC is close to their 0.994 AUROC in Figure 1, which shows CHIEF finetuning performance. One more fact: the AUROC for the other comparisons in Figure 1 for Dataset-PT is lower than 0.916. My feeling is that the baselines in the paper are shockingly low.
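In case it helps others align with my setup, a minimal sketch of the leave-one-out-by-subject evaluation described above, assuming the container's per-slide features, labels, and subject ids are loaded into arrays (the subject id column used for grouping is exactly the detail I may have gotten wrong):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_subject_out(X, y, groups):
    """Leave-one-subject-out CV: all slides from one subject form the held-out fold.

    Assumes integer labels 0..K-1 and that every training fold still contains all
    classes, so the predict_proba columns line up across folds.
    """
    y_true, y_prob = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = LogisticRegression(max_iter=10000).fit(X[train_idx], y[train_idx])
        y_true.append(y[test_idx])
        y_prob.append(clf.predict_proba(X[test_idx]))
    y_true, y_prob = np.concatenate(y_true), np.concatenate(y_prob)
    auc = (roc_auc_score(y_true, y_prob[:, 1]) if y_prob.shape[1] == 2
           else roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"))
    return {"AUC": auc, "BACC": balanced_accuracy_score(y_true, y_prob.argmax(axis=1))}
```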

My setting is not perfect, but it is good for comparing on the same features. @amy-galon where did you get the EBRAINS splits from UNI? How can we enter a joint collaboration together? My personal email is tranh.nguyen.dp@gmail.com. @Eshay14 if you have the paper reviews, can I also see them?

Sami-Ravi commented 1 month ago

@amy-galon @Eshay14 My new results disagree with yours. I further did some comparisons between CHIEF WSI features and the mean pooling method. You can find my results in Tables 3-6. BTW, a linear probe obtains similar results. As you can see, the performance of the frozen CHIEF features is significantly higher than that of the corresponding mean pooling method.

[fig1–fig4: screenshots of Tables 3–6]

I have different takeaways about the pretraining data. While PANDA and BCNB were used for pre-training in CHIEF, I used different biomarker labels from the pretraining tasks. Also, Camelyon 16 was never used in pre-training. I think my results are valid.

Hi @tranh215980, you could further check my features and my method of obtaining WSI-level features. I have uploaded the frozen slide-level CHIEF features here, following here. Feel free to use them. I like your idea about cross-validating. I currently don't have the EBRAINS data downloaded; I could move to it later once I get the data access permit.

tranh215980 commented 1 month ago

Dear @Sami-Ravi ,

Thank you for the effort. I do not yet fully agree that CHIEF is worse than mean pooling until more evidence comes, but I have found evidence showing the improvement is small. How can GigaPath also be like this? These results are interesting, but we cannot do a full reproduction from scratch if we only have frozen slide features as .pt files. Even with frozen patch features as .pt files we have no coordinates. Because my trust in this study is low, I need stronger evidence now, and the bar should be higher. The only things I trust are:

While PANDA and BCNB were used for pre-training in CHIEF, I used different biomarker labels from the pretraining tasks.

I cannot agree with this. If the tumor class is used for "weakly-supervised pretraining", this has some label leakage, which is a big concern. The label for tumor/normal correlates with the Gleason grade in PANDA and with biomarker status in BCNB, so we should set these results aside, do you agree? Camelyon16 is more interesting and I will study it at some point, but as @Dadatata-JZ says, we should only look at tasks within the study for full fairness. In the meantime, @Sami-Ravi can you please share a CHIEF model with random init and finetuning, which gives a baseline for the ABMIL+CTransPath combo? I invite any collaboration and support at tranh.nguyen.dp@gmail.com. Thank you for your attention.