hms-dbmi / CHIEF

Clinical Histopathology Imaging Evaluation Foundation Model
GNU Affero General Public License v3.0

Several Major Issues with Methods, Experiments, and Analysis #23

Open Eshay14 opened 2 months ago

Eshay14 commented 2 months ago

Apologies in advance: this is a long post about this Nature article. I have been reading the paper since it came out and I have MANY questions, too many to list in a single post. For some context, I have been closely following the foundation model revolution in pathology over the past year and have been using these models for specific tasks and benchmarking. This paper is clearly an outlier relative to all other foundation model studies published in Nature and Nature Medicine: I was extremely puzzled by the methods, the reporting of results, the comparative analysis, the fundamental baselines, and just how the experiments were conducted. Given how hot this topic is, and given that this work made it into Nature, I assume many in the community are already trying to reproduce the results and understand what the authors did in contrast with all the other foundation models out there. After scratching my head for several days and looking at the other issues the community has raised on GitHub, here are the most concerning aspects I found. I will limit my comments to those that can be backed by publicly available information and data, and would appreciate it if others in the community could help me understand whether I am missing anything.
- Unclear training architecture and no evaluation of raw features: From the methods text it is extremely difficult to figure out exactly how the model was trained; many details are left vague, and from the responses on GitHub it seems the authors still have not explicitly clarified how the model was actually trained (see the authors' responses to many questions on this forum). It appears the authors did not train a self-supervised patch encoder on the entire 60k slides (is this correct?). They contrast with 'anatomic site' as text (seemingly without using CLIP? Is this true?). They used weakly supervised learning (ABMIL?, which treats tiles as mutually exclusive) and yet claim in the main text that previous methods work "without considering the interactions of different regions of the same tissue", which they argue CHIEF solves. Unless I am missing something, CHIEF does not encode any interactions across patches, so this statement is incorrect. Also, this form of supervision using the 'anatomic site', followed by predicting the 'origin' from primary cases only as a downstream task, seems equivocal (see the point below). There are limited ablation studies in the paper to establish which components of the proposed architecture work and which are redundant, but I am confident the community will conduct more detailed analysis and report back (provided the training code is clearly released by the authors, which I hope will be the case if they stand by their results). Also, if I understand correctly, the slide pre-training stage is NOT evaluated at all, and all downstream tasks are fine-tuned (is my understanding correct?). There are multiple issues on GitHub asking these questions already, and the authors give very vague or limited answers. Perhaps someone could check how good the raw features from CHIEF are without fine-tuning, to close the loop on whether the complex architecture proposed here is of any actual value; a minimal sketch of such a probe is at the end of this post.

- Missing and carefully hidden comparative analysis: The authors claim in their GitHub README and in responses to several issues that they submitted the paper in early 2023 (https://github.com/hms-dbmi/CHIEF?tab=readme-ov-file#Update) and that none of the other methods existed at the time to conduct formal comparisons. I find this an extremely weak and odd argument. First, the paper was submitted in November 2023, hardly early 2023 (https://www.nature.com/articles/s41586-024-07894-z#article-info). Second, the authors DID INCLUDE comparisons with other models, including REMEDIS, and delegated them to the supplement (?). In fact the authors repeatedly claim in the supplement that their model does significantly better than REMEDIS. In a recent GitHub issue, in response to someone questioning how the comparisons were made, one of the authors said, "We simply copied the results and compared the MSI mutation task performance to meet the reviewers"; this response was quickly edited by another author but remains in the GitHub history. This means the authors were asked by the reviewers, and had full opportunity, to include appropriate comparisons with recent foundation models, but decided to conduct extremely vague experimentation and delegate those results to the supplement. There is no detail in the paper on how these comparisons were conducted, just a claim that CHIEF does better, which they now walk back in the GitHub update issued last week by saying not to use CHIEF as a rich SOTA feature extractor (https://github.com/hms-dbmi/CHIEF?tab=readme-ov-file#Update). Well, which one is it: did the authors show the reviewers comparisons indicating that CHIEF does better to get the paper through, even though this was not the case? The authors acknowledge that they just took the numbers from the other articles, clearly indicating that it is not an apples-to-apples comparison (https://github.com/hms-dbmi/CHIEF/issues/18). How exactly can it be a fair comparison if the data are different? Taking numbers from another paper with no diligence that the comparison is fair, the splits are the same, and the hyperparameters are consistent is seriously problematic. The REMEDIS model weights were available all this time and the authors could clearly have used them. Literally every foundation model article submitted and published before and after has made appropriate comparisons; in fact, I would argue that comparative analysis is at the very core of any foundation model development and study. GigaPath, also a slide-level foundation model, submitted the same month and published in the same journal, made appropriate comparisons with HIPT (https://www.nature.com/articles/s41586-024-07441-w). Is there a reason the authors shy away from making such comparisons? Well, the proof is in the pudding: running basic experiments to evaluate the quality of the embeddings reveals that CHIEF does significantly worse than other models that were made available before the authors submitted their paper or while it was in review. I am confident that others in the community will run these experiments and report back similar results.

- Incorrect tumor origin prediction task; it is just predicting the tissue site: The authors incorrectly claim that they are predicting the origin of the tumor when they ONLY use primary cases from TCGA and no metastatic cases. To the best of my understanding, the presence of both primary and metastatic cases for each class in the training set is what forces the model to learn from the tumor regions (i.e. the morphology common to both primary and metastatic cases) when predicting the origin. If metastatic cases are removed, the model can learn a 'shortcut' from the normal tissue and will just be predicting the anatomic site the tissue was sampled from (and the authors also contrast with 'anatomic site' during training?). The authors also incorrectly state in a recent GitHub issue that TOAD only uses TCGA cases; I explicitly checked the TOAD Nature 2021 paper, and, just like molecular origin prediction assays, it combines primary and metastatic cases for each class. We could also get the TOAD authors to weigh in here, although, being from the same institution, they may be biased. Given that the authors are just predicting the anatomic site, and that they use the same information during training, this whole part of the paper is, in my opinion, largely incorrect. Again, it is only a matter of time before others in the community reach the same conclusion.

- What is the relationship to SCL-WC? The study seems to be a data and partly architectural extension of the authors' own SCL-WC NeurIPS paper (https://openreview.net/forum?id=1fKJLRTUdo), the code for which has suspiciously disappeared from GitHub in the past 48 hours after people started asking questions (https://github.com/Xiyue-Wang/SCL-WC/ can no longer be accessed; removing code post-publication is possibly in violation of NeurIPS policy). This was also noted by another poster in the GitHub issues. I think what is likely happening here is that the authors concealed that their model is largely similar to their NeurIPS paper; given the similarity, I was surprised, if not shocked, to see that it is never cited in the main paper. And again, no comparisons, because why would they be important in such a paper? Repeated questions from a user about the paper's relationship to SCL-WC have resulted in vague answers from the authors (see https://github.com/hms-dbmi/CHIEF/issues/21 and https://github.com/hms-dbmi/CHIEF/issues/20).

- General comment on the introduction and discussion: It seems the fact that they did survival prediction is presented as a major novelty of the article; however, in my humble opinion, that is just a downstream task that could be done with any pathology foundation model. In fact, many previously published foundation models have subsequently been used for survival prediction. TCGA survival is gimmicky, and the results are often arbitrary across splits; I assume this is why other foundation models do not include it and focus on more deterministic tasks. The authors want us to believe that the GigaPath Microsoft group could not have done survival prediction. Okay. I read the paper several times and could not figure out the contribution. Is the architecture the contribution? There is limited ablation and no concrete way to justify that the given architecture is better than the norm in the community. Is the large-scale evaluation the contribution? Literally every pathology foundation model article published before CHIEF has already done that. Are the model weights that can be used by the community the contribution? As the authors themselves point out in their recent update (https://github.com/hms-dbmi/CHIEF?tab=readme-ov-file#Update), it is unlikely people would use CHIEF given they can use the much more rigorously evaluated GigaPath for WSI-level encoding and GigaPath, Virchow 2, or other models for patch-level encoding.

I do apologize if I have misunderstood some of these points, and I would be glad if the authors could prove me wrong. However, in my opinion, the field is hot enough that the ML community interested in pathology will conduct its own scientific forensic analysis to unravel every minor detail and experiment in the study. As I largely work in a comp bio/genomics group, we are asking more tenured digital pathology groups to weigh in to further refine our understanding of this work and will report back with our findings in the coming days. In the meantime, if anyone has run more detailed experimental analysis around CHIEF, please post here.
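For anyone who wants to check the raw (non-fine-tuned) embedding quality themselves, here is the kind of frozen-feature linear probe I have in mind. This is a minimal sketch only: it assumes you have already exported slide-level embeddings from CHIEF (or any other WSI encoder) to .npy files with one binary label per slide, and the file names and split logic are placeholders, not anything from the CHIEF repo.

```python
# Minimal frozen-feature linear probe for slide-level embeddings.
# embeddings.npy (n_slides, d) and labels.npy (n_slides,) are placeholder files,
# not part of the CHIEF release.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X = np.load("embeddings.npy")   # frozen slide-level features from any WSI encoder
y = np.load("labels.npy")       # binary task labels (e.g., cancer vs. non-cancer)

aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    scaler = StandardScaler().fit(X[train_idx])
    clf = LogisticRegression(max_iter=5000, C=1.0)
    clf.fit(scaler.transform(X[train_idx]), y[train_idx])
    probs = clf.predict_proba(scaler.transform(X[test_idx]))[:, 1]
    aucs.append(roc_auc_score(y[test_idx], probs))

print(f"linear-probe AUROC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

Running the same probe, with the same folds, over embeddings from several encoders (CHIEF, GigaPath, Virchow, UNI, ...) is what I would call an apples-to-apples comparison, as opposed to copying numbers across papers that use different splits and settings.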

Xiyue-Wang commented 2 months ago

@Eshay14 Thank you for your questions. All questions are welcome.

We are happy to see the growing interest within the computational pathology community as the field gains traction. We have ongoing extended investigations in collaboration with experts in computational biology, particularly in cutting-edge areas such as spatial transcriptomics. These multimodal approaches are pushing the boundaries of what can be achieved through joint efforts between the computational and biological sciences.

CHIEF is one of several back-to-back papers from independent groups developing and investigating pathology foundation models (e.g., GigaPath, Virchow, UNI). Specifically, we would like to elaborate on a few of your questions:

- For the weakly supervised pretraining model, we performed slide-level classification to distinguish cancerous (positive) from non-cancerous (negative) slides. We employed our previously developed CTransPath as the self-supervised patch encoder. As noted in the paper, this patch encoder was pretrained on 15 million patches.

- For the downstream tasks (e.g., genomic profile or prognostic prediction), we employed fine-tuning. For the cancer detection task, however, we used CHIEF to infer directly from raw features without fine-tuning, applying it to 15 unseen datasets.

- We added a comparison with UNI and REMEDIS because they addressed the same clinical tasks, and we observed better performance on the MSI and IDH prediction tasks compared to the results they reported. While this comparison was based solely on reported numbers, without executing the code of the compared methods, we believe it remains meaningful: even with significant differences in implementation and settings, comparing our results with those reported in the literature provides some insight into the relative effectiveness of different methodologies under similar task conditions. We did not compare against HIPT, as it is also a patch-level model but uses a significantly larger patch size (4096x4096).

- The texts (anatomical sites) are inputs during the pre-training phase; during inference (e.g., tumor origin prediction), we no longer use the text branch.

- In CHIEF, the weakly supervised pretraining model employs the SCL-WC architecture for image encoding and extends it with an additional text branch (i.e., a CLIP text encoder). During pretraining, text features (anatomical sites, e.g., lung, breast, brain) and image features are fused to facilitate slide-level classification; a simplified sketch of this fusion is included below. The reference can be found in the Methods section of the supplementary file.

- For encoding features at the patch level, larger foundation models such as UNI, GigaPath, and Virchow could also be considered. CHIEF and GigaPath are recommended as baselines for studies focusing on WSI-level feature representations.

- The text embedding features are stored during pre-training. They can be downloaded at https://drive.google.com/file/d/1ZxtWgYPk95y2hfKFXj_NybypTb9vB6e_/view?usp=sharing
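To make the fusion described above more concrete, here is a rough, simplified sketch: gated-attention MIL pooling over patch features, with the anatomical-site text embedding projected and concatenated before the slide-level classifier. This is an illustration of the idea only, not the actual CHIEF implementation; the dimensions, module names, and the concatenation-based fusion are placeholders.

```python
# Simplified illustration of gated-attention MIL pooling plus text-feature fusion.
# Not the actual CHIEF code: dimensions, names, and the fusion choice are placeholders.
import torch
import torch.nn as nn

class AttnMILWithText(nn.Module):
    def __init__(self, feat_dim=768, text_dim=768, hidden_dim=256, n_classes=2):
        super().__init__()
        # Gated attention over patch embeddings (ABMIL-style).
        self.attn_v = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        # Project the anatomical-site text embedding to the image feature space.
        self.text_proj = nn.Linear(text_dim, feat_dim)
        # Classifier on the fused (image + text) slide representation.
        self.classifier = nn.Linear(feat_dim * 2, n_classes)

    def forward(self, patch_feats, text_feat):
        # patch_feats: (n_patches, feat_dim); text_feat: (text_dim,)
        a = self.attn_w(self.attn_v(patch_feats) * self.attn_u(patch_feats))  # (n_patches, 1)
        a = torch.softmax(a, dim=0)
        slide_feat = (a * patch_feats).sum(dim=0)          # attention-pooled slide embedding
        fused = torch.cat([slide_feat, self.text_proj(text_feat)], dim=-1)
        return self.classifier(fused)

# Example usage with random tensors standing in for real patch and text features.
model = AttnMILWithText()
logits = model(torch.randn(500, 768), torch.randn(768))
```

As noted above, the text branch is only used during pre-training; this sketch only illustrates the pre-training-time fusion.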

We understand there are many subjective criticisms in your note. We are happy to walk you through CHIEF in detail and to see how you would like to employ foundation models in your work. Based on the feedback we have received, many digital pathology groups globally have been using CHIEF successfully for their analytical tasks without our guidance. Please feel free to reach out to us via email or to set up a Zoom meeting at your convenience. We would be happy to discuss your objectives and any questions about our studies in more detail.

amy-galon commented 2 months ago

As I study and reproduce the results in this paper, it is becoming clear that @Eshay14 is absolutely correct that the tumor origin prediction task is incorrectly posed in the CHIEF paper; @tranh215980 raised the same issue here: https://github.com/hms-dbmi/CHIEF/issues/19#issue-2520239717. As @Eshay14 proposed, it would be great to get the authors of the original TOAD Nature 2021 paper to weigh in on this. @fedshyvana @Richarizardd, would you care to comment? Or, as @Eshay14 suspects, are you too conflicted to speak out, being from the same institution as the CHIEF authors? For reference, the relevant point from the original post is quoted below, followed by a sketch of the shortcut check I am running.

> Incorrect tumor origin prediction task; it is just predicting the tissue site: The authors incorrectly claim that they are predicting the origin of the tumor when they ONLY use primary cases from TCGA and no metastatic cases. To the best of my understanding, the presence of both primary and metastatic cases for each class in the training set is what forces the model to learn from the tumor regions (i.e. the morphology common to both primary and metastatic cases) when predicting the origin. If metastatic cases are removed, the model can learn a 'shortcut' from the normal tissue and will just be predicting the anatomic site the tissue was sampled from (and the authors also contrast with 'anatomic site' during training?). The authors also incorrectly state in a recent GitHub issue that TOAD only uses TCGA cases; I explicitly checked the TOAD Nature 2021 paper, and, just like molecular origin prediction assays, it combines primary and metastatic cases for each class. We could also get the TOAD authors to weigh in here, although, being from the same institution, they may be biased. Given that the authors are just predicting the anatomic site, and that they use the same information during training, this whole part of the paper is, in my opinion, largely incorrect. Again, it is only a matter of time before others in the community reach the same conclusion.
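Here is the kind of check I mean, sketched with placeholder file names and columns (none of this comes from the CHIEF release): train an origin classifier on primary slides only, then evaluate it separately on held-out primary and on metastatic slides of the same classes. If the model is really reading tumor morphology rather than the surrounding tissue site, performance should not collapse on the metastatic slides.

```python
# Hypothetical check for the anatomic-site shortcut in origin prediction.
# metadata.csv and embeddings.npy are placeholders: one row/vector per slide, with
# columns "origin" (primary site label) and "sample_type" ("primary"/"metastatic").
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

meta = pd.read_csv("metadata.csv")
X = np.load("embeddings.npy")          # frozen slide-level features, aligned with meta rows

is_primary = meta["sample_type"].eq("primary").to_numpy()
train_mask = is_primary & (np.random.default_rng(0).random(len(meta)) < 0.8)

# Train on primary slides only, mirroring the setup described in the paper.
clf = LogisticRegression(max_iter=5000)
clf.fit(X[train_mask], meta.loc[train_mask, "origin"])

for name, mask in [("held-out primary", is_primary & ~train_mask),
                   ("metastatic", ~is_primary)]:
    preds = clf.predict(X[mask])
    acc = balanced_accuracy_score(meta.loc[mask, "origin"], preds)
    print(f"{name}: balanced accuracy = {acc:.3f}")
```

The important part is that both evaluations use the same classifier trained only on primary slides, so any large gap between the two numbers points to the site shortcut rather than to tumor morphology.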