questions about baseline + code

tranh215980 commented 2 weeks ago

Dear authors,

First thank you for your amazing works and contributions. This is one of my favorite papers to read and I have study this paper over and over again like a handbook. there many citations helping me with literature review. Now with better understanding of pathology now I have many questions and I ask everything with only objectivity:

Where is codes for ABMIL baseline? I parsed Methods many times and also see Extended Data Table 9. The model definition are missing and I dont know if I use same versions when Im making results.
@amy-galon share results using same UNI splits on EBRAIN in https://github.com/hms-dbmi/CHIEF/issues/24#issue-2535143416. Where I find to reproduce myself?
@amy-galon in same GitHub issues also show new results with mean pooling better than slide foundation model . Have you tried this yourself and if so why didnt u not compare models with mean pooling on slide tasks?
You dont compare with CONCH but they publish together why is that?
Why k-Nearest Neighbors (k-NN) and "SimpleShot" used for evaluation in only your paper? I read other foundation model papers and no one is doing this except yours. I so far reproduce your linear, k-NN, and SimpleShot results on CRC-100K and surprised how much good k-NN and SimpleShot are and is suspicious k-NN is only good for UNI? k-NN not used in CONCH either so why is that? Is there reason why "NONORM" is used instead of "NORM" for CRC-100K?
So im 100% clear: this method use only DINOv2 and no other methods? The paragraphs on iBOT dont make sense to me (issue #40 and also page 21-22). Why include when DINOv2 is discussed? This seems like roundabout way to say you include hidden tricks from iBOT?
As "foundation model (FM)" paper this has many great experiments. But their many missing spots still like SOTA is never compared and no ROI tasks with independent test cohorts. Rare disease described as motivation in introductions but never evaluated. I like writing style very much but paper is dodgy to say "this is not a FM" (see paragraph before methods on page 19) so why is that? Paper never say UNI is "FM" but it say "scaling laws are observed" which only for FMs. This claim is maybe misleading since scaling laws are for studying both data and model size. No survival results (unlike CHIEF) why is that? The authors say all comparisons with ABMIL are fair and I agree on this. But this paper lack comparison with better MILs and I feel better MIL with weaker features can overcome UNI quite easily.
This is terrible to ask but I want you on record. "OT-43" and "OT-108" tasks come from Harvard. This is private dataset with 3944 slides in train set and 1620 slides in test set. First I read test set never exposed during self-supervised learning (page 28). I mean with no disrespect but how can we trust authors when only UNI model (trained on Harvard data) does well on this task out of all models? I dont mean by accusing you and I know this paper is first to study SSL contamination in pathology but this is suspicious. Plus all UNI ablations on smaller datasets do better (Extended Data Table 13). So you are saying UNI use 1M images is greater than CTransPath and REMEDIS which has many more? Many papers like yours and Virchow show more data trains better model so why not case here?
This question is with above issue too. I read also "OT-43" and "OT-108" is only slide task with ABMIL trained with no early stopping why is that? Last you do "interesting" in OT-43 and OT-108 where you cluster-sample patches and extract patch features for each slide (page 29). These are used into MIL. Because you do on both train and test is this not data contamination? This issue is suspicious like one in CHIEF (https://github.com/hms-dbmi/CHIEF/issues/20#issuecomment-2354277364) also from Harvard lab do you know?

Sorry for many questions. This paper is very long but I still read carefully so Im having many questions. Please when you answer can you also respond straightforward? This paper is important to my understanding so I must ask tough but fair questions.

Richarizardd commented 6 days ago

HI @tranh215980 - Apologies for getting to you late, and thank you for your time and interest in studying our work. It is also humbling to see this work studied so carefully. Due to the length of this GitHub issue, I will answer each question in a separate comment. To address a recurring issue, I believe many of these questions can be resolved by reading the camera-ready version published in Nature Medicine and seeing the updated comparisons in the bottom of the README.

Richarizardd commented 6 days ago

Where is codes for ABMIL baseline? I parsed Methods many times and also see Extended Data Table 9. The model definition are missing and I dont know if I use same versions when Im making results.

Apologies that this was not clear, and we will update our README soon. Our codebases have generally adapted CLAM for weakly-supervised slide classification, and we have steered away from rehosting CLAM when used in different projects. You can find the ABMIL implementation used in UNI in our PANTHER codebase.

Richarizardd commented 6 days ago

@amy-galon share results using same UNI splits on EBRAIN in CHIEF pretrained features are worse than random initialization or simple mean pooling hms-dbmi/CHIEF#24 (comment). Where I find to reproduce myself?

Apologies that this was not clear, and we will update our README soon. The splits for EBRAINS (used in UNI) can be found also in the PANTHER code in https://github.com/mahmoodlab/PANTHER/tree/main/src/splits/classification/ebrains.

Richarizardd commented 6 days ago

@amy-galon in same GitHub issues also show new results with mean pooling better than slide foundation model . Have you tried this yourself and if so why didnt u not compare models with mean pooling on slide tasks?

We have tried mean pooling before and it can be quite strong (see Table 1 in HIPT). To reiterate on how we evaluated representation quality in UNI:

At the ROI-level, we used common and standard evaluation protocols such as linear and KNN probe.
At the slide-level, since these encoders are most often used to pre-extract ROI features in a MIL setup (which is the standard convention in solving slide-level tasks), our protocol for slide classification plugged ROI encoders into a very simple MIL baseline (e.g., ABMIL). We can think of pre-extracted ROI features from UNI and other ROI encoders as being the dense patch features in the last layer of a ViT, and ABMIL as being a linear / MLP head that we finetune on top of the dense patch features, which is adopted in other types of SSL evaluation protocols.

Taking the average of bag features can be a good baseline baseline and more people should revisit this baseline with stronger ROI encoders and esp with pretrained slide encoders. On why we didn't:

Mean pooling is a good baseline but not a standard for slide-level evaluation. It is an interesting contribution but not the best to formally introduce in a benchmark-like paper for evaluating ROI encoders, and the paper was getting quite packed. Independent of other discussions, this is an issue that we have been studying concurrently for some time. I agree with you that this should have been a baseline in recent Slide FMs.
As you may have noticed, mean pooling is sensitive to cancer detection tasks. Representations would be skewed by the ratio of tumor:normal tissue content, and I would also say that this is not the best baseline either for slide tasks.

Richarizardd commented 6 days ago

You dont compare with CONCH but they publish together why is that?

The bottom section of our GitHub repository shows CONCH comparisons (also in CONCH repo).

Richarizardd commented 6 days ago

Why k-Nearest Neighbors (k-NN) and "SimpleShot" used for evaluation in only your paper? I read other foundation model papers and no one is doing this except yours. I so far reproduce your linear, k-NN, and SimpleShot results on CRC-100K and surprised how much good k-NN and SimpleShot are and is suspicious k-NN is only good for UNI? k-NN not used in CONCH either so why is that? Is there reason why "NONORM" is used instead of "NORM" for CRC-100K?

On KNN: KNN is used in many SSL works in fundamental computer vision, and is especially used in the "DINO" family of SSL works (DINO, iBOT, DINOv2). You won't see it in works like MAE which are also vision-only SSL but do not pretrain with the [CLS] token. Because UNI uses DINOv2 for the SSL recipe, we adopted similar evaluation strategies like KNN probe and Mask2Former finetuning for evaluation (which are both used in DINOv2). Because many current pathology FMs are based on DINOv2 and a majority of utilization of pretrained ROI encoders is with frozen, pre-extracted features, I would disagree and argue KNN probe should be used more widely: (a) KNN is part of standard linear eval in DINO / iBOT / DINOv2, (b) non-parametric and very easy to validate, no significant hyper-parameters to worry about, (c) evaluates a different aspect of representation quality (does not require linear separability).

On CONCH comparison: See CONCH KNN performances in our README as well. Both UNI and CONCH are ROI encoders, but the contributions in the respective studies entailed highlighting different aspects of the models (representation quality in UNI and zero-shot capabilities in CONCH). CRC-100K-NORM is maybe more standard but I think CRC-100K-NONORM is a harder task that additionally tests robustness to domain shift.

Again, I do not think we are being unfair to other models by evaluating with KNN. In our paper, we also hypothesized that UNI features are more robust and less sensitive to stain variation, which we additionally validated on CAMELYON17-WILDS. If you see a flaw in using KNN or flaws in how the original DINOv2 evaluated, I would be interested to discuss further and hear what your opinion is.

Richarizardd commented 6 days ago

So im 100% clear: this method use only DINOv2 and no other methods? The paragraphs on iBOT dont make sense to me (issue Problems with the code for model training #40 and also page 21-22). Why include when DINOv2 is discussed? This seems like roundabout way to say you include hidden tricks from iBOT?

Yes - we only use DINOv2. We are very explicit everywhere in the paper that “UNI is a VIT-L/16 pretrained via DINOv2”. There are no hidden tricks - this paper (and others) are using DINOv2 with no or limited modifications. I believe you have been reading the arXiv version and not the official camera ready version published in Nature Medicine which includes many updated experiments that would also address your questions. We discussed iBOT since we compared ViT-L/16 DINOv2 vs. ViT-B/16 iBOT across Mass-1K/22K/100K in the camera ready version. The method section here is a bit long because there are some nuances in comparing models trained with iBOT vs. that of DINOv2, which may seem like different SSL algorithms but are closely related to each other (iBOT + several improvements training recipe and implementation = DINOv2).

Richarizardd commented 6 days ago

Many of your questions can be answered by reading the camera ready version published in Nature Medicine (not the initial version available on arXiv).

As "foundation model (FM)" paper this has many great experiments. But their many missing spots still like SOTA is never compared and no ROI tasks with independent test cohorts.

In the supplement, we included further comparisons with SOTA, leaderboard and retrospective results published by others. We do not emphasize these results because we are taking numbers as-is with very different hyper-parameters and evaluation protocol.

Rare disease described as motivation in introductions but never evaluated.

Please see the updated camera ready version. 90/108 cancer types in OT-108 consists of rare tumors. All 30/30 brain tumors in EBRAINS consists of rare brain tumors.

I like writing style very much but paper is dodgy to say "this is not a FM" (see paragraph before methods on page 19) so why is that?

I cannot fully unravel the semantics around "FM" in a GitHub issue (see previous debates on X) , but in short - most SSL works describe themselves as vision encoders. DINOv2 describes itself as a SSL vision encoder that extracts general-purpose features. UNI and others use DINOv2. The definition evolved over time and I guess shifted for pathology but in 2022/23 this terminology was not prolific at the time, and was more appropriate for describing CLIP/PaLM/Chinchilla/Flamingo-like models and not vision-centric models. Was there another paragraph that you found "dodgy"? Sentences which did mention "FM" were used to motivate the objective of the study (in working towards a general-purpose FM).

Paper never say UNI is "FM" but it say "scaling laws are observed" which only for FMs. This claim is maybe misleading since scaling laws are for studying both data and model size.

See above for studying data and model size. I would argue "observing scaling laws" does not necessitate a model being a "FM" either.

No survival results (unlike CHIEF) why is that?

Assessing representation quality is much easier on ROI classification tasks, e.g. - separating visual categories such as tumor morphology which are clearly linearly separable. Slide-level tasks such as survival prediction may be more dependent on MIL architecture design than ROI encoder. In addition, we have also assessed UNI on survival prediction tasks in follow-up works (see PANTHER).

GSoE6qmbIAAVqJz

The authors say all comparisons with ABMIL are fair and I agree on this. But this paper lack comparison with better MILs and I feel better MIL with weaker features can overcome UNI quite easily.

See Supplementary Table 36 (posted at the top of this comment, replying to your Leaderboards question) and the image above on survival prediction comparisons.

On CAMELYON16, UNI+ABMIL outperforms CTP+TransMIL and other combinations.
On EBRAINS and PANDA (Karolinska test set), UNI+ABMIL outperforms CTP+TransMIL and other combinations. On PANDA (Radboud test set), UNI+ABMIL is only outperformed by CTP+ILRA (Quad. Kappa of 0.918 vs. 0.920). I think MIL architecture is important for more complicated tasks, but do not think better MIL with weaker features will "easily" overcome simpler MIL with stronger features. I also do not assert that better MIL architectures are not needed. At the time, I found it surprising how much performance gains you could get with just simpler MIL + stronger features, and there was a gap in literature on developing better SSL encoders. There is also no reason to use UNI with an equivalent MIL model either.

Richarizardd commented 6 days ago

This is terrible to ask but I want you on record. "OT-43" and "OT-108" tasks come from Harvard. This is private dataset with 3944 slides in train set and 1620 slides in test set. First I read test set never exposed during self-supervised learning (page 28). I mean with no disrespect but how can we trust authors when only UNI model (trained on Harvard data) does well on this task out of all models? I dont mean by accusing you and I know this paper is first to study SSL contamination in pathology but this is suspicious. Plus all UNI ablations on smaller datasets do better (Extended Data Table 13). So you are saying UNI use 1M images is greater than CTransPath and REMEDIS which has many more? Many papers like yours and Virchow show more data trains better model so why not case here?

Yes, you can have me on record - we did not use the test set in OncoTree (OT)-43/108 for any pretraining purposes. Collecting the slides in this study required a lot of physical labor and involved almost every author on this study, and we are quite proud of the fact that this is the first published study to make a private foundation model for computational pathology made public for the research community.

Both OT-108 and Mass-100K were indeed curated at MGB (specifically BWH and MGB), but including the test set in OT-108 does not make any sense for us. In going through the diligence in curating giant pretraining dataset like Mass-100K, it is pointless if we ultimately “pretrained on everything” and we cannot fairly evaluate our model on a challenging benchmark. In addition:

The first to study SSL data contamination is actually Xiang et al. 2023 ICLR. This is cited.
Other models actually do well on OT-43/108 as well. See our comparisons in the GitHub README.
Not all SSL models pretrained on MGB data performed well. See experiments in the camera-ready version where we also trained MoCov3 on internal data.

Models trained on tiny datasets can certainly do well. See the original DINOv2 work. DINOv2 ViT-L on 1M images is quite strong. See how well DINOV2 ViT-L models do across IN1K, IN1K+IN22K, 142M (uncurated), and 142M (curated) images. Without data curation, your ViT-L on 142M uncurated images will underperform that of 1M images. In the original claims of DINOv2, they do not emphasize at all that data scaling is everything. Instead, having high-quality and diverse data is arguably more important (until other experiments are shown). Data scaling in vision models is simply not the same (yet) as data scaling in LLMs.

Richarizardd commented 6 days ago

This question is with above issue too. I read also "OT-43" and "OT-108" is only slide task with ABMIL trained with no early stopping why is that? Last you do "interesting" in OT-43 and OT-108 where you cluster-sample patches and extract patch features for each slide (page 29). These are used into MIL. Because you do on both train and test is this not data contamination? This issue is suspicious like one in CHIEF (pretraining distribution hms-dbmi/CHIEF#20 (comment)) also from Harvard lab do you know?

On OT-108, we performed local clustering at the patch-level of each WSI. Transductive inference means you have access to test dataset distribution at train time (this could actually be for many valid and legitimate research purposes). However, this is not transductive inference as at test-time it is reasonable to assume you have access to the entire WSI (sample). We make this assumption for all WSIs when we preprocess them (we patch and extract features for all patches in the WSI before MIL, K-means clustering can be seen as an additional image preprocessing step). In other words, we have access to all patch-level data within a sample (a single WSI) at test time but we do not assume we can access all samples (WSIs) in the dataset. If we were clustering globally (across all WSIs) and then sampling patches closest to each centroid in the WSI - then yes I would agree that this is transductive inference. Lastly, we also show results without any clustering for fair comparison. We did not perform any early stopping because many classes in OT-108 had as few as 5 slides in the train set and there was not enough data to create a validation set for early stopping.

tranh215980 commented 5 days ago

Dear @Richarizardd ,

Thank you for bringing many answers. I misjudge you and many answers are in newest paper version. Forgive my grave mistake and I can reread again.

mahmoodlab / UNI

questions about baseline + code #41