mahmoodlab / CONCH

A vision-language foundation model for computational pathology - Nature Medicine

Why are my zero-shot classification results on CRC-100K much higher than the results reported in your paper? #5

Closed Lewislou closed 3 months ago

Lewislou commented 3 months ago

Hi,

I evaluated your model on the CRC-100K test set using the provided Jupyter notebook, but the balanced accuracy and F1-score are much higher than the results you reported in the supplementary material. Why? Did I do anything wrong? Here are my results: [image]

These are your results from the Supplementary Material: [image]

Lewislou commented 3 months ago

Hi,

I also tried the TCGA-NSCLC and Camelyon16 datasets for zero-shot classification using your published model weights. On a TCGA-NSCLC subset (75 LUAD / 75 LUSC), the subtyping accuracy is very high, about 0.95 balanced accuracy (close to supervised methods). However, on the Camelyon16 test set (129 WSIs), the tumor classification result is very low, about 0.604 balanced accuracy. Why is that?

Did you use the TCGA-NSCLC dataset or the CRC-100K dataset during the training of the released CONCH model?

fedshyvana commented 3 months ago

Hi, it looks like you are using prompt ensembling (i.e., by default, multiple classnames/templates are ensembled for each class), which means you should look at Supplementary Tables 1-7 for the zero-shot performance we reported in the paper. The screenshot you showed, Supplementary Table 13, is for the setting without prompt ensembling (i.e., sampling a single prompt) and is therefore lower.
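To make the distinction concrete, here is a minimal PyTorch sketch of the two settings. The `encode_text` function below is a random stand-in for a CLIP-style text encoder (not the actual CONCH API), and the class names and templates are illustrative only:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM = 512  # placeholder embedding dimension

def encode_text(prompts: list[str]) -> torch.Tensor:
    """Stand-in for a CLIP-style text encoder (NOT the real CONCH API):
    returns one L2-normalized embedding per prompt."""
    return F.normalize(torch.randn(len(prompts), DIM), dim=-1)

def ensembled_class_embedding(classnames: list[str], templates: list[str]) -> torch.Tensor:
    """Prompt ensembling: embed every classname x template combination,
    average the embeddings, and re-normalize."""
    prompts = [t.format(c) for t in templates for c in classnames]
    return F.normalize(encode_text(prompts).mean(dim=0), dim=-1)

def single_prompt_class_embedding(classname: str, template: str) -> torch.Tensor:
    """Single-prompt setting (as in Supplementary Table 13): one classname,
    one template, no averaging."""
    return encode_text([template.format(classname)])[0]

# Illustrative synonyms/templates for one class:
tumor_emb = ensembled_class_embedding(
    classnames=["tumor", "tumor epithelium"],
    templates=["an H&E image of {}.", "a photomicrograph of {}."],
)
```

Ensembling averages out the variance of any single phrasing, which is why the ensembled numbers run higher than the single-prompt ones.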

It is expected that the zero-shot performance on C16 would be low, since MI-Zero was designed for subtyping problems and is not well suited for some other problems such as tumor detection (e.g., positive vs. negative). This is due to the nature of the top-k pooling in MI-Zero (which aggregates, e.g., the top-k most similar tiles for each class). In a positive slide there will still be both positive and negative tiles, so the similarity scores for both classes may be equally high, making top-k pooling ill-suited for guiding the slide-level prediction. We briefly discussed this limitation of the naive MI-Zero algorithm in the discussion, and note that modifications to the pooling function may help extend the zero-shot utility of MI-Zero to these types of problems, but leave it to future work to carefully validate.
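For intuition, a minimal sketch of top-k pooling as described above (random features and hypothetical shapes, not the reference implementation): the scores are cosine similarities between tile and class embeddings, and each class's slide-level score is the mean of its k highest tile similarities.

```python
import torch
import torch.nn.functional as F

def topk_pooled_prediction(tile_embs: torch.Tensor,
                           class_embs: torch.Tensor,
                           k: int = 50) -> int:
    """Slide-level zero-shot prediction via top-k pooling.

    tile_embs:  (n_tiles, dim) L2-normalized tile embeddings of one WSI
    class_embs: (n_classes, dim) L2-normalized class (text) embeddings
    Returns the index of the predicted class.
    """
    sims = tile_embs @ class_embs.T                  # (n_tiles, n_classes) cosine similarities
    pooled = sims.topk(k, dim=0).values.mean(dim=0)  # mean of each class's k best tiles
    return int(pooled.argmax())

# A positive slide contains both tumor and normal tiles, so *both* columns of
# `sims` have k high entries and the two pooled scores can be nearly tied;
# that is the failure mode for detection-style (positive vs. negative) tasks.
tiles = F.normalize(torch.randn(1000, 512), dim=-1)
classes = F.normalize(torch.randn(2, 512), dim=-1)
pred = topk_pooled_prediction(tiles, classes, k=50)
```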

Lastly, just to clarify, we did not use any of the downstream evaluation benchmarks in the training of the released CONCH model.

Lewislou commented 3 months ago

Hi,

Thanks for your quick response. That fully answers my questions.