miccaiif / TOP

[NeurIPS 2023] The Rise of AI Language Pathologists: Exploring Two-level Prompt Learning for Few-shot Weakly-supervised Whole Slide Image Classification

Data split for all the few-shot experiments? #3

Open Lewislou opened 6 months ago

Lewislou commented 6 months ago

Hi, what's your data split for the few-shot experiments? For example, in the one-shot or two-shot setting, how do you split the training/validation set?

miccaiif commented 6 months ago

> Hi, what's your data split for the few-shot experiments? For example, in the one-shot or two-shot setting, how do you split the training/validation set?

Hello! Thanks for your attention. The data split for the few-shot experiments is controlled directly by the random seed in the code, e.g. the seed in exp_CAMELYON.sh. We did not manually split the dataset because that would be inconvenient. For a given number of shots, the experimental settings are the same across the different methods. You can benchmark the dataset with a seed of your choice and then keep that setting when implementing the comparison methods. Also, few-shot WSI classification indeed needs a benchmark dataset, and that is a future direction for us.
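For anyone trying to reproduce this, here is a minimal sketch of what such seed-controlled few-shot sampling can look like (an illustration only, not the repository's actual code; `sample_few_shot_split`, `slide_ids`, and `labels` are hypothetical names):

```python
import random

def sample_few_shot_split(slide_ids, labels, n_shots, seed):
    """Pick n_shots training WSIs per class, fully determined by the seed;
    all remaining slides form the evaluation pool."""
    rng = random.Random(seed)              # local RNG so the split depends only on the seed
    by_class = {}
    for sid, lab in zip(slide_ids, labels):
        by_class.setdefault(lab, []).append(sid)

    train_ids = []
    for lab in sorted(by_class):
        ids = sorted(by_class[lab])        # fix ordering so the seed alone determines the draw
        train_ids += rng.sample(ids, n_shots)

    train_set = set(train_ids)
    eval_ids = [sid for sid in slide_ids if sid not in train_set]
    return train_ids, eval_ids

# Example: a reproducible 2-shot split of a toy slide list
# train, test = sample_few_shot_split(
#     ["s1", "s2", "s3", "s4", "s5", "s6"],
#     ["tumor", "normal", "tumor", "normal", "tumor", "normal"],
#     n_shots=2, seed=0)
```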

Lewislou commented 6 months ago

Hi,

Thanks for your quick response! I have diligently tried seeds 0, 128, 192, 111111, and 101012, but have found it challenging to replicate the high AUC results reported in your paper. Would it be possible for you to share the specific seed, or the training/testing slide names, used for the Camelyon and NSCLC datasets? This would greatly help us follow your implementation accurately and could significantly extend the impact of your work.

Eli-YiLi commented 5 months ago

This setting is not reasonable, because few-shot performance depends heavily on the data split. Repeating the experiment five times with different data splits is much more standard than using a fixed split for every run. If a single favorable split is used, the results are hard to reproduce.

miccaiif commented 1 month ago

Hi,

> Thanks for your quick response! I have diligently tried seeds 0, 128, 192, 111111, and 101012, but have found it challenging to replicate the high AUC results reported in your paper. Would it be possible for you to share the specific seed, or the training/testing slide names, used for the Camelyon and NSCLC datasets? This would greatly help us follow your implementation accurately and could significantly extend the impact of your work.

Thank you very much for your response, and I apologize for the delayed reply.

In the challenging few-shot WSI setting, we found that the results are highly sensitive to the random seed. Specifically, different selections of few-shot WSIs can significantly affect the outcomes, particularly when the chosen training samples are not representative or when there is a substantial mismatch between the selected samples and the corresponding language prompts. Unfortunately, there is currently no standardized benchmark dataset for few-shot WSI classification. In our code, we assessed performance across various seed settings to ensure consistency, applying the same seed to all methods.

During training on the server, we relied on the seeds to preserve the settings and recorded the results manually in a notepad; we did not create any manual dataset splits. When my schedule permits, I will take the time to identify a dataset split that is better suited to this method.

miccaiif commented 1 month ago

> This setting is not reasonable, because few-shot performance depends heavily on the data split. Repeating the experiment five times with different data splits is much more standard than using a fixed split for every run. If a single favorable split is used, the results are hard to reproduce.

Thank you for your comment. Indeed! In our initial submission, we evaluated the results with experiments over 5 random seeds, calculating the average and variance across 5 random splits while keeping the settings identical for all methods. However, most reviewers criticized this use of 5 random seeds and suggested using the same seed for all five runs. Therefore, in the final version, we followed this suggestion and ran each experiment five times with the same seed, again keeping the settings identical across all methods.
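To make the difference between the two protocols concrete, here is a small sketch of the multi-seed evaluation described above (the `run_experiment` entry point is hypothetical, not part of this repository): averaging over distinct seeds captures variation across few-shot splits, whereas rerunning a single reused seed captures only run-to-run training variance.

```python
import numpy as np

def evaluate_over_seeds(run_experiment, seeds=(0, 1, 2, 3, 4)):
    """Run the same configuration once per seed and report mean / std of the test AUC.
    run_experiment(seed) is a hypothetical entry point that builds the few-shot
    split from the seed, trains the model, and returns the test AUC."""
    aucs = [run_experiment(seed) for seed in seeds]
    return float(np.mean(aucs)), float(np.std(aucs))
```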