ls1rius / WSI_FiVE

Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction

Great idea and significant performance improvement, but the code is complicated and somewhat disorganized, making it difficult to reproduce the results. #6

Closed NBitBuilder closed 2 months ago

NBitBuilder commented 5 months ago

Thank you for your contributions and for providing the open-source code! This repository explains some key module implementations well. However, the framework adapted from XLIP makes the code difficult to read (especially the functions and variables named after 'videos'), and there is redundant code that creates distractions and hinders understanding.

I decided to reimplement the code, reusing your implementations of the TFS module and the answer-augmentation function, along with the text descriptions you provided. The training accuracy, specifically the slide-report matching accuracy on TCGA lung cancer, reached 65%, which is decent given that there are only around 800 pairs. However, the validation accuracy for lung classification (LUAD vs. LUSC) sits at 50%, no better than random guessing, and I am unsure why.
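For reference, here is a minimal sketch of how I compute the slide-report matching accuracy in my reimplementation (CLIP-style in-batch retrieval; the function name and tensor layout are my own, not from this repository):

```python
import torch
import torch.nn.functional as F

def matching_accuracy(slide_emb: torch.Tensor, report_emb: torch.Tensor) -> float:
    """Top-1 slide->report retrieval accuracy within a batch.

    slide_emb, report_emb: (N, D) tensors where row i of each is a matched pair.
    """
    slide_emb = F.normalize(slide_emb, dim=-1)
    report_emb = F.normalize(report_emb, dim=-1)
    sim = slide_emb @ report_emb.t()   # (N, N) cosine similarities
    pred = sim.argmax(dim=1)           # nearest report for each slide
    target = torch.arange(sim.size(0), device=sim.device)
    return (pred == target).float().mean().item()
```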

Additionally, I noticed another CVPR 2024 paper on slide-level vision-language training (https://github.com/Jiangbo-Shi/ViLa-MIL). Its reported accuracy for lung classification with 16-shot learning is only 67.7%, versus your 91.25%. I am not sure what causes this significant difference!

I would appreciate more insights into the pros and cons of this paper compared to ViLa. I have also left notes for the author of ViLa. Moreover, it would be helpful if you could refactor the code further for better readability and result reproduction.

Thanks

Here are the results from ViLa.

[image: ViLa-MIL reported results table]

ls1rius commented 5 months ago

We apologize for the inconvenience caused by the redundant content in the code. We will work on improving the code and removing the redundant parts to enhance readability.

Our code offers two training methods: 1) training with the pre-trained parameters provided by DSMIL, and 2) end-to-end training. The paper's experiments primarily use the first method. If you opt for end-to-end training, you can add extra image-encoder pre-training, e.g., SimCLR following the DSMIL recipe, which could potentially enhance performance; however, because we subsequently pre-train on pairs of pathology images and reports, the improvement was not substantial. If you are using the pre-trained parameters and facing issues, please check your training strategy and ensure the number of training epochs is sufficient; we recommend following our settings, as these factors greatly influence training effectiveness.
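Concretely, the difference between the two methods boils down to whether the patch encoder is frozen. A rough sketch (the function and module names here are illustrative placeholders, not the actual classes in our code):

```python
import torch

def build_model(mode: str, patch_encoder: torch.nn.Module,
                aggregator: torch.nn.Module) -> torch.nn.Module:
    """Method 1 ('fixed'): keep the DSMIL-pretrained patch encoder frozen
    and train only the aggregation/alignment layers.
    Method 2 ('end2end'): train the patch encoder together with the rest."""
    if mode == "fixed":
        for p in patch_encoder.parameters():
            p.requires_grad = False   # reuse the DSMIL weights as-is
        patch_encoder.eval()          # no dropout/BN updates in the frozen part
    elif mode != "end2end":
        raise ValueError(f"unknown mode: {mode!r}")
    return torch.nn.Sequential(patch_encoder, aggregator)
```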

FiVE gains its generalization from pre-training on pathology images and reports, inspired by the rich fine-grained samples used by recently popular LLMs. Essentially, FiVE can be summarized as a pre-training work. ViLa-MIL, on the other hand, is an interesting and impressive work that focuses on training with the current limited dataset, without introducing additional data. Given that the training data and task objectives of FiVE and ViLa-MIL differ, a direct comparison of their experimental results may be inappropriate.

Please feel free to contact me if you encounter any other issues or have further questions.

NBitBuilder commented 5 months ago

Thank you for your reply. I will check the implementation details.

NBitBuilder commented 4 months ago

Hi,

I tried to reproduce the TCGA lung cancer binary classification using your weight file "five_fix_pth_95.4.pth" and your code without making any changes.

However, I encountered two issues:

The zero-shot classification accuracy for the binary task (LUAD vs. LUSC) is significantly higher than the values reported in Table 3 of your paper: around 90% on the 87 held-out validation samples in 'LUAD_LUSC_data_val_reid'. Could you explain why this might be? Is there any possibility of data leakage between training and the zero-shot evaluation?

The zero-shot classification accuracy for the LUAD subtypes is only about 13% (approximately 1/8, i.e., a random guess over eight classes), much lower than the roughly 60% reported in Table 2. The subtype labels I used are attached below, along with a sketch of my evaluation setup.

Could you please provide any explanations for these discrepancies?

Thank you.

subtypes.csv subtype_labels.csv
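My zero-shot evaluation follows the standard CLIP-style prompting scheme sketched below (`encode_text` and `encode_slide` are placeholders for the model's encoders, not names from your code); with eight subtype prompts, a random guess is 1/8 ≈ 13%:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(model, slide_feats: torch.Tensor, prompts: list[str]) -> int:
    """Assign the slide to the subtype whose text prompt is most similar."""
    text_emb = F.normalize(model.encode_text(prompts), dim=-1)        # (C, D)
    slide_emb = F.normalize(model.encode_slide(slide_feats), dim=-1)  # (D,)
    return int((text_emb @ slide_emb).argmax().item())                # class index
```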

ls1rius commented 4 months ago

Thank you for your interest in our work.

The data in LUAD_LUSC_data_train_reid.csv and LUAD_LUSC_data_val_reid.csv partially overlap. Although the binary classification labels are not used during training, there is still a risk of data leakage; please use LUAD_LUSC_data_desc_reid_delval.csv as the training data instead. Additionally, due to the sampling strategy used in our training method, performance on the validation set fluctuates significantly; we discuss these fluctuations in Section 4.6.2 of the paper. We also ran multiple experiments with different train/validation splits, which contributes further to the variation. Consequently, our reported results are averages across multiple experiments, which may explain why a single run can look unusually high.

We recommend not using that weight file for the subtype experiments, because it has been fine-tuned on the binary classification task and loses some generalization ability. Instead, train from scratch or re-fine-tune on the WSI-report data.

Please feel free to contact me if you encounter any other issues or have further questions.

junjianli106 commented 3 months ago


During my reproduction, I found that zero-shot binary classification can reach 94% (measured during training), using LUAD_LUSC_data_desc_reid_delval.csv as the training data. For the zero-shot results, did you test with the last epoch, or with the best checkpoint during training?

ls1rius commented 3 months ago

We take the average. Overall zero-shot performance also fluctuates considerably, so the average is more reliable. We tested both the fixed-pth and end-to-end settings and report the combined average of the two.

junjianli106 commented 3 months ago


@ls1rius Thank you very much for your patient answers, but I still have some questions.

  1. For zero-shot, is the average taken over all epochs, over certain epochs, or over the last epoch of the fixed-pth and end-to-end settings?
  2. Is few-shot also averaged over multiple epochs?
  3. The paper mentions fine-grained pre-training on the TCGA dataset, followed by fully supervised WSI classification experiments on Camelyon16 and TCGA Lung (Table 4). Do the fully supervised experiments combine pathology and text information (label prompts) for the classification task?
ls1rius commented 3 months ago

1. We start collecting values and computing the average after the loss converges; we also repeat the computation with several different train/val splits. Both the fixed-pth and end-to-end settings are evaluated, and we take the overall average.
2. Yes.
3. Fully supervised means fine-tuning with labeled data after the zero-shot stage.
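In code terms, the evaluation protocol is roughly the following sketch (`eval_fn` and `make_split_fn` are illustrative placeholders, not functions in our repository):

```python
import statistics

def averaged_zero_shot_accuracy(eval_fn, make_split_fn, n_splits: int = 5) -> float:
    """Average zero-shot accuracy over several random train/val splits and
    over both training settings (fixed pth and end-to-end)."""
    scores = []
    for seed in range(n_splits):
        train_ids, val_ids = make_split_fn(seed)   # fresh train/val split
        for mode in ("fixed", "end2end"):          # evaluate both settings
            # eval_fn trains until the loss converges, then returns the
            # zero-shot accuracies collected after convergence
            scores.extend(eval_fn(mode, train_ids, val_ids))
    return statistics.mean(scores)
```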

junjianli106 commented 3 months ago


@ls1rius Hello, sorry to bother you again. Using the pre-trained weights to test the zero-shot LUAD subtyping task, my top-1 accuracy is also only around 13%, using the files provided by the issue author above (subtypes.csv, subtype_labels.csv). Could the subtype contents be inaccurate? (Testing on the LUAD vs. LUSC binary zero-shot task works fine.) Also, could you provide the CSV files for zero-shot LUAD and LUSC subtyping? Thank you very much.

ls1rius commented 3 months ago

The files are fine. The pre-trained pth generalizes somewhat poorly; we recommend end-to-end training.

junjianli106 commented 3 months ago


OK, thanks. Could you provide the CSV files for zero-shot LUAD and LUSC subtyping? Thank you very much.

ls1rius commented 3 months ago

The files provided by the issue author can be used. The original file is ./gpt_preprocess/luad_tcga_pub_clinical_data.tsv. It contains only LUAD subtype data; we have not yet found LUSC subtype data.

junjianli106 commented 3 months ago


Aren't the TCGA-LUSC results in Table 2 of the paper the LUSC subtyping results?

junjianli106 commented 3 months ago


Sorry, I misread it: the model is pre-trained on LUSC and LUAD, and then zero-shot is performed on LUAD.