Great idea and significant performance improvement, but the code is complicated and somewhat disorganized, making it difficult to reproduce the results.

NBitBuilder commented 4 months ago

Thank you for your contributions and for providing the open-source code! This repository explains some key module implementations well. However, the framework adapted from XLIP makes it difficult to read (especially functions and variables of 'videos'), and there is redundant code that creates distractions and hinders understanding.

I decided to reimplement the code and used the implementations of the TFS module and answer augmentation function, along with the text descriptions you provided. I observed that the training accuracy, specifically the slide-reports matching accuracy for TCGA lung cancer, reached 65%, which is a decent number considering there are around 800 pairs. However, the validation accuracy for lung classification (LUAD vs. LUSC) is like a random guess at 50%. I am unsure why this is happening.

Additionally, I noticed another paper presented at CVPR 2024 on slide-level vision-language training (https://github.com/Jiangbo-Shi/ViLa-MIL). The reported accuracy for lung classification with 16-shot learning is only 67.7% (yours are 91.25%). I am not sure what causes this significant difference!

I would appreciate more insights into the pros and cons of this paper compared to ViLa. I have also left notes for the author of ViLa. Moreover, it would be helpful if you could refactor the code further for better readability and result reproduction.

Thanks

Here are the results from ViLa.

ls1rius commented 4 months ago

We apologize for the inconvenience caused by the redundant content in the code. We will work on improving the code and removing the redundant codes to enhance readability.

Our code offers two methods: 1) training with pre-trained parameters provided by DSMIL, and 2) end-to-end training. This paper primarily focuses on experiments conducted using the first method. If you opt for the end-to-end method, you can perform additional image encoder pre-training, such as using SIMCLR following DSMIL method, which could potentially enhance performance. However, due to our subsequent use of pre-training with pairs of pathological images and reports, the performance improvement was not substantial. If you're using pre-trained parameters and facing issues, please check your training strategy and ensure the number of training epochs is sufficient. We recommend following our settings. These factors greatly influence training effectiveness.

FiVE enhances its generalization by pre-training on pathological images and reports, which is inspired by the rich fine-grained samples used in the recently popular LLM models. Essentially, FiVE can be summarized as a pre-training work. On the other hand, ViLa-MIL is an interesting and impressive work that primarily focuses on training with the current limited dataset without introducing additional data. Given that the training data and task objectives for FiVE and ViLa-MIL are different, a direct comparison of their experimental results might be inappropriate.

Please feel free to contact me if you encounter any other issues or have further questions.

NBitBuilder commented 4 months ago

Thank you for your reply. I will check the implementation details.

NBitBuilder commented 4 months ago

Hi,

I tried to reproduce the TCGA lung cancer binary classification using your weight file "five_fix_pth_95.4.pth" and your code without making any changes.

However, I encountered two issues:

The zero-shot classification accuracy for binary classification (LUAD vs. LUSC) is significantly higher than the values reported in Table 3 of your paper. Around 90% on the held-out validation 87 samples 'LUAD_LUSC_data_val_reid'. Could you explain why this might be? Is there any possibility of data leakage during training and zero-shot classification?

The zero-shot classification accuracy for the LUAD subtypes is only 13% (approx 1/8, random guess), much lower than the values reported in Table 2, around 60%. Here are the subtype labels I used.

Could you please provide any explanations for these discrepancies?

Thank you.

subtypes.csv subtype_labels.csv

ls1rius commented 4 months ago

Thank you for your interest in our work.

The data in LUAD_LUSC_data_train_reid.csv and LUAD_LUSC_data_val_reid.csv partially overlap. Although binary classification labels are not used in the training process, there may still be risks of data leakage. Instead, please use the file LUAD_LUSC_data_desc_reid_delval.csv as the training data. Additionally, due to the sampling strategy used in our training method, the performance on the validation set fluctuates significantly. We have discussed these performance fluctuations in section 4.6.2 of the article. Additionally, we conducted multiple experiments with different splits of the training set and validation set, which also contributed to the performance fluctuations. Consequently, our experimental results are based on the average performance across multiple experiments, which may explain the seemingly high performance outcomes.

It is recommended not to use weight file when you are doing subtype experiments, because it is fine-tuned on the binary classification task and will lose some generalization ability. It is recommended that you start training from scratch or re-fine-tune on the wsi-report data.

Please feel free to contact me if you encounter any other issues or have further questions.