ls1rius / WSI_FiVE

Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction

Great idea and significant performance improvement, but the code is complicated and somewhat disorganized, making it difficult to reproduce the results. #6

Open NBitBuilder opened 2 weeks ago

NBitBuilder commented 2 weeks ago

Thank you for your contributions and for providing the open-source code! This repository explains some key module implementations well. However, the framework adapted from XLIP makes it difficult to read (especially the functions and variables named after 'videos'), and there is redundant code that creates distractions and hinders understanding.

I decided to reimplement the code using the TFS module, the answer augmentation function, and the text descriptions you provided. I observed that the training accuracy, specifically the slide-report matching accuracy on TCGA lung cancer, reached 65%, which is a decent number considering there are around 800 pairs. However, the validation accuracy for lung classification (LUAD vs. LUSC) is essentially a random guess at 50%, and I am unsure why this is happening.
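For reference, the matching accuracy I report is computed roughly as in the sketch below; the function name and tensor shapes are my own for illustration and do not come from this repository.

```python
import torch
import torch.nn.functional as F

def matching_accuracy(slide_emb: torch.Tensor, report_emb: torch.Tensor) -> float:
    """Top-1 slide-to-report retrieval accuracy within a batch.

    slide_emb, report_emb: (N, D) embeddings of N paired slides and reports.
    Slide i counts as correctly matched if its most similar report
    (by cosine similarity) is report i.
    """
    slide_emb = F.normalize(slide_emb, dim=-1)
    report_emb = F.normalize(report_emb, dim=-1)
    sim = slide_emb @ report_emb.t()               # (N, N) similarity matrix
    pred = sim.argmax(dim=1)                       # best-matching report per slide
    target = torch.arange(sim.size(0), device=sim.device)
    return (pred == target).float().mean().item()
```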

Additionally, I noticed another CVPR 2024 paper on slide-level vision-language training (https://github.com/Jiangbo-Shi/ViLa-MIL). Its reported accuracy for lung classification with 16-shot learning is only 67.7% (yours is 91.25%), and I am not sure what causes this significant difference.

I would appreciate more insight into the pros and cons of this paper compared to ViLa-MIL. I have also left notes for the author of ViLa-MIL. Moreover, it would be helpful if you could refactor the code further to improve readability and make the results easier to reproduce.

Thanks

Here are the results from ViLa.

[image attachment: ViLa-MIL results table]

ls1rius commented 2 weeks ago

We apologize for the inconvenience caused by the redundant content in the code. We will work on improving the code and removing the redundant parts to enhance readability.

Our code offers two training methods: 1) training with the pre-trained parameters provided by DSMIL, and 2) end-to-end training. This paper primarily reports experiments conducted with the first method. If you opt for the end-to-end method, you can perform additional image-encoder pre-training, for example SimCLR following the DSMIL approach, which could further improve performance; however, because we subsequently pre-train on pairs of pathological images and reports, the additional improvement was not substantial. If you are using the pre-trained parameters and still facing issues, please check your training strategy and ensure the number of training epochs is sufficient. We recommend following our settings, as these factors greatly influence training effectiveness.
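To make the two options concrete, here is a minimal sketch of how they differ; the class and argument names below are illustrative and are not the identifiers used in our code.

```python
import torch
import torch.nn as nn

class SlideEncoder(nn.Module):
    """Illustrative slide encoder with two operating modes.

    mode="precomputed": patch features are extracted offline with the
        DSMIL-pretrained encoder and only the aggregator is trained.
    mode="end_to_end":  the patch encoder is trained jointly (optionally
        initialized from SimCLR-style pre-training).
    """
    def __init__(self, patch_encoder: nn.Module, aggregator: nn.Module,
                 mode: str = "precomputed"):
        super().__init__()
        assert mode in ("precomputed", "end_to_end")
        self.mode = mode
        self.patch_encoder = patch_encoder
        self.aggregator = aggregator
        if mode == "precomputed":
            # Freeze the patch encoder; only the aggregator receives gradients.
            for p in self.patch_encoder.parameters():
                p.requires_grad = False

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, C, H, W) raw patches for one slide, or
        # (num_patches, D) precomputed features when mode == "precomputed".
        if self.mode == "precomputed" and patches.dim() == 2:
            feats = patches                       # already DSMIL features
        else:
            feats = self.patch_encoder(patches)   # (num_patches, D)
        return self.aggregator(feats)             # slide-level embedding
```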

FiVE improves its generalization by pre-training on pathological images and reports, inspired by the rich, fine-grained samples used in recently popular LLMs; essentially, FiVE can be summarized as a pre-training work. ViLa-MIL, on the other hand, is an interesting and impressive work that primarily focuses on training with the current limited dataset, without introducing additional data. Given that the training data and task objectives of FiVE and ViLa-MIL differ, a direct comparison of their experimental results might be inappropriate.

Please feel free to contact me if you encounter any other issues or have further questions.

NBitBuilder commented 2 weeks ago

Thank you for your reply. I will check the implementation details.