med-air / Endo-FM

[MICCAI'23] Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train
Apache License 2.0
154 stars 15 forks

Question about downstream task test result #7

Closed ZhangT-tech closed 4 months ago

ZhangT-tech commented 9 months ago

Hi,

I was running the experiment for the PolypDiag downstream task. When I use the fine-tuned weights as the pretrained model weights, every run of test_finetune_polypdiag.sh gives me a different test result. May I know why this happens? Supposedly, with the same model and the same test split, every test run should give the same result, right?

Kyfafyd commented 9 months ago

Hi, thanks for your interest! Could you share the results you have got?

ZhangT-tech commented 9 months ago

Hi, thanks for your follow-up:

The results are not stable. On the first run of the experiment, eval_polypdiag_finetune gives me a consistent answer, e.g. 91.6%, and it stays the same when I rerun the test. However, if I reuse the same pretrained weights I just got from that last eval, it gives me a different number, like 80%, 76%, and so on. I don't see why, with the same model on the same test dataset, the F1 score would be different. Have you had this problem?

10086ddd commented 6 months ago

I also encountered this problem, and my test results were even lower. Since my training environment does not support distributed training, I commented out the related code. Could this also be related?

Kyfafyd commented 6 months ago

Hi, @10086ddd I have found that the model was not in eval mode during testing, and the saved model only contained the linear classifier, without the fine-tuned backbone. These issues have now been fixed, so you can pull the latest code and try again! https://github.com/med-air/Endo-FM/blob/f9136ebc5fe28869d3b28a50fb734eab0d25c2b0/eval_finetune.py#L218 https://github.com/med-air/Endo-FM/blob/f9136ebc5fe28869d3b28a50fb734eab0d25c2b0/eval_finetune.py#L168-L175
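The two fixes described above can be sketched as follows; the tiny model here is hypothetical and only illustrates the mechanism (dropout in train mode makes repeated forward passes disagree, and a checkpoint must contain both components), not the repo's actual code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for Endo-FM's backbone and linear head.
backbone = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
classifier = nn.Linear(8, 2)

# Fix 1: switch to eval mode so dropout stops injecting randomness.
backbone.eval()
classifier.eval()

x = torch.randn(1, 8)
out1 = classifier(backbone(x))
out2 = classifier(backbone(x))
assert torch.equal(out1, out2)  # identical now that dropout is disabled

# Fix 2: checkpoint the fine-tuned backbone together with the classifier,
# not the classifier alone.
torch.save({"backbone": backbone.state_dict(),
            "classifier": classifier.state_dict()}, "checkpoint.pth")
```

Without `eval()`, the same input would produce different logits on every forward pass, which matches the unstable test scores reported above.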

10086ddd commented 6 months ago

Hi, thanks for your follow-up @Kyfafyd! I will test the new code soon and let you know the results. Thanks again for your work. Also, a small question: if I want to switch to another endoscopic dataset for a downstream classification task, do I only need to fine-tune on it first and then test directly?

Kyfafyd commented 6 months ago

Hi, @10086ddd Yes, that's exactly right! You can refer to this issue for more details: https://github.com/med-air/Endo-FM/issues/12

10086ddd commented 6 months ago

Hi, @Kyfafyd I tested the new code today, five times in total. As before, due to environment issues, I commented out the distributed-training code. But the five test results were 74.8%, 34.1%, 83.8%, 45.4%, and 85.7%, respectively. Does this mean the model is still unstable? Additionally, I have a question about the downstream classification labels in the PolypDiag dataset: is every frame of a video labeled as abnormal treated as diseased?

Kyfafyd commented 6 months ago

Hi @10086ddd Are you using the latest updated model from the readme? https://mycuhk-my.sharepoint.com/personal/1155167044_link_cuhk_edu_hk/_layouts/15/onedrive.aspx?id=%2Fpersonal%2F1155167044%5Flink%5Fcuhk%5Fedu%5Fhk%2FDocuments%2FEndo%2DFM%2Fdownstream%5Fweights%2Fpolypdiag%2Epth&parent=%2Fpersonal%2F1155167044%5Flink%5Fcuhk%5Fedu%5Fhk%2FDocuments%2FEndo%2DFM%2Fdownstream%5Fweights&ga=1 My test result with this model is consistently 91.5% across runs.

The PolypDiag task is video-level: it diagnoses whether each video shows disease or not.

10086ddd commented 6 months ago

Hi, @Kyfafyd Sorry, that was my mistake. I did forget to use the latest updated model. Also, regarding the PolypDiag dataset: if I later want to use diseased and disease-free endoscopic videos for fine-tuning and testing, would diseased videos need to include diseased areas in every frame?

Kyfafyd commented 6 months ago

Hi, @10086ddd Yes, this is a classification task, so no region annotations are needed. If you want to perform lesion detection, you can refer to STFT in this repo.

10086ddd commented 6 months ago

Hi, @Kyfafyd Thank you for your work and your answer. I did mean the classification task. Previously, I was asking whether every frame in a video must belong to the same category; now it seems that is unnecessary?

Kyfafyd commented 6 months ago

Hi, @10086ddd Note that this task is to recognize whether a video as a whole is diseased or not, so that is not necessary.

10086ddd commented 6 months ago

OK, thanks @Kyfafyd

10086ddd commented 6 months ago

Hi, @Kyfafyd Sorry to disturb you again, but after using the latest updated model yesterday, the test results are still unstable. I noticed that the source code uses distributed training; because my environment does not support it, I commented out that part of the code, but made no other modifications. If I train on only one GPU, do I need not only to comment out the distributed-training code but also to modify the model's parameters and configuration files?

Kyfafyd commented 6 months ago

Hi, @10086ddd You may test the model in a distributed environment; even a single GPU can set up a distributed scenario.
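A single process can indeed form a valid process group of world size 1, so the repo's distributed code can run unmodified. A minimal sketch using the CPU `gloo` backend (the address and port values here are arbitrary examples):

```python
import os
import torch.distributed as dist

# Rendezvous settings for a single local process; any free port works.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# world_size=1, rank=0: one process (one GPU, or CPU with gloo) still
# counts as a distributed setup, so DDP-style code needs no changes.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
assert dist.get_world_size() == 1

dist.destroy_process_group()
```

In practice, launching the unmodified script with a single-process launcher (e.g. `torchrun --nproc_per_node=1`) achieves the same thing without editing any code.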

10086ddd commented 6 months ago

Hi, @Kyfafyd Sorry for only replying today. In the meantime, I have been testing on a Linux server with one GPU, and the results are now stable. Thank you for your earlier answer. However, the result stays at 66%, so I think I need to adjust things further.

Kyfafyd commented 6 months ago

Hi, @10086ddd Are you using the latest code and the latest weights?

10086ddd commented 6 months ago

Hi, @Kyfafyd Yes, I downloaded the new project code and weight file again and retested, but the result was still 66.1%.

Kyfafyd commented 6 months ago

Hi @10086ddd I forgot to add the line for loading the updated backbone during testing. Please try the latest code; it should now give the correct result.
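The missing step can be sketched as follows; the two-part model and checkpoint keys here are hypothetical stand-ins, not the repo's actual names. Restoring only the classifier (the bug described above) would leave the backbone at its random initialization, which explains the stuck 66% score:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the fine-tuned backbone and linear head.
backbone = nn.Linear(8, 8)
classifier = nn.Linear(8, 2)
torch.save({"backbone": backbone.state_dict(),
            "classifier": classifier.state_dict()}, "demo_ckpt.pth")

# At test time, restore BOTH parts of the checkpoint.
state = torch.load("demo_ckpt.pth", map_location="cpu")
backbone.load_state_dict(state["backbone"])      # the previously missing line
classifier.load_state_dict(state["classifier"])
backbone.eval()
classifier.eval()
```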

10086ddd commented 6 months ago

Hi, @Kyfafyd The new code has successfully achieved the expected results. Thank you for your work.

10086ddd commented 6 months ago

Hi, @Kyfafyd I have a small question: can the PolypDiag downstream task handle longer videos, for example videos lasting more than 10 minutes?

Kyfafyd commented 5 months ago

Hi, @10086ddd Are you performing video-level or frame-level task?

10086ddd commented 5 months ago

Hi, @Kyfafyd Yes, isn't the PolypDiag downstream task a video-level task?

Kyfafyd commented 5 months ago

PolypDiag is video-level. I think you can try it on videos lasting more than 10 minutes. Increasing the number of frames sampled from each input video may help improve performance (by adding DATA.NUM_FRAMES 16 to the fine-tuning script).

10086ddd commented 5 months ago

OK, thanks for your answer. @Kyfafyd