You have to hold patients out of the training data and measure performance only on new patients. That's it. It sounds like you have frames from one patient in both the training and the test dataset. If that's the case, the result is 100% garbage and entirely meaningless. Think about the clinical setting for one second: when would such a model be useful?
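For illustration, a patient-level split can be done with scikit-learn's `GroupShuffleSplit`, using the patient ID as the group key (a minimal sketch with made-up toy data, not code from this repo):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 frames from 4 patients (all values here are made up).
frames      = [f"frame_{i}.png" for i in range(10)]
labels      = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]          # e.g. 0 = normal, 1 = covid
patient_ids = ["A", "A", "A", "B", "B", "B", "C", "C", "D", "D"]

# GroupShuffleSplit keeps all frames of a patient on one side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(frames, labels, groups=patient_ids))

train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
assert train_patients.isdisjoint(test_patients)  # no patient leaks across the split
```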
I am measuring performance only with new patients. I am not using some frames from patient A for training and other frames from the same patient A for testing. I am using patient A for training and patient B for testing.
My question was more oriented towards the idea that, since I am using frames from the same ultrasound video (for 1 patient in particular, but this would happen with all of them), my training batches will inevitably contain several images of the same patient when training the model. Could that be a problem?
Gotcha, now I understand better. Generally, it's beneficial if the data distribution is preserved within every batch. The theory about samples being IID between training and test data should roughly apply to every batch as well.
In practice, this means you lower the risk of overfitting if you use stratified sampling in the dataloader. You can stratify simultaneously on the class labels and on the videos: if one video contains 10% of your total frames, then every batch should contain roughly 10% of its frames from that video. You have to ensure heterogeneity within every batch; see the sketch below.
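For illustration, a video-stratified batch sampler in PyTorch could look like the following sketch. The class and variable names are hypothetical (this is not code from the paper's repo); the idea is to spread each video's frames evenly across the epoch so that every batch ends up with roughly the dataset-level per-video proportions.

```python
import random
from collections import defaultdict

from torch.utils.data import Sampler


class VideoStratifiedBatchSampler(Sampler):
    """Yields index batches whose per-video proportions roughly match
    the proportions of the full dataset (hypothetical sketch)."""

    def __init__(self, video_ids, batch_size):
        # video_ids[i] is the video that dataset sample i belongs to.
        self.batch_size = batch_size
        self.n = len(video_ids)
        self.by_video = defaultdict(list)
        for idx, vid in enumerate(video_ids):
            self.by_video[vid].append(idx)

    def __iter__(self):
        # Spread each video's (shuffled) frames evenly over positions
        # [0, n), then chunk the resulting order into batches: any window
        # of batch_size indices then holds roughly proportional video counts.
        slots = []
        for idxs in self.by_video.values():
            shuffled = random.sample(idxs, len(idxs))
            step = self.n / len(shuffled)
            for k, idx in enumerate(shuffled):
                slots.append((k * step + random.uniform(0, step), idx))
        order = [idx for _, idx in sorted(slots)]
        for start in range(0, self.n, self.batch_size):
            yield order[start:start + self.batch_size]

    def __len__(self):
        return (self.n + self.batch_size - 1) // self.batch_size
```

You would plug it into a `DataLoader` via the `batch_sampler` argument, e.g. `DataLoader(dataset, batch_sampler=VideoStratifiedBatchSampler(video_ids, 32))`. Stratifying on class labels at the same time works the same way if you use `(label, video)` pairs as the grouping key.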
Oh, I think I get it.
I also read on your paper the following: "Data was split on a patient-level, hence it was ensured that the frames of a single video are present within a single fold only, and that the number of videos per class is similar in all folds."
So you are saying that within my batch I should also avoid having many frames from the same patient, right?
Meaning that my batch of, for example, 6 should be:

- 1 random frame from the ultrasound video of patient A
- 1 random frame from the ultrasound video of patient B
- 1 random frame from the ultrasound video of patient C
- 1 random frame from the ultrasound video of patient D
- 1 random frame from the ultrasound video of patient E
- 1 random frame from the ultrasound video of patient F
and NOT:

- 2 random frames from the ultrasound video of patient A
- 1 random frame from the ultrasound video of patient B
- 3 random frames from the ultrasound video of patient C
- 1 random frame from the ultrasound video of patient D
(just a toy example)
Yup, exactly. A sketch of a sampler that enforces this is below.
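For illustration only (hypothetical names, not code from this repo), a PyTorch batch sampler that puts at most one frame per patient into each batch could look like this:

```python
import random
from collections import defaultdict

from torch.utils.data import Sampler


class OneFramePerPatientBatchSampler(Sampler):
    """Yields index batches containing at most one frame per patient
    (hypothetical sketch)."""

    def __init__(self, patient_ids, batch_size):
        # patient_ids[i] is the patient that dataset sample i belongs to.
        self.batch_size = batch_size
        self.by_patient = defaultdict(list)
        for idx, pid in enumerate(patient_ids):
            self.by_patient[pid].append(idx)

    def __iter__(self):
        # Shuffle each patient's frames, then repeatedly draw one frame
        # from up to batch_size distinct patients until all frames are used.
        pools = {p: random.sample(f, len(f)) for p, f in self.by_patient.items()}
        while pools:
            chosen = random.sample(list(pools), min(self.batch_size, len(pools)))
            yield [pools[p].pop() for p in chosen]
            for p in chosen:
                if not pools[p]:
                    del pools[p]
```

Trailing batches shrink as patients run out of frames, and with fewer patients than the batch size the batches are simply smaller; you'd pass it to a `DataLoader` via `batch_sampler=` as usual.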
Thank you!
Hi!
I am working on a similar project and I am having trouble training a classifier on ultrasound images (covid vs. normal/baseline), given that with lung ultrasound you are working with very similar images (meaning you are training the model on many frames from the "same patient").
Because of that, I assume, my model learns the training set too quickly and too well, leaving no room to improve towards a good performance on my validation/test set (it looks like overfitting).
How do you manage to overcome that issue (if you have it)?
Thanks in advance and really great work!