milesial / Pytorch-UNet

PyTorch implementation of the U-Net for image semantic segmentation with high quality images
GNU General Public License v3.0

Cannot overfit net on a small training set #165

Open sptom opened 4 years ago

sptom commented 4 years ago

Hi there! First of all, thank you for this code! My first run with the network was unsuccessful, so I tried to make a quick sanity check. It is my understanding that a "healthy" neural net should easily overfit the training data when given only a few examples: it should quickly learn to classify them with 100% accuracy by simply "memorising" the images. So I tried to overfit the net on 38 images (35, actually, because 3 are used as the validation set) from the Pascal VOC database. [image]

I use binary masks: black = background, white = object. [image]

Running training for 100 epochs, I still get very poor results; the training loss fluctuates rather heavily around 0.5, as can be seen from the TensorBoard plot and the output to screen: [image]

[image]

I'm quite certain that this behaviour is irregular and that I'm missing something. Do you have an idea what it might be?

The results on the very same training set don't make any sense: [image]

milesial commented 4 years ago

Thank you for the detailed explanation of your problem. From the loss value and the loss plot it seems that your learning rate is way too high. Have you tried lowering it? Maybe divide it by 10 first.

Also, for a sanity check, you can try with even fewer images, like 1 or 2; the training will be faster.
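For what it's worth, here is a minimal, self-contained version of such a sanity check; the tiny network and synthetic data are stand-ins, not the repo's train.py. A healthy setup should drive this loss steadily toward zero rather than plateauing:

```python
# Overfit sanity check on a single (image, mask) pair - illustrative stand-ins only.
import torch
import torch.nn as nn

net = nn.Sequential(                      # toy stand-in for the U-Net
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1))

img = torch.randn(1, 3, 64, 64)
mask = (img.mean(dim=1, keepdim=True) > 0).float()   # simple synthetic target

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

for step in range(500):
    loss = criterion(net(img), mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())          # should decrease steadily, not stall at ~0.5-0.7
```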

juliagong commented 4 years ago

Hi @milesial and @sptom! I'm having this issue of failing to overfit as well, and I'm using 1e-4 as my learning rate and a training set of just a single image. I can't seem to find anything irregular in the training code, but if more than one of us is having this issue, I wonder if there's something we're overlooking? Any insight would be appreciated!

milesial commented 4 years ago

I think your learning rate is too high. For the full Carvana dataset I used an LR of 2e-6.

juliagong commented 4 years ago

Thanks; I unfortunately have also tried learning rates on the order of 1e-6 and got the same results. I also disabled all augmentation, normalization, and other regularization so that the steps are exactly the ones you used. For some reason, the model isn't overfitting to the one-image dataset and either gives nonsensical segmentations or converges to all-zero weights. Have you had this issue? Do you have any insights? Thanks!

sptom commented 4 years ago

Thanks for the reply, Alex. I also tried running the code with the learning rate lowered by various orders of magnitude, and with different optimizers, but I have to agree with Julia here: it did not help. The phenomenon I see is that the lower I set the LR, the quicker the loss converges to 0.7 and stays there, for some reason. [image]

In some of the runs I got somewhat indicative results for the masks, but it's hardly something that could be considered 'overfitting'. [image]

Could it be that the loss function is unstable? Or perhaps you added some alteration in a recent update?

Thanks a lot, Tom

milesial commented 4 years ago

There were a lot of small tweaks in recent commits, but nothing that should affect convergence, I think. Are you using transposed convolutions or the bilinear route (the default)? The loss is just cross-entropy, so it should be pretty stable.

Do you both work on the same dataset? Have you tried with an image from the Carvana dataset?

This problem is very strange. If you feed it 100 images, does it learn something?
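For anyone unsure what the transposed-conv vs. bilinear question refers to, here is a rough sketch of the two upsampling routes. This is illustrative only, not the repo's exact upsampling block, and the skip-connection channel bookkeeping is omitted:

```python
# Rough sketch of the two 2x upsampling routes discussed (illustrative only).
import torch
import torch.nn as nn

class UpBilinear(nn.Module):
    """Parameter-free bilinear upsampling followed by a conv to reduce channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))

class UpTransposed(nn.Module):
    """Learned upsampling via a strided transposed convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, x):
        return self.up(x)

x = torch.randn(1, 64, 32, 32)
print(UpBilinear(64, 32)(x).shape)    # torch.Size([1, 32, 64, 64])
print(UpTransposed(64, 32)(x).shape)  # torch.Size([1, 32, 64, 64])
```

The bilinear route has no learnable parameters in the upsampling itself, while the transposed convolution learns its own upsampling filters.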

juliagong commented 4 years ago

I'm using the bilinear route. I don't think we're using the same dataset, but we have the same issue. I've tried feeding it 50-100 images in training and it doesn't learn properly; it stays at around 0.7 loss.

juliagong commented 4 years ago

Update: I also tried transposed convolutions and they are not working either. It's such a strange issue and I don't think I've ever encountered something like this before.

I wonder if the problem is not with the model but somehow with the training procedure. I have tried both models from this repo as well as a pretrained model from a different project, all on the same image. None of them learns.

juliagong commented 4 years ago

@milesial I spent today debugging once again by rewriting the entire training pipeline from scratch and testing on incrementally more meaningful sets of data, and ended up finding the problem! It was very sneaky.

Your train.py is actually fine, but in eval.py, net.eval() was called. However, net.train() was called only at the beginning of each epoch, while net.eval() was called every batch. So there was only meaningful training going on for one batch per epoch - no wonder! :)
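For readers following along, a minimal, self-contained sketch of the loop structure being described (illustrative toy network and names, not the exact code in train.py / eval.py):

```python
# Sketch of the bug: net.train() is set once per epoch, but the eval round
# (which calls net.eval()) runs repeatedly, leaving the net in eval mode afterwards.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.Conv2d(8, 1, 1))
optimizer = torch.optim.RMSprop(net.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

def eval_net(net):                        # stand-in for eval.py
    net.eval()                            # switches BatchNorm/Dropout to eval behaviour
    with torch.no_grad():
        net(torch.randn(1, 3, 32, 32))

for epoch in range(2):
    net.train()                           # originally: the only place train mode was set
    for step in range(4):
        imgs = torch.randn(2, 3, 32, 32)
        true_masks = torch.randint(0, 2, (2, 1, 32, 32)).float()
        loss = criterion(net(imgs), true_masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        eval_net(net)                     # leaves the net in eval mode...
        net.train()                       # ...so train mode must be restored afterwards
```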

Thanks for your help and quick response on this problem. I hope this fixes @sptom's issue as well.

sptom commented 4 years ago

Oh wow, awesome @juliagong! That sounds really sneaky! I didn't notice that net.train() was put in the epoch loop and not in the batch loop... However, I tried to solve this, first by putting net.train() into the batch loop, which unfortunately didn't help. I then also tried putting the code in eval.py inside with torch.no_grad(): instead of using net.eval(), but I didn't notice any significant effect from that. Could you say how you resolved the issue, and also which parameters you used afterwards? Was the overfitting accurate? How many epochs did it take to converge to effectively zero loss?

phper5 commented 4 years ago

> Your train.py is actually fine, but in eval.py, net.eval() was called. However, net.train() was called only at the beginning of each epoch, while net.eval() was called every batch. So there was only meaningful training going on for one batch per epoch - no wonder! :)

I think the real training code is masks_pred = net(imgs), then computing the loss and calling loss.backward(), so the net is trained every batch even though net.train() was called only at the beginning of each epoch. Am I wrong? Thanks.

sptom commented 4 years ago

@phper5, you are right about masks_pred = net(imgs), but the script does calculate the accuracy on the test set several times per epoch using eval.py.

milesial commented 4 years ago

@juliagong Thanks for your investigation! It is indeed a big mistake that has been there for a long time and needs a fix. But do the train and eval methods of the net module affect anything other than the BatchNorms here? When you say there is no meaningful training, I'm not sure, since there are still gradient updates on the other layers; it's just the BatchNorms that are broken (?)
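As an aside on that question: train()/eval() only toggle layer behaviour here (BatchNorm switches between batch statistics and its running statistics; Dropout, if present, is disabled), while gradient computation itself is unaffected. A small illustrative check:

```python
# eval() changes how BatchNorm normalises, but gradients still flow in eval mode.
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(4)
x = torch.randn(8, 4, 16, 16) * 5 + 3       # statistics far from the default running stats

bn.train()
out_train = bn(x)                           # normalised with the current batch statistics
bn.eval()
out_eval = bn(x)                            # normalised with the (barely updated) running stats
print(out_train.mean().item(), out_eval.mean().item())   # noticeably different

out_eval.sum().backward()                   # gradients are still computed in eval mode
print(bn.weight.grad is not None)           # True
```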

Thanks to all of you for participating in this

phper5 commented 4 years ago

> @phper5, you are right about masks_pred = net(imgs), but the script does calculate the accuracy on the test set several times per epoch using eval.py.

Yes, you are right. Sorry, I didn't look carefully.

sptom commented 4 years ago

@juliagong could you please share how you resolved the issue? Because I did the above and it doesn't seem to help; the loss still converges to around 0.6.

gboy2019 commented 4 years ago

> @juliagong could you please share how you resolved the issue? Because I did the above and it doesn't seem to help; the loss still converges to around 0.6.

Me too: sometimes 0.7, sometimes 0.6, sometimes 0.8! So how do we fix this issue?

ProfessorHuang commented 4 years ago

> Your train.py is actually fine, but in eval.py, net.eval() was called. However, net.train() was called only at the beginning of each epoch, while net.eval() was called every batch. So there was only meaningful training going on for one batch per epoch - no wonder! :)

Invaluable finding! Thank you very much. milesial's code is pretty; I think I understand it, but I couldn't figure out why it didn't work on my dataset. Today, when I changed the code as you said, everything became good.

milesial commented 4 years ago

Hi all, I modified the code to switch back to train mode in https://github.com/milesial/Pytorch-UNet/commit/773ef215d41c1f36dc0ed4159c12df89d792fbc3

sptom commented 4 years ago

Thanks, @milesial, for your update. However, I'm sorry to say that this did not resolve the issue; the loss is still stuck around 0.6. I tried reducing the dataset to 11 copies of a single image, and the loss is now stuck at 0.3:

[image]

[image]

The above is the result after running 15 epochs on 11 copies of the same image!

Were you able to overfit a set of 10 different images? If so, how many epochs did it take, and with which parameters?

@ProfessorHuang, what exactly did you change in the code?

gboy2019 commented 4 years ago

> Were you able to overfit a set of 10 different images? If so, how many epochs did it take, and with which parameters?

Could you share your code? My loss is sometimes 1e+3, which is terrible, so could you please show your code?

shilei2403 commented 3 years ago

> Invaluable finding! Thank you very much. milesial's code is pretty; I think I understand it, but I couldn't figure out why it didn't work on my dataset. Today, when I changed the code as you said, everything became good.

How did you change the code? Could you show the details?

karlita101 commented 3 years ago

Hi there, I am also wondering whether there have been any updates, or whether anyone is willing to share how they overcame this issue.

Very much appreciated!

@juliagong @ProfessorHuang

Li-Wei-NCKU commented 3 years ago

> Hi all, I modified the code to switch back to train mode in 773ef21

The modification makes sense to me logically. However, the loss is still stuck and the model couldn't overfit a small dataset with only 20 images. Are there any suggestions? @milesial @juliagong @ProfessorHuang

AJSVB commented 3 years ago

I might be wrong, but I had a similar issue, and reducing the weight decay and momentum helped me overfit.
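In case it helps, a sketch of what that change looks like when constructing the optimizer; the optimizer type and the "before" values here are assumptions for illustration, not necessarily what train.py uses:

```python
# Illustrative: reduce weight decay and momentum when building the optimizer.
import torch.nn as nn
import torch.optim as optim

net = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # stand-in for the U-Net
optimizer = optim.RMSprop(net.parameters(), lr=1e-5,
                          weight_decay=0.0,       # reduced from e.g. 1e-8
                          momentum=0.0)           # reduced from e.g. 0.9
```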

rgkannan676 commented 3 years ago

Hi,

For small datasets, reducing the evaluation frequency reduced the training loss for me. This avoids the learning rate becoming a very small value after only a few steps.

Example.

```python
# Evaluation round
if global_step % (n_train // (0.25 * batch_size)) == 0:
```
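For context on why this helps: if I read train.py correctly, the learning-rate scheduler is a ReduceLROnPlateau stepped once per evaluation round, so evaluating many times per epoch against a small, noisy validation set gives it many chances to decay the LR very early. A rough illustration (not the repo's exact code):

```python
# Each evaluation round steps the plateau scheduler, so frequent evaluation with a
# stagnant validation score shrinks the learning rate long before the net has converged.
import torch.nn as nn
import torch.optim as optim

net = nn.Conv2d(3, 1, 3, padding=1)               # stand-in for the U-Net
optimizer = optim.RMSprop(net.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'max', patience=2)

evals_per_epoch = 10                              # frequent evaluation
for epoch in range(3):
    for _ in range(evals_per_epoch):
        val_score = 0.3                           # stand-in for a stagnant Dice score
        scheduler.step(val_score)                 # each call counts toward the patience budget
    print(epoch, optimizer.param_groups[0]['lr'])
# The LR gets cut by 10x repeatedly within a few epochs; evaluating less often delays this.
```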
Flyingdog-Huang commented 3 years ago

> For small datasets, reducing the evaluation frequency reduced the training loss for me. This avoids the learning rate becoming a very small value after only a few steps.

@rgkannan676 Thanks a lot, this approach makes the loss value normal on a small dataset for me. I would like to know why it avoids the LR becoming small after a few steps, and now I will think about how to reduce the loss shaking like this:

[image]

I hope to receive your reply. Thanks again @rgkannan676

Flyingdog-Huang commented 3 years ago

Maybe I found the reason: [image]

k-nayak commented 3 years ago

> For small datasets, reducing the evaluation frequency reduced the training loss for me. This avoids the learning rate becoming a very small value after only a few steps.

I have a dataset of 260 images, and the 0.25 factor did help significantly to reduce the loss, but the Dice coefficient has remained stagnant at 0.36. Is there any way to improve the Dice score? It is unable to generalize when given a new image.

Flyingdog-Huang commented 3 years ago

> I have a dataset of 260 images, and the 0.25 factor did help significantly to reduce the loss, but the Dice coefficient has remained stagnant at 0.36. Is there any way to improve the Dice score? It is unable to generalize when given a new image.

I have also met this problem, and I am thinking about a way to deal with it.

k-nayak commented 3 years ago

Same here. I will update in case I get a fix for a better Dice score. Please do update if any fix is found.

Thanks in advance.

Flyingdog-Huang commented 3 years ago

> Same here. I will update in case I get a fix for a better Dice score. Please do update if any fix is found.

Hi, what about your project? Two classes or more?

k-nayak commented 3 years ago

> Hi, what about your project? Two classes or more?

Mine is binary segmentation.

Flyingdog-Huang commented 3 years ago

> Mine is binary segmentation.

Oh, that is weird; mine is multi-class segmentation.

Flyingdog-Huang commented 3 years ago

I want to know whether the target is smaller than the background in your dataset. My situation is: my dataset is small, and the target is smaller than the background. In the training part, I compute the Dice loss including the background channel, and the loss works well. In the evaluation part, I compute the Dice score without the background channel, and the Dice score is bad. [image]

I also found that the target is about the same size as the background in this project's data, so I guess the reason the Dice score is not good enough is that we do not include the big background channel when computing it. Next I will analyze the relationship between the Dice score and a large background mathematically.
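To make that concrete, here is a small sketch (illustrative, not the repo's eval.py) of how including or excluding the background channel changes a multi-class Dice score when the foreground object is tiny:

```python
# Multi-class Dice with and without the background channel, for a tiny foreground object.
import torch

def dice_per_channel(pred, target, eps=1e-6):
    # pred, target: (C, H, W) one-hot masks
    dims = (1, 2)
    inter = (pred * target).sum(dims)
    union = pred.sum(dims) + target.sum(dims)
    return (2 * inter + eps) / (union + eps)

H = W = 100
target = torch.zeros(2, H, W)          # channel 0 = background, channel 1 = object
target[1, 45:55, 45:55] = 1            # a 10x10 object
target[0] = 1 - target[1]

pred = torch.zeros_like(target)        # a prediction that misses half the object
pred[1, 45:55, 45:50] = 1
pred[0] = 1 - pred[1]

d = dice_per_channel(pred, target)
print('per-channel Dice:', d.tolist())
print('mean incl. background:', d.mean().item())    # dominated by the easy background
print('foreground only:', d[1:].mean().item())      # much lower for a tiny object
```

With a tiny object, the near-perfect background channel dominates the mean, which matches the observation above.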

k-nayak commented 3 years ago

Very interesting approach, @Flyingdog-Huang. My dataset is small as well, and the target is small compared to the whole image: water droplets, which are difficult to distinguish at times and which also make mask creation hard. Depending on the lighting conditions the data tends to be better or worse. The model is sometimes unable to tell whether a droplet is present, since it is transparent and lacks robust edges most of the time. I believe the problem is the dataset itself. I am using the model for real-time segmentation in a video, and the results are not very good. Attention U-Net performed a little better than U-Net, and I am checking if residual Attention U-Net can perform better. My Dice score is around 0.73 with a loss of 0.20.