[solution needed] training with provided fcenet config on custom datasets: loss is nan

yue-cheng-wind commented 2 years ago

Hi: I have a text detection dataset of ~326 images, with 80% used for training and 20% for testing. I adapted the example dbnet configuration for training and set 50 epochs just to check that training progresses properly. However, the loss diverges to a very large value early in the 1st epoch, so I believe something is wrong with my configuration, but I do not know how to modify it. I've tried changing the learning rate from 0.01 to 0.1, but the same issue persists.

Below is my log file: 20220415_185058.log

FYI below are my annotation files: 7fd67477-9c1c-4267-b54b-0fc56ee32ad8_train.txt 7fd67477-9c1c-4267-b54b-0fc56ee32ad8_test.txt

Could you suggest any possible direction for debugging? Please let me know if you need any other information!

Thank you so much for the help!

xinke-wang commented 2 years ago

This usually happens because the learning rate is too large for your batch size. Please try a smaller learning rate.
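As a rough illustration of matching the learning rate to the batch size, here is a minimal sketch of the common linear scaling heuristic: scale a reference learning rate by the ratio of your total batch size to the reference one. The helper name `scale_learning_rate` and all the numbers are hypothetical and are not taken from the provided config. The `optimizer` dict assumes an MMCV 1.x-style config, so check the field names against your own config file.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linearly scale a reference learning rate to a different total batch size."""
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# Hypothetical values: a reference config tuned for a total batch size of 64
# (for example 8 GPUs x 8 images per GPU), reused on 1 GPU x 8 images.
reference_lr = 0.008
reference_batch_size = 64
my_batch_size = 8

scaled_lr = scale_learning_rate(reference_lr, reference_batch_size, my_batch_size)

# Assumed MMCV 1.x-style optimizer entry; momentum and weight_decay are
# placeholder values, not recommendations.
optimizer = dict(type='SGD', lr=scaled_lr, momentum=0.9, weight_decay=0.0005)

print(f"scaled learning rate: {scaled_lr}")
```

This is only a starting point. If the loss still diverges, lowering the learning rate further by a factor of 10 is a simple next step.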

yue-cheng-wind commented 2 years ago

@xinke-wang Hi Xinke: Thank you so much for your suggestion; it's working now. I'll close this issue.

open-mmlab / mmocr

[solution needed] training with provided fcenet config on custom datasets: loss is nan #939