qqwweee / keras-yolo3

A Keras implementation of YOLOv3 (Tensorflow backend)
MIT License

Problem during training my own model, val loss=nan with RuntimeWarning #119

Open franklu323 opened 6 years ago

franklu323 commented 6 years ago

Hi, I'm running into an issue while training my own model. I have only 1 class, so I changed the cfg file to classes=1 and filters=18 and converted it to an h5 file. But during training, the validation loss stays NaN (with both the old and the new version of the code). It worked fine before, and I haven't changed any of the code.

1. On the new version of the code, training runs with a RuntimeWarning and does not stop before 50 epochs; it only stopped during fine-tuning (out of memory).
2. On the previous version of the code, it stopped before 10 epochs (early stopping) with NaN validation loss.

Can anyone help me with this? I convert the weights with 'python convert.py -w yolov3.cfg yolov3.weights model_data/yolo_weights.h5', and I change all the relevant class numbers and filters before training. Thanks.
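
For reference, the cfg edit described above is the usual single-class change: each of the three [yolo] blocks in yolov3.cfg gets classes=1, and the [convolutional] block immediately before each [yolo] block gets filters = 3 * (classes + 5) = 18. A minimal, abridged sketch of one of those three spots (not the full file):

```ini
# One of the three detection heads in yolov3.cfg (abridged);
# the same edit is repeated for all three [yolo] blocks.
[convolutional]
size=1
stride=1
pad=1
# was 255; 3 * (classes + 5) = 3 * (1 + 5) = 18
filters=18
activation=linear

[yolo]
mask = 6,7,8
# was 80
classes=1
num=9
```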

Pichairen commented 6 years ago

I ran into the same problem: when I change the cfg file and then convert the weights with 'python convert.py -w yolov3.cfg yolov3.weights model_data/yolo_weights.h5', training fails. It runs fine with a yolo_weights.h5 converted from an unmodified cfg file, but then the output scores are very low.

franklu323 commented 6 years ago

Same here. Previously the code worked fine even after I changed the cfg file, but now it just keeps giving me a runtime error and val_loss stays NaN for the whole training run. If I convert the weights with the original cfg, it works. If you solve this problem, please let me know too. Thank you very much. @18814181500 @qqwweee

Pichairen commented 6 years ago

@franklu323 After training in many different setups, it now works well when I load yolo_weights.h5 with the earlier layers frozen. I tried loading (and not loading) darknet53_weights.h5, with and without freezing the earlier layers, and got four different sets of trained weights, but none of them could draw a good box and the scores were always very low (<0.5); I can't tell why. Then I loaded yolo_weights with the layers frozen and got a nice score and box; loading yolo_weights without freezing the layers gives slightly worse output, particularly with multiple objects in one picture. By the way, my dataset is just 183 pictures, I train for 500 epochs, and I only train one class.
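
For readers comparing the two regimes described above, here is a rough sketch of "load yolo_weights.h5 and freeze the earlier layers" in Keras. It is only an illustration, not the repo's exact train.py; the number of trainable tail layers and the learning rate are assumptions.

```python
from keras.optimizers import Adam

def freeze_backbone(model, trainable_tail=3, lr=1e-3):
    """Freeze all but the last few layers (illustrative count), then recompile
    so the new trainable flags take effect before stage-1 training."""
    for layer in model.layers[:-trainable_tail]:
        layer.trainable = False
    for layer in model.layers[-trainable_tail:]:
        layer.trainable = True
    # As in this repo's training setup, the model's output is the loss itself,
    # so an identity loss is used here.
    model.compile(optimizer=Adam(lr=lr),
                  loss={'yolo_loss': lambda y_true, y_pred: y_pred})
    return model
```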

JingangLang commented 6 years ago

If I don't modify the cfg file and just convert the model and then train, there is no problem at all! But after modifying the cfg, the training loss is normal while the validation loss stays NaN! What could be going on? I'm using the latest code. I also looked at the similar issues that were closed earlier, and they don't solve it. Thanks @qqwweee

warsi22 commented 6 years ago

I am also getting the same error. Can anyone help? :(

warsi22 commented 6 years ago

@franklu323 Did you get it resolved?

hxy1051653358 commented 5 years ago

@JingangLang I am also getting the same error. Did you get it resolved?

ZhiweiDuan commented 5 years ago

> If I don't modify the cfg file and just convert the model and then train, there is no problem at all! But after modifying the cfg, the training loss is normal while the validation loss stays NaN! What could be going on? I'm using the latest code. I also looked at the similar issues that were closed earlier, and they don't solve it. Thanks

@JingangLang Has this problem been solved? Hoping for a reply, thanks.

iris-qq commented 5 years ago

@JingangLang I ran into this problem too: after modifying the config file, the loss is normal but val_loss stays NaN. I don't know where the problem is. Have you solved it? Any advice would be appreciated. Thanks.

franklu323 commented 5 years ago

I didn't solve this problem; it seems you don't need to change the cfg file, and I don't know why.

iris-qq commented 5 years ago

> I didn't solve this problem; it seems you don't need to change the cfg file, and I don't know why.

Are you sure? In fact, that's what I did the first time, but the model wasn't effective.

enoceanwei commented 5 years ago

@franklu323

Sorry to disturb you,

I am running into the same problem as you. With the original yolov3.cfg file it works normally and the val_loss is pretty good, but the model's detection scores are very low (<0.3) after the stage-1 training (50 epochs).

On the other hand, when I changed yolov3.cfg and revised the classes, filters, etc., the val_loss becomes NaN. I have no idea why. Have you solved it?

I am looking forward to your reply, many thanks.

Kind regards

Wei

bodhwani commented 5 years ago

Same issue. val_loss is coming out as NaN.

MinghuiJ commented 5 years ago

I ran into this problem too and can't solve it. val_loss is NaN. Could you give us some explanation? Best wishes to you. @qqwweee

lingshaokun commented 5 years ago

> If I don't modify the cfg file and just convert the model and then train, there is no problem at all! But after modifying the cfg, the training loss is normal while the validation loss stays NaN! What could be going on? I'm using the latest code. I also looked at the similar issues that were closed earlier, and they don't solve it. Thanks @qqwweee

If you don't modify the cfg file, doesn't loading the h5 file throw an error when training on your own data?

dmytro-kushnir commented 4 years ago

I got the same problem with a custom yolov3 config (2 classes) and the default anchors. I set up the config according to this instruction: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects and get val_loss = NaN and very bad training.

On the other hand, with the default cfg everything looks fine, but then I can't load the model, as described in this issue: https://github.com/qqwweee/keras-yolo3/issues/48 @qqwweee @franklu323 @bodhwani @MinghuiJ
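
As a quick sanity check on configs like this (standard YOLOv3 bookkeeping, not anything specific to this repo): the [convolutional] layer before each [yolo] block needs filters = anchors_per_scale * (classes + 5), so 18 for one class and 21 for two; a mismatch there is a common cause of shape errors or broken training.

```python
def yolo_head_filters(num_classes, anchors_per_scale=3):
    """filters = anchors_per_scale * (4 box coords + 1 objectness + num_classes)."""
    return anchors_per_scale * (num_classes + 5)

print(yolo_head_filters(1))  # 18 -> the single-class setups in this thread
print(yolo_head_filters(2))  # 21 -> a two-class config like the one above
```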

nyaang commented 4 years ago

Same problem here too. @qqwweee

thanif commented 4 years ago

I was experiencing the same problem while using tiny-yolo, and the reason was that the weights were wrong. I downloaded yolov3-tiny.weights from darknet along with yolov3-tiny.cfg and converted them using the command:

python3 convert.py yolov3-tiny.cfg yolov3-tiny.weights tiny-yolo.h5

The problem was resolved.

amalbros commented 4 years ago

I am also facing the same issue. None of the suggestions worked. Any help would be appreciated. @qqwweee

musicshmily commented 4 years ago

When I change the .cfg file I run into the same problem: val_loss is NaN. But when I force training into the second stage (unfreeze all the layers), the val_loss goes back to normal, and if I then take those weights back to training stage 1, it is OK.
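
A hedged sketch of the workaround described above, skipping straight to the unfrozen stage: unfreeze every layer, recompile, and train. The optimizer settings and generator arguments are illustrative, not the repo's exact values.

```python
from keras.optimizers import Adam

def train_all_unfrozen(model, train_gen, val_gen, steps, val_steps, epochs=50):
    """Unfreeze every layer and train directly in 'stage 2' mode."""
    for layer in model.layers:
        layer.trainable = True
    # Recompile so the trainable flags apply; the model outputs the loss itself.
    model.compile(optimizer=Adam(lr=1e-4),
                  loss={'yolo_loss': lambda y_true, y_pred: y_pred})
    model.fit_generator(train_gen, steps_per_epoch=steps,
                        validation_data=val_gen, validation_steps=val_steps,
                        epochs=epochs)
    return model
```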