qqwweee / keras-yolo3

A Keras implementation of YOLOv3 (Tensorflow backend)
MIT License

Low val_loss but couldn't find anything #512

Open derekhsu opened 5 years ago

derekhsu commented 5 years ago

I have a question about image detection. I trained on my own dataset and saved several sets of weights. Some of them have a very high val_loss and some a very low one, such as 17.504.

When I used them to predict on my images, I found the weights with a large val_loss could find some things in a training image (though very inaccurately), while the weights with a very low val_loss couldn't find anything in the same image, even when I tuned the thresholds down to a very low value. Can anyone help? What causes this situation?

IrisDinge commented 5 years ago

My bbox input had been normalized before training the model, so I changed the bbox input type from int to float.

But I got a similar problem. Although I got a loss of 9.645, nothing was found in the test set. Even when I ran the model on the training set, nothing was found. Basically, when testing on the training set there must be some result, right? I'm confused too. @qqwweee could you please offer some ideas?
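For context, train.txt in this repo expects absolute pixel coordinates as ints, not normalized [0, 1] floats. If your boxes were normalized, converting them back might look like this (a sketch with a hypothetical helper, not code from the repo):

```python
# Hypothetical helper: convert normalized [0, 1] box coordinates back to the
# absolute-pixel x_min,y_min,x_max,y_max ints that train.txt expects.
def denormalize_box(box, img_w, img_h):
    x_min, y_min, x_max, y_max = box
    return (round(x_min * img_w), round(y_min * img_h),
            round(x_max * img_w), round(y_max * img_h))

# Example: a normalized box on a 416x416 image
print(denormalize_box((0.25, 0.25, 0.75, 0.5), 416, 416))  # (104, 104, 312, 208)
```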

IrisDinge commented 5 years ago

OK, an update. I started training on my own dataset from the very beginning, without pretrained weights; the loss was around yours. However, I found several bugs in my annotation files and fixed them all (float back to int, no normalization, and the order x_min, y_min, x_max, y_max). Then I tried different learning rates. With the score threshold set to 0, I got some detections back, but they were useless; still better than before. I also cancelled the early-stopping callback.

Next, try kmeans.py instead of the default anchor box sizes. Maybe that works?
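The kmeans.py suggestion boils down to running IoU-based k-means on the (width, height) of your ground-truth boxes. A simplified, self-contained sketch of the idea (not the repo's exact implementation):

```python
import random

def iou(box, clusters):
    # box: (w, h); clusters: list of (w, h). IoU computed as if the boxes
    # shared a top-left corner, which is the standard trick for anchor k-means.
    w, h = box
    return [min(w, cw) * min(h, ch) / (w * h + cw * ch - min(w, cw) * min(h, ch))
            for cw, ch in clusters]

def kmeans_anchors(boxes, k, iters=100, seed=0):
    random.seed(seed)
    clusters = random.sample(boxes, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for b in boxes:
            ious = iou(b, clusters)
            groups[ious.index(max(ious))].append(b)  # min 1 - IoU distance
        # Move each cluster to the mean (w, h) of its assigned boxes
        clusters = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g))
            if g else clusters[i]
            for i, g in enumerate(groups)
        ]
    return sorted(clusters, key=lambda c: c[0] * c[1])

# Toy example: two small boxes and two large ones split into two anchors
anchors = kmeans_anchors([(10, 10), (12, 12), (100, 100), (110, 90)], 2)
print(anchors)  # [(11.0, 11.0), (105.0, 95.0)]
```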

boreas-l commented 5 years ago

> My bbox input had been normalized before training the model, so I changed the bbox input type from int to float.
>
> But I got a similar problem. Although I got a loss of 9.645, nothing was found in the test set. Even when I ran the model on the training set, nothing was found. Basically, when testing on the training set there must be some result, right? I'm confused too. @qqwweee could you please offer some ideas?

Hi, maybe you can check your prediction code in detail, and print the boxes from the generate function, to check whether the output format is as expected.

remingm commented 5 years ago

I'm having the same issue. The weights appear to train well but output 0 confidence for all classes, even when running yolo_video.py --image on already-seen training data. Things I have tried:

  1. Lowered the score threshold in yolo.py as low as 0.0.
  2. Used random sampling to find the optimal batch size and learning rates for my data.
  3. Verified that train.txt is formatted correctly.
  4. Plotted the boxes from train.txt to verify that they are correct.
  5. Adjusted the number of epochs for training, early_stopping, and learning rate decay.
  6. Changed random=True to random=False when calling get_random_data() in data_generator() in train.py.
  7. Training without pre-trained yolo weights.
  8. Using kmeans.py to generate custom anchors. This ran for several days without any output, so I terminated it. Was that premature? I am using the default anchors, which have always worked for me in darknet.
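Steps 3 and 4 above can be partly automated. A hedged sketch, assuming the repo's `path x_min,y_min,x_max,y_max,class_id ...` line format (the helper name is my own, not from the repo):

```python
def check_annotation_line(line):
    """Validate one train.txt line: 'path x1,y1,x2,y2,cls x1,y1,x2,y2,cls ...'."""
    parts = line.split()
    path, boxes = parts[0], parts[1:]
    problems = []
    for i, b in enumerate(boxes):
        x1, y1, x2, y2, cls = map(int, b.split(','))
        if x1 >= x2 or y1 >= y2:
            # Zero- or negative-area boxes silently poison training
            problems.append(f'box {i}: degenerate ({x1},{y1},{x2},{y2})')
        if min(x1, y1) < 0:
            problems.append(f'box {i}: negative coordinate')
    return path, problems

# The second box below has x_min >= x_max, so it is reported
path, problems = check_annotation_line('img/001.jpg 10,20,110,120,0 50,50,40,80,1')
print(problems)  # ['box 1: degenerate (50,50,40,80)']
```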

Until I tried 7 and 8 together, I had only gotten scores of 0 for all classes. After disabling random image transformations, I am getting low scores between 0.01 and 0.03, but for classes that aren't present in the image. The boxes are also entirely wrong and appear to be centered on the image.

I will post if I figure this out; I'm still considering that I may be at fault somewhere. My training set has 1.7 million images, so I believe there is adequate data. Can anyone report successful training?

I have trained in darknet before, I may resort to that and then convert the weights.

datduonguva commented 5 years ago

If you guys are opening images using cv2 rather than PIL, it will not work unless you put the color channels in the correct order (cv2 loads BGR, while the model expects RGB):

import cv2

im = cv2.imread(path)  # loads channels as BGR
im = im[..., ::-1]     # reverse the channel axis: BGR -> RGB

remingm commented 5 years ago

Thanks, it looks like I'm using PIL during both inference (detect_img in yolo_video.py) and training (utils.py).

remingm commented 5 years ago
> 6. Changed random=True to random=False when calling get_random_data() in data_generator() in train.py.
>
> After disabling random image transformations, I am getting low scores between 0.01 and 0.03, but for classes that aren't present in the image.

The slight improvement may have happened because this also enables letterboxing for training images. Letterboxing is done during inference but not training by default. See #319 for more.
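For reference, letterboxing scales the image to fit the model input without distortion and pads the remaining border (gray in this repo). The geometry can be sketched without any image library:

```python
def letterbox_geometry(iw, ih, w, h):
    """Compute the resized size and padding offsets used by letterboxing:
    scale the (iw, ih) image to fit inside (w, h) preserving aspect ratio,
    then center it; the remaining border gets padded."""
    scale = min(w / iw, h / ih)
    nw, nh = int(iw * scale), int(ih * scale)
    dx, dy = (w - nw) // 2, (h - nh) // 2
    return nw, nh, dx, dy

# A 640x480 photo letterboxed into a 416x416 model input: the image is
# scaled to 416x312 and padded with 52 pixels above and below.
print(letterbox_geometry(640, 480, 416, 416))  # (416, 312, 0, 52)
```

If training skips this step while inference applies it, the box coordinates the model learned no longer line up with the letterboxed input, which is consistent with the off-center boxes described above.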

remingm commented 5 years ago

Update: Raising the learning rate decay patience in train.py (ReduceLROnPlateau) to 10 and disabling early_stopping allowed training to reach a lower loss and I saw more responsiveness from the weights.

I also fixed a bug in my data pipeline that was only writing one box per image to train.txt. I am retraining now, but this should lead to further improvements.
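The one-box-per-image bug is worth double-checking in any custom pipeline: each image should contribute a single train.txt line containing all of its boxes. A sketch with a hypothetical helper (not code from the repo):

```python
def annotation_line(image_path, boxes):
    """Join ALL of an image's boxes into one train.txt line.
    boxes: list of (x_min, y_min, x_max, y_max, class_id) tuples."""
    return image_path + ' ' + ' '.join(
        ','.join(str(v) for v in box) for box in boxes)

# Two boxes end up on ONE line, not two lines with one box each
line = annotation_line('img/001.jpg', [(10, 20, 110, 120, 0), (200, 40, 260, 90, 2)])
print(line)  # img/001.jpg 10,20,110,120,0 200,40,260,90,2
```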

remingm commented 5 years ago

Update: For the first time, I was able to train a custom model to accurately find objects, but only in the training set. This was after a short training run with only 1000 images and about 300 epochs to a training loss of 5. I deliberately overfit to a small set to try to get a response from the weights. Augmentation was off and the lr decay patience was 10. These two changes seem the most effective. I hope that these improvements will scale to my full 1.7M image train set and finally produce usable weights.

gpu-poor commented 5 years ago

@remingm I am having the same issue, and tried disabling augmentations with get_random_data(..., random=False) in train.py. After doing that I was getting 'nan' for the train/val loss. Then I tried changing yolo_loss as mentioned in https://github.com/qqwweee/keras-yolo3/issues/171#issuecomment-522287613, still with no luck. Any ideas how to resolve this?
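One common source of NaN loss (a hedged guess, not confirmed for this thread) is degenerate annotations: the YOLO loss regresses log(true_w / anchor_w), so a box with zero width or height pushes -inf into the loss, which then averages to NaN. The mechanism in miniature:

```python
import math

anchor_w = 30.0
good_target = math.log(25.0 / anchor_w)    # finite regression target for a normal box
bad_w = 0.0                                # an annotation with x_min == x_max
ratio = bad_w / anchor_w                   # 0.0
# math.log(0.0) raises in pure Python; tensor frameworks instead emit -inf,
# which then propagates through the loss as NaN.
target = math.log(ratio) if ratio > 0 else float('-inf')
print(math.isfinite(good_target), target)  # True -inf
```

Scanning train.txt for zero-area boxes before training is a cheap way to rule this out.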

remingm commented 5 years ago

@anish9208 I could never get training to converge to a usable loss, and disabling augmentation led to another error (#51).

So I abandoned training with keras-yolo3 and converted my labels to be compatible with the original Darknet YOLO. Using Darknet, I was able to easily train to a loss of ~1 with the same data that keras-yolo3 struggled to train to a loss below 10. This was with the same batch size and learning rate.

I then used convert.py to convert the darknet weights to h5 and successfully ran inference in keras-yolo3. I recommend training in Darknet and then converting the weights to keras-yolo3 for inference.
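For anyone following this route, the conversion uses the repo's convert.py as described in its README (the .cfg and .weights paths below are illustrative and come from your own Darknet training run):

```shell
# Convert Darknet weights to a Keras .h5 model, then run inference with it
python convert.py yolov3.cfg yolov3.weights model_data/yolo.h5
python yolo_video.py --image
```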

halataa commented 5 years ago

I have the same issue. I'm using this network to detect words in documents; with a loss of around 0.05 I got 0 boxes found for images in the training set (used the tiny version with default anchors). Any ideas?

remingm commented 5 years ago

@halataa I trained to a similar loss and got 0 boxes found as well. Then I noticed that my box coordinates were in the wrong order. After fixing this and trying many other solutions (see above) I was able to get some boxes at inference, but they were unusable.

I would recommend training in Darknet and then converting the weights to keras-yolo3. This led to great results with the same data. See my prior comment.

halataa commented 5 years ago

@remingm thank you so much, I just found the problem and then saw your reply :)

TechnoLenzer commented 5 years ago

@remingm Could you please give some detailed instructions on how to train using the original Darknet?

remingm commented 5 years ago

@TechnoLenzer there are comprehensive instructions here: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

duany049 commented 5 years ago

I met this problem too, and I resolved it by changing the anchor mask [[3,4,5], [1,2,3]] to [[3,4,5], [0,1,2]].
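To illustrate why this mask change matters (anchor sizes below are the standard tiny-YOLOv3 defaults): tiny YOLOv3 has 6 anchors split across two output scales, and the masks must partition indices 0-5. The buggy mask reuses index 3 and never assigns index 0, so the smallest anchor goes unused.

```python
tiny_anchors = [(10, 14), (23, 27), (37, 58), (81, 82), (135, 169), (344, 319)]

buggy_mask = [[3, 4, 5], [1, 2, 3]]   # index 3 duplicated, index 0 missing
fixed_mask = [[3, 4, 5], [0, 1, 2]]   # every anchor used exactly once

def covers_all(mask, n):
    """True if the mask uses each anchor index exactly once."""
    return sorted(i for scale in mask for i in scale) == list(range(n))

print(covers_all(buggy_mask, len(tiny_anchors)))  # False
print(covers_all(fixed_mask, len(tiny_anchors)))  # True
```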

xiongmao5320 commented 4 years ago

> @remingm thank you so much, I just found the problem and then saw your reply :)

@halataa Hi, I met the same problem. How did you solve it? Thanks.