thtrieu / darkflow

Translate darknet to tensorflow. Load trained weights, retrain/fine-tune using tensorflow, export constant graph def to mobile devices
GNU General Public License v3.0
6.13k stars 2.08k forks source link

No BoundingBoxes while using self trained data #473

Open kronnox opened 6 years ago

kronnox commented 6 years ago

First of all, I don’t know if this is the right place to ask for help when facing a problem, but I don’t really know where else to ask questions regarding darkflow.

First I tested darkflow on various images, videos, and live cameras with the pretrained weight-files and was really impressed with the performance it could deliver. However, I didn't succeed, no matter what I tried, to generate bounding boxes based on the data trained by myself. So far, I have only been able to run all test runs while using GPU-accelerated training, because pure training via the CPU would simply take too much time.

I have changed the following training parameters multiple times, hoping that they will at least provide some sort of output:

But none of these changes seemed to show any effect on the output. The JSON-Output always was empty, even on images used for the training itself.

At this point I even assumed a possible faulty installation and reinstalled darkflow + prerequisites not only on my current Windows 8.1 setup but also on a freshly installed Ubuntu 16.04 system.

I really don’t know what I’m doing wrong, or what I could try next. Thanks a lot!

——————————————— Ubuntu 16.04 LTS Nvidia GeForce 1070 cuda Toolkit 8.0 cuDNN 6.0 tensorflow 1.4.1 tensorflow-gpu 1.4.1 ———————————————

joaopalma5 commented 6 years ago

You have news? You can put here your command to see what you are doing? LR? Load from?

kronnox commented 6 years ago

First, I start the training process with the following command (or a similar one): flow --model cfg/tiny-yolo-voc-new.cfg --load bin/tiny-yolo-voc.weights --train --dataset "VOCdevkit/VOC2012/JPEGImages" --annotation "VOCdevkit/VOC2012/Annotations" --gpu 0.7

After the loss decreased to ~0.05 or even NaN, depending on how many steps I run it for, I'm stopping the training process by pressing ctrl+c.

Afterwards I 'flow the graph' with the following command: flow --imgdir sample_img/ --model cfg/tiny-yolo-voc-new.cfg --load -1 --gpu 0.7 --json

As a JSON-output it's always returning an empty list: []

joaopalma5 commented 6 years ago

Try this: https://github.com/thtrieu/darkflow/issues/80#issuecomment-303931206

And don't forget to say the result here!

kronnox commented 6 years ago

I'm actually kind of surprised right now. I already tried to overfit multiple times in the past; it somehow never worked out for me. As I just tried again with only one class and a dataset of five images, it showed reasonable results after only 16k steps.

Yeah, I guess my Issue is therefor resolved. I will probably continue overfitting until I am able to train on the whole VOC2012 dataset.

Thanks!

joaopalma5 commented 6 years ago

@Kronnox I don't know if this works well. I am running my train, if you can say something how are your results

kronnox commented 6 years ago

@joaopalma5 You're probably right that I'm still doing something wrong. After I tried to continue training on the whole dataset, with the previous overfitted data preloaded, the loss was at ~5 (which didn't seem wrong to me, because of the much bigger dataset than before). On the following 200k steps (~12h) a steady decrease was mentionable (the loss decreased to ~1.7). At about 210k steps it rapidly jumped from the ~1.7 to a constant NaN.

Afterwards I killed the training process to check on the results. Unfortunately I couldn't get any BoundingBoxes to show up.

How do I have to continue after overfitting? Or do I just have to train it for a longer period (even tho it seems useless to me since the loss is already NaN...)?

joaopalma5 commented 6 years ago

I think that you have to overfitting with 3~5 images, and test the results(with the same images of train) when you get nice results. You can move to a large dataset with all images.

kronnox commented 6 years ago

That was pretty much what I have done. After I overfit it with 5 images, it afterwards showed reasonable results on those images. Therefor I tried to continue with a lager dataset and trained until the loss moved to NaN. But the JSON-output is empty again, on all of my test images.

I have attached a TensorBoard screenshot to show my loss throughout the whole training-session. bildschirmfoto vom 2017-12-21 14-20-23

joaopalma5 commented 6 years ago

@Kronnox You do this: "I recommend disabling noise augmentation during this overfit step by setting argument allobj = None in https://github.com/thtrieu/darkflow/blob/master/net/yolo/data.py#L69, setting learning rate smaller (say 1e-5) and try overfitting again."

kronnox commented 6 years ago

@joaopalma5 Overfitting seems to work quite nice actually. I can get it to recognize the objects on the images I used for training and sometimes it even detects them on other images, not included in my training set. But I don't really know how I continue after overfitting. I want to train on a bigger dataset than just these 5 images per class.

How long do I have to overfit? Do I just change the dataset and continue training with the previous checkpoints preloaded?

offchan42 commented 6 years ago

I have this no bounding box problem as well. I train it on 20 images of license plates. Only one class. When I do the flow part, I even set the threshold to 0 to force the program to output all bounding boxes possible. But there is no bounding boxes at all. ./flow --model cfg/tiny-yolo-voc-lp.cfg --load 300 --labels data/labels.txt --imgdir data/car --json --threshold 0 The loss is about 1.0. Any ideas on how can I show all bounding boxes?

kronnox commented 6 years ago

@off99555 have you already tried to overfit? Just try it with 2 or 3 images and train until your loss stops decreasing. Now you should see detections on the images you used to train with (but most likely not on others!). Then just add one new image a time to the training set and start training again. Now just repeat this procedure and keep adding more images. Also feel free to add multiple images at once as long as you don't add too many at a time in relation to your current training dataset size (As a rule of thumb I always added half as many images as I previously used). After a while you may also see your first correct detections on other images that aren't in the dataset.

In my experience it can be fairly hard to get reasonable results with only one class, as your loss has to be really small before you will see any correct results. With one class I would say you will probably see results around loss ~0.1 or loss ~0.01 even tho it seems to be different every time...

As mentioned in #462 the right batch size and learning rate may also help to improve your results.

offchan42 commented 6 years ago

@Kronnox I saw the detection now by reducing the learning rate to 1e-5 and batch 20. The loss was around 0.5 Now the question is, how do I show all bounding boxes in the image, I want hundreds of them. I tried setting --threshold to 0 but it still shows the same number of bounding boxes.

kronnox commented 6 years ago

@off99555 So, just to clarify, you just got your detection of the license plates? Only on your test images itself or also on others?

Why would you want more detections other than the correct ones you trained it on? I don't quite get what you're stuck at.

offchan42 commented 6 years ago

@Kronnox I got the detection of the license plate, but its accuracy is not good. The bounding box covers too large, sometimes it covers the entire car. But that should be fixed by decreasing the loss further.

My concern is, why can't I report all "suspect" bounding boxes of the network. Why setting the threshold to 0 still does not show any bounding boxes. Isn't threshold 0 supposed to show all the suspect bounding boxes in the image? Example: http://machinethink.net/images/yolo/Scores@2x.png Am I understanding the threshold incorrectly?

ron-weiner commented 6 years ago

@Kronnox @off99555 I got the same issue and I don't understand what do I miss here. I attached the training log. I reached "moving ave loss 2.7222160140453253" No matter which checkpoint i take, and which threshold i choose, the are not boundary boxes at all.

I am trying now the "over fitting" approach as suggested, starting with a only few images and then add more, time after time.

But still, i don't understand why it happens this way.

tiny-voc-class11.txt

kronnox commented 6 years ago

@ron-weiner Overfitting should at least get you some bounding boxes as a starting point. You may also want to try varying the learning rate and/or the batch size. A higher learning rate should in theory result in a faster learning progress but also in more inaccurate results. Since I reached a loss of NaN quite often during my training sessions, I chose a lower learning rate and a smaller batch size to avoid the problem. Unfortunately, the only way (at least as far as I know) to find the right values for your project is to try it out on your own.

//EDIT: And of course try a lower threshold in the beginning while flowing the graph. For example --threshold 0.3. It may also result in more false positives. As soon as you get reasonable results, just increase the value again.

ron-weiner commented 6 years ago

Thank you @Kronnox Do you know the mathematical reason that causes "no bounding boxes" situation? I think that as long as there is no clear answer regarding it, our handling is just a shot in the darkness... For now, I'm training the model with a small dataset, (6 images), then I'm feeding it with additional 4 and etc - it is working until now but of course that it is not effective nor accurate enough as a whole...

kronnox commented 6 years ago

@ron-weiner I'm also kind of interested in why this problem is occurring. I don't quite understand why we are seeing no BBoxes at all, rather than seeing a lot of false positives... Perhaps I'll have a closer look into it in the future.

As far as I could observe it is possible to add several hundred pictures at once after overfitting for a while. Yet not the efficiency I would like to see. As mentioned somewhere above it seems to work out quite nicely if you always double the amount of images in your dataset.

ghost commented 6 years ago

hello, i wrote my yolo based on longcw and thtrieu, but after training network's output are 0 and inf of (box and iou prediction) and 1 class with prob always 1. no bounding boxes. do you think what problems occured? thanks!

zsfarkas commented 6 years ago

+1

vladinator1000 commented 6 years ago

I'm also seeing this problem with tiny yolo, no bounding boxes whatsoever.

liuhantao9 commented 6 years ago

@Kronnox I am having avg loss to nan after 17k steps. Although I have bounding boxes after prediction, the confidence is very low, something around 0.3. Do you know the reason?

kronnox commented 6 years ago

@Kronnox I am having avg loss to nan after 17k steps. Although I have bounding boxes after prediction, the confidence is very low, something around 0.3. Do you know the reason?

@liuhantao9 Not really. I can only advise you to follow the steps I mentioned in the posts above. I myself haven't worked with Darkflow for quite a while, so I'm probably not up to date anymore.

Sagartansar commented 5 years ago

i have some issue using darflow i trained my model upto 6500 steps and the loss i have now is ~1 to 2, and now i am trying to detect the object. the problem is with a threshold value 0.1 or above it is not detecting any thing in image. and with threshold value 0.01 or below the image is filled with boxes. i have 3 classes and i am using yolov2.cfg file. is there any solution regarding this?? need help...

`[net]

Testing

batch=1 subdivisions=1

Training

batch=64

subdivisions=8

width=608 height=608 channels=3 momentum=0.9 decay=0.0005 angle=0 saturation = 1.5 exposure = 1.5 hue=.1

learning_rate=0.001 burn_in=1000 max_batches = 500200 policy=steps steps=400000,450000 scales=.1,.1

[convolutional] batch_normalize=1 filters=32 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky

#######

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[route] layers=-9

[convolutional] batch_normalize=1 size=1 stride=1 pad=1 filters=64 activation=leaky

[reorg] stride=2

[route] layers=-1,-4

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] size=1 stride=1 pad=1 filters=40 activation=linear

[region] anchors = 0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828 bias_match=1 classes=3 coords=4 num=5 softmax=1 jitter=.3 rescore=1

object_scale=5 noobject_scale=1 class_scale=1 coord_scale=1

absolute=1 thresh = .1 random=1`