thtrieu / darkflow

Translate darknet to tensorflow. Load trained weights, retrain/fine-tune using tensorflow, export constant graph def to mobile devices
GNU General Public License v3.0

No output boxes after training !! #80

Open khorchefB opened 7 years ago

khorchefB commented 7 years ago

Hello,

Currently I am trying to train the yolo.cfg (version 2) with 2 labels. (I want to recognise Dark Vador and Yoda in my test images.) I changed the number of classes in yolo.cfg, and I renamed it yolo-5C.cfg. So I put 2 labels in labels.txt, I created the annotation files, and finally I started the training using CPU with this command:

./flow --model/yolo-5C.cfg --load bin/yolo.weights --dataset pascal/VOCdevkit/IMG --annotation pascal/VOCdevkit/ANN --train --trainer adam

I changed the following parameters in the file flow.py:

There are 120 images (40 images with only Dark Vador, 40 images with only Yoda and 40 images with both of them) and 120 annotations.

My problem is that after 12 hours of training on CPU, when I run the test with the --test argument, it displays NO BOXES in the output images. But when I decrease the threshold to 0.00001, it displays many boxes. I want to understand how I can improve my training to get correct object detections. Can you please give me some advice?

Thanks.

thtrieu commented 7 years ago

I see you are doing YOLOv2. How much is the loss? I suspect yours has not converged.

hyzcn commented 7 years ago

@thtrieu Hi! I am also training a two-class YOLO v2 on my dataset, which has around 50000 images. I use the same settings as @kamelbouyacoub, and I trained with the pre-trained ImageNet weights downloaded from the darknet website. At first, the loss decreased rapidly over around 10 epochs, then it stayed around 1.8 ~ 2 and didn't decrease any more, with the learning rate at 1e-6 for those epochs. I wonder how long it usually takes to converge? What does a normal loss that gives meaningful output look like? Could you kindly give some reasons or improvement suggestions? Thx!

khorchefB commented 7 years ago

I retrained my graph a second time, and here is what it displays after 800 iterations with a learning rate of 1e-2:

[Screenshot "capture" of the training output; image not preserved]

Please, I need help, can you give me some advice for training.

Thanks

thtrieu commented 7 years ago

Currently I am trying to train the yolo.cfg (version 2) with 2 labels. (I want to recognise Dark Vador and Yoda in my test images.) I changed the number of classes in yolo.cfg, and I renamed it yolo-5C.cfg. So I put 2 labels in labels.txt, I created the annotation files, and finally I started the training using CPU with this command ./flow --model/yolo-5C.cfg --load bin/yolo.weights --dataset pascal/VOCdevkit/IMG --annotation pascal/VOCdevkit/ANN --train --trainer adam

If you want to work with 2 labels, then two modifications have to be made in the .cfg: [region].classes = 2 and the last convolutional layer's filter number (should be 35 instead of 425).
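The 35 and 425 above follow from how the YOLOv2 region layer is laid out: each anchor box predicts 4 coordinates, 1 objectness score and one score per class. A minimal sketch of that arithmetic, assuming the 5 anchors of the stock yolo.cfg (the function name is made up for illustration):

```python
def last_conv_filters(num_classes, num_anchors=5):
    """Filters needed in the conv layer that feeds a YOLOv2 [region] layer."""
    # per anchor: x, y, w, h, objectness (5 values) plus one score per class
    return num_anchors * (5 + num_classes)


print(last_conv_filters(80))  # COCO, 80 classes -> 425
print(last_conv_filters(2))   # two custom classes -> 35
```

If the cfg uses a different number of anchors, pass it as num_anchors; the filter count changes accordingly.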

Make sure you did the above, then please avoid training right away. First, train on a very small dataset (3~5 images) of both classes. Only when you successfully overfit this small dataset (an inexpensive end-to-end test for the whole system), then move on to training on your whole dataset.
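One way to build such a tiny dataset is to copy a few image/annotation pairs into a separate folder and point --dataset and --annotation at it. The sketch below is only an illustration: the function name and the IMG/ANN output layout are assumptions, and it assumes each annotation is an .xml file sharing the image's base name, as in PASCAL VOC.

```python
import random
import shutil
from pathlib import Path


def make_overfit_subset(img_dir, ann_dir, out_dir, n=4, seed=0):
    """Copy n image/annotation pairs into out_dir/IMG and out_dir/ANN."""
    img_dir, ann_dir, out_dir = Path(img_dir), Path(ann_dir), Path(out_dir)
    # keep only images that have a matching .xml annotation
    pairs = []
    for img in sorted(img_dir.iterdir()):
        ann = ann_dir / (img.stem + ".xml")
        if img.is_file() and ann.exists():
            pairs.append((img, ann))
    # seeded shuffle so the same subset is picked on every run
    random.Random(seed).shuffle(pairs)
    chosen = pairs[:n]
    (out_dir / "IMG").mkdir(parents=True, exist_ok=True)
    (out_dir / "ANN").mkdir(parents=True, exist_ok=True)
    for img, ann in chosen:
        shutil.copy(img, out_dir / "IMG" / img.name)
        shutil.copy(ann, out_dir / "ANN" / ann.name)
    return [img.name for img, _ in chosen]
```

The random pick does not guarantee that every class appears in the subset, so it is worth checking the chosen files by hand before training.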

If overfitting fails, I'll help you look into the details.

hyzcn commented 7 years ago

@thtrieu Hi! I'm another poster with issues similar to those mentioned in previous posts. I already changed my class number to 2 and tried to overfit the net with around 8 images; the loss converges a bit lower but still gets stuck around 1.6. I wonder whether the 3-5 images you mentioned are randomly drawn or whether there are any guidelines? Moreover, is the loss of successful overfitting around 0? Or is there any magnitude that indicates successful overfitting? I have been stuck for a few days, and thanks in advance for your reply!!

thtrieu commented 7 years ago

In my experiments, the overfitting loss can be around or smaller than 0.1. In the case of disabling noise augmentation, it can very well be near perfect 0.0.

3-5 images can be anything (randomly drawn from the training set is fine), but they should preferably contain all of your classes (e.g. with cars and dogs, the 3-5 images should include both of them instead of only one). Not being able to overfit such a small training set means the learning rate is too big, or there is a bug in the code.

I recommend disabling noise augmentation during this overfit step by setting argument allobj = None in https://github.com/thtrieu/darkflow/blob/master/net/yolo/data.py#L69, setting the learning rate smaller (say 1e-5), and trying to overfit again.

hyzcn commented 7 years ago

@thtrieu thanks for the information, I'll try on that! :+1:

andreapiso commented 7 years ago

I am retraining yolov2 on VOC 2012 with 20 classes and did not change any parameter. Loss is now at 0.01 and still cannot see any bounding box after 7000 steps. Should I just keep training or is this the sign there is an issue?

Dref360 commented 7 years ago

Have you looked at postprocess in net/yolo/test? There is a _tresh dict that may disrupt your output. I had to remove it to make it work

thtrieu commented 7 years ago

@Dref360 that dict is removed in newer versions, please update your code

@AndreaPisoni Please give the steps to reproduce your error.

hemavakade commented 7 years ago

Hi I am trying to train YoloV2 on my different dataset. I have created an annotation file as per PASCAL VOC format. I am trying to identify shoes and bags in the images. As suggested by users ( @ryansun1900 , @y22ma, @thtrieu ) on this repo I used 3-5 images and annotations to train.

I used tiny-yolo-voc.weights and tiny-yolo-voc.cfg. I changed tiny-yolo-voc.cfg for the number of classes and the filters in the last convo layer, as 2 and 35 respectively.

I used a learning rate of 1e-3.

This is the command I used to train,

./flow --train --trainer momentum --model cfg/tiny-yolo-voc-2c.cfg --load bin/tiny-yolo-voc.weights --annotation <path/to/annotation> --dataset <path/to/sampledata> --gpu 0.4

After I ran 200 epochs I got NAN in loss and moving ave loss. I printed out the output matrices while training using

fetches = [self.train_op, loss_op, self.top.out, self.top.inp.out, self.top.inp.inp.out, self.top.inp.inp.inp.out]

I looked for matrices which had values in them and found some values around step 176, so I loaded that model and reran the training with a smaller learning rate = 1e-6. I finally managed to reduce the loss to: loss 4.600135803222656 - moving ave loss 4.5986261185381885. I tried to test using the ckpt with the following command:

./flow --test <path/to/test/> --model cfg/tiny-yolo-voc-2c.cfg --load 890

But the images do not have bounding boxes.

Can you please guide me? I am not sure if I have missed a step in between.

thtrieu commented 7 years ago

I think you are doing fine. Just that the model has not converged. A trained voc model with 20 classes has loss around 4.5; so two classes should be significantly smaller than that.

And you are doing it with only 3-5 images, so I would say overfitting should be the case, i.e. loss << 1.0.

hemavakade commented 7 years ago

@thtrieu, what do you suggest in that case.

I have also disabled noise augmentation during the over-fitting.

Update: I could bring down the loss to almost 0.01. I had to use a different optimizer; RMSPROP works better. But when I test, there are still no bounding boxes. This is the command I am using.

./flow --test <path/to/test/> --model cfg/tiny-yolo-voc-2c.cfg --load -1 --gpu 0.4

I checked the output of the box probabilities and they are very low, in the order of < 1e-3. For the purpose of testing if I am doing everything right, I used the same images I trained on as my test data and it did put the bounding boxes and the values of probabilities are also high around 0.9. Do you suggest training on a larger dataset using the overfit model?

eugtanchik commented 7 years ago

I have the same problem training a 2-class model on my own toy dataset. The training process converges according to the decreasing loss, but nothing is drawn during testing. What am I doing wrong?

hemavakade commented 7 years ago

Update: I got it working! I have bounding boxes. I used yolo.weights and yolo.cfg. I think this is trained on COCO dataset which is much better for the dataset and classes I am using.

eugtanchik commented 7 years ago

@hemavakade, Obviously, I have boxes with yolo.weights and yolo.cfg too. But I want it to work with my own dataset under darkflow so that I can fine-tune the model further.

hemavakade commented 7 years ago

@eugtanchik I am not sure I understood you. I loaded the yolo.weights but used it to overfit my dataset. Do you mean to say yolo.cfg and yolo.weights are not in yolo - v2?

eugtanchik commented 7 years ago

@hemavakade, I mean that yolo.weights was trained with the darknet framework, or am I wrong? Sure, this is YOLOv2, but what about the number of classes in your case? It is not clear to me what to do if my classes are not included in the COCO dataset. As far as I know, darkflow has no way to produce yolo.weights, only the tensorflow model format or protobuf.

hemavakade commented 7 years ago

@eugtanchik well I have more classes. I was trying to get it work with a small number of classes.

To train further for other classes, I will try the following options.

eugtanchik commented 7 years ago

@hemavakade, Maybe this is a good idea. But there must be a way to train any model from scratch without pre-training. It seems to me that there is some bug in the code. I have not found it yet.

eugtanchik commented 7 years ago

My problem was fixed simply by running more training steps, after which I saw some detections. It works fine!

dkarmon commented 7 years ago

The solutions suggested here didn't solve my problem. I used pre-trained weights to train my model on a different dataset with fewer classes. During the training process, the loss decreased and converged at some point. Afterward, I tried testing the output model on both the test and train datasets, and in both cases there are no bounding boxes.

Please advise!

nattari commented 7 years ago

I am facing a similar issue. I trained on my own dataset with 3 classes using the pre-trained ImageNet model, i.e. darknet19_448.23, for YOLOv2. I do not see any bounding boxes. I am using the default settings, but does the anchor box parameter need to be updated depending on your data? Any help in this context would be very useful!

denisli commented 7 years ago

Same issue.

Here are the steps I took: I copied tiny-yolo-voc.cfg file to yolo-new.cfg file. Although I am really looking for 6 classes, I am training for 20 since I could not figure out how to change the number of classes without causing tensors to be inappropriately sized. I was training from scratch and reached a loss of 0.6 or 0.7.

When testing with both the training set and testing set, there were no bounding boxes.

If someone could advise how to change from 20 classes to 6 classes, that would be appreciated as well.

nattari commented 7 years ago

It worked for me. It is relatively easy in YOLOv2 to change the config file to incorporate your data (no additional changes). You need to train for more iterations. Initially, I wasn't detecting any bounding boxes, but after training for 40k iterations I could finally see detections, though the result was poor (you need to tune the anchors). I used pre-trained ImageNet weights.

thtrieu commented 7 years ago

I'll reopen this issue since a lot of people are complaining about it. However it is worth noting that different users have different experience while training. Some succeeded, some did not. Please share your experience here.

For me, the absolute thing to do before training is to overfit the network on 3-5 images from random weights. Only when you are able to obtain reasonable detections, proceed to train until convergence. Convergence is a tricky concept; in many cases, the loss no longer decreasing does not mean convergence.

In the YOLOv2 output, there are x, y, w, h representing coordinates, c for objectness and a probability distribution over classes, all repeated for B boxes on S x S grid cells. Take all of these into account when calculating your expected convergence loss. I may develop some feature to help evaluate such a value.
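To get a feel for how many values the loss is summed over, the layout described above can be counted directly. This is a rough sketch, not darkflow code: the function name is hypothetical, and S = 13 assumes a 416x416 input with a total stride of 32.

```python
def yolo_v2_output_size(S, B, C):
    """Count the values a YOLOv2 head predicts for one image."""
    per_box = 4 + 1 + C  # x, y, w, h + objectness c + C class scores
    return {
        "per_box": per_box,
        "per_cell": B * per_box,
        "total": S * S * B * per_box,
        "coords": S * S * B * 4,
        "objectness": S * S * B,
        "class_scores": S * S * B * C,
    }


sizes = yolo_v2_output_size(S=13, B=5, C=2)
print(sizes["per_cell"])  # 35
print(sizes["total"])     # 5915
```

Most of the S x S x B boxes contain no object, so a small total loss can still leave the few boxes that matter poorly fitted, which is one reason a loss value alone says little about detections.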

Only when you have a rough estimate of such value, then you are able to declare convergence with confidence. Otherwise, it is highly likely that your net simply got stuck.

denisli commented 7 years ago

I did not mention this, but after running for 40,000 iterations, I got bounding boxes as well. Although I did run it for only a single class this time. It seems that it just takes more training to get bounding boxes.

jasag commented 7 years ago

@denisli And did you notice any change in the error before you saw that you were getting bounding boxes? Or did you let it train for a lot of iterations without any sign that it was working?

denisli commented 7 years ago

@jasag Yes, good question. It pretty much oscillates slowly between 1.0 and 1.4 now. It would have been like this at like 10,000 iterations as well, I think.

I might have jumped the gun when I said that you needed more iterations to see bounding boxes. I was training with 6 classes before. Now I am training on just a single class. That might have made a difference. Unfortunately, I didn't bother to check this time at 10,000 iterations and had already deleted those checkpoints, so do not know if it would have shown bounding boxes then.

I encourage you to try more iterations anyway. I will run with my original 6 classes and let everyone know how it goes in about a week or so.

denisli commented 7 years ago

Here it is:

The task is to detect traffic lights into 6 classes: green, yellow, red, green left, yellow left, and red left. The bounding boxes are all quite small. I used basically the same configuration as from yolo-new.cfg, but changed it so that it would handle 6 classes instead of 20. The results are shown below.

step 1001 - loss 95.0399169922 - moving ave loss 95.0399169922
step 2001 - loss 69.532623291 - moving ave loss 69.532623291
step 3001 - loss 39.0838623047 - moving ave loss 39.0838623047
step 4001 - loss 17.3892593384 - moving ave loss 17.3892593384
step 5001 - loss 6.25845241547 - moving ave loss 6.25845241547
step 6001 - loss 2.90494060516 - moving ave loss 2.90494060516
step 7001 - loss 1.37821388245 - moving ave loss 1.37821388245
step 8001 - loss 1.00944340229 - moving ave loss 1.00944340229
step 9001 - loss 2.86516594887 - moving ave loss 2.86516594887
step 10001 - loss 6.4756526947 - moving ave loss 6.4756526947
step 11001 - loss 2.20858240128 - moving ave loss 2.20858240128
step 12001 - loss 2.07890105247 - moving ave loss 2.07890105247
step 13001 - loss 2.71276283264 - moving ave loss 2.71276283264
step 14001 - loss 3.06041097641 - moving ave loss 3.06041097641
step 15001 - loss 2.01983118057 - moving ave loss 2.01983118057
step 16001 - loss 3.11811351776 - moving ave loss 3.11811351776

I tested on some of these checkpoints:

And it probably gets better from here on up to a certain point.

The conclusion is that lower loss does not necessarily mean that it will have bounding boxes. The more iterations you run it, the more likely you will get bounding boxes.
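Since the loss alone did not predict which checkpoints produce boxes, it can help to pull the numbers out of the log and then test several checkpoints rather than only the last one. A small sketch that parses lines in the format shown above; the helper name is made up:

```python
import re

# matches lines like: step 8001 - loss 1.009 - moving ave loss 1.009
LOG_LINE = re.compile(
    r"step (\d+) - loss ([0-9.eE+-]+) - moving ave loss ([0-9.eE+-]+)"
)


def parse_log(text):
    """Return a list of (step, loss, moving_avg_loss) tuples."""
    rows = []
    for m in LOG_LINE.finditer(text):
        rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows


log = """\
step 7001 - loss 1.37821388245 - moving ave loss 1.37821388245
step 8001 - loss 1.00944340229 - moving ave loss 1.00944340229
step 9001 - loss 2.86516594887 - moving ave loss 2.86516594887
"""
rows = parse_log(log)
best = min(rows, key=lambda r: r[1])
print(best[0])  # 8001
```

The step with the lowest loss is only a candidate; as reported in this thread, later checkpoints with a higher loss may detect better.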

khorchefB commented 7 years ago

I don't understand why sometimes, at a lower loss, the result is not good. It's strange!!

jasag commented 7 years ago

I commented a few weeks ago on another issue that I was not able to get bounding boxes. And I am realizing that I never questioned the size of the images, which are high definition, in training and in prediction. How should I parameterize my model for object recognition in images of this type? Because I suppose that there may be some kind of influence, or am I wrong?

denisli commented 7 years ago

@jasag My dataset uses 640x480 .png files.

minhnhat93 commented 7 years ago

I got this issue too. Trained on a whole dataset of 60000+ images with 2 classes using both finetuning and transfer learning. The loss went down very fast to ~0 after a few hundred iterations. Batch size is 8.

step 17795 - loss 3.470731542165595e-07 - moving ave loss 3.4845578592865537e-07
step 17796 - loss 3.454373427302926e-07 - moving ave loss 3.4815394160881907e-07
step 17797 - loss 3.467374085630581e-07 - moving ave loss 3.48012288304243e-07
step 17798 - loss 3.46648846516473e-07 - moving ave loss 3.47875944125466e-07

However, when I test the model on both the train and test data, there are no bounding boxes. Please advise. The command I used to run the test is:

flow --model cfg/yolo-idot.cfg --load -1 --gpu 0.0 --imgdir /home/nhat/IDOT-convert/IDOT_dataset/train/frames --labels idot-labels.txt

Another thing that may be relevant is that my training set comes from a video, so there are many duplications in the training set.

Edit: It seems people are having a similar issue at #142 too.

abhiishekpal commented 6 years ago

We have to set the minimum score threshold in order to see the bounding box.
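To see what the threshold is doing, it helps to look at the raw confidences before anything is filtered. The sketch below assumes detections are dicts with a confidence key, which is the shape the darkflow README shows for return_predict; the sample values and both helper names are made up for illustration.

```python
def filter_boxes(detections, threshold):
    """Keep detections whose confidence is at least threshold."""
    return [d for d in detections if d["confidence"] >= threshold]


def confidence_summary(detections):
    """Return (count, max confidence) to help pick a sensible threshold."""
    if not detections:
        return 0, None
    return len(detections), max(d["confidence"] for d in detections)


# made-up detections for illustration only
dets = [
    {"label": "yoda", "confidence": 0.0008},
    {"label": "yoda", "confidence": 0.91},
    {"label": "vador", "confidence": 0.02},
]
print(len(filter_boxes(dets, 0.5)))      # 1
print(len(filter_boxes(dets, 0.00001)))  # 3
```

If the maximum confidence on the training images themselves is far below the threshold, lowering the threshold only exposes noise, and the model most likely needs more training.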

JaySinghh commented 6 years ago

@hemavakade and @denisli could you please guide me on how you trained on your own dataset and got the bounding boxes in test? Let me know if you can share your view on the screen below.

[Screenshot from 2018-01-25 15-20-42; image not preserved]

Thanks in advance.

onurbarut commented 6 years ago

@JaySinghh hey, are you using darknet? As far as I know darkflow doesn't provide the above terminal information flow. If it is darkflow, can you share how you managed it? Cuz I need to learn about IOU and recall rate.

SamNew1 commented 6 years ago

I used the yolov3 pre-trained model to train on my own dataset. I found it can detect the target before 900 iterations, but it cannot detect any target after 10000 iterations, including with the final weights. Do you know what I should do? Should I change the learning rate while it is training?

sharoseali commented 6 years ago

Could anyone please give me links to the correct Yolov2-voc.cfg and its corresponding weights file? I started training with the one downloaded from the official yolo site but I got this error. Help me please, thanks in advance. This is the error:

C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\dark\darknet.py:54: UserWarning: ./cfg/yolov2-voc.cfg not found, use cfg/yolo2-voc-1c.cfg instead
  cfg_path, FLAGS.model))
Parsing cfg/yolo2-voc-1c.cfg
Loading bin/yolov2-voc.weights ...
Traceback (most recent call last):
  File "flow", line 6, in
    cliHandler(sys.argv)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\cli.py", line 26, in cliHandler
    tfnet = TFNet(FLAGS)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\net\build.py", line 58, in __init__
    darknet = Darknet(FLAGS)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\dark\darknet.py", line 27, in __init__
    self.load_weights()
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\dark\darknet.py", line 82, in load_weights
    wgts_loader = loader.create_loader(args)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\utils\loader.py", line 105, in create_loader
    return load_type(path, cfg)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\utils\loader.py", line 19, in __init__
    self.load(args)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\utils\loader.py", line 77, in load
    walker.offset, walker.size)
AssertionError: expect 202314760 bytes, found 202704264
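The two byte counts in the assertion differ by 389504. One reading that fits this number exactly, offered as an assumption rather than a diagnosis: the fallback cfg (yolo2-voc-1c.cfg, presumably a 1-class model) has a smaller last layer than the 20-class weights file, and the file header is 4 bytes longer than the loader expects. The sketch assumes 5 anchors, a 1x1 last convolution with 1024 input channels and no batch normalization, and 4-byte floats; the helper name is made up.

```python
expected, found = 202314760, 202704264
gap = found - expected  # 389504


def last_layer_bytes(num_classes, in_channels=1024, num_anchors=5):
    """Bytes stored for a 1x1 conv layer feeding a YOLOv2 region layer."""
    filters = num_anchors * (5 + num_classes)
    # 1x1 kernel weights plus one bias per filter, 4 bytes each
    return 4 * (filters * in_channels + filters)


layer_gap = last_layer_bytes(20) - last_layer_bytes(1)  # 389500
print(gap, layer_gap, gap - layer_gap)  # 389504 389500 4
```

Under these assumptions the mismatch would come mostly from pairing a 20-class weights file with a 1-class cfg, with the remaining 4 bytes matching a header-size difference.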

youyuge34 commented 6 years ago

@sharoseali It is because some .weights files on the official darknet website have been updated. The offset for reading the file has changed. To fix it, I strongly recommend that u use yolo.cfg and yolo.weights together, which are trained on the COCO dataset. Or use tiny-yolo-voc.cfg together with tiny-yolo-voc.weights trained on voc2007. These two sets are correct by my own test. (PS: my yolo above means your yolov2 model.) Otherwise, u should change the source code of the .py file which reads and parses the .weights file. Change the offset 16 to 20 (or 20 to 16). But it may cause unknown side effects.
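Before editing the offset by hand, the header of a .weights file can be inspected to see which layout it uses. This is a sketch based on my reading of darknet's weight loader, not on darkflow's code: three 32-bit ints (major, minor, revision) followed by a "seen" counter that is 64-bit when major*10 + minor >= 2 and 32-bit otherwise. It assumes a little-endian file written by a 64-bit build; the function name is hypothetical.

```python
import struct


def read_darknet_header(path):
    """Read the version header of a darknet .weights file."""
    with open(path, "rb") as f:
        major, minor, revision = struct.unpack("<3i", f.read(12))
        if major * 10 + minor >= 2 and major < 1000 and minor < 1000:
            (seen,) = struct.unpack("<Q", f.read(8))  # 64-bit counter
            header_bytes = 20
        else:
            (seen,) = struct.unpack("<i", f.read(4))  # 32-bit counter
            header_bytes = 16
    return {
        "major": major,
        "minor": minor,
        "revision": revision,
        "seen": seen,
        "header_bytes": header_bytes,
    }
```

Calling read_darknet_header("bin/yolo.weights") and checking header_bytes shows whether the file matches the 16 or the 20 mentioned above, which is safer than switching the constant and hoping.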

sharoseali commented 6 years ago

Yes, the file exists. This is the issue: even though the file exists, it shows me this message. I give the full path and darkflow responds like this:

WARNING:tensorflow:From C:\Users\MIPRG-P2\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
Parsing C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\cfg\tiny-yolo-voc-1c.cfg
Traceback (most recent call last):
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\realtimeDetection.py", line 12, in
    tfnet = TFNet(options)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\net\build.py", line 64, in __init__
    self.framework = create_framework(*args)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\net\framework.py", line 59, in create_framework
    return this(meta, FLAGS)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\net\framework.py", line 15, in __init__
    self.constructor(meta, FLAGS)
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\net\yolo\__init__.py", line 20, in constructor
    misc.labels(meta, FLAGS) #We're not loading from a .pb so we do need to load the labels
  File "C:\Users\MIPRG-P2\Desktop\dark\darkflow-master\darkflow\net\yolo\misc.py", line 36, in labels
    with open(file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'labels.txt'
Loading None ...
Finished in 0.0s
[Finished in 5.002s]

sharoseali commented 6 years ago

youyuge34 thanks for replying. You mention the yolo.cfg and yolo.weights files; where can I get them? From the old yolo site? And what type of results can I expect from this model?

Actually, I want to train with yolo v2 or yolo v3. How can I do that? Maybe I should shift to Linux? What's your opinion?

youyuge34 commented 6 years ago

@sharoseali

FileNotFoundError: [Errno 2] No such file or directory: 'labels.txt' 

If u train your own .cfg, then u should manually add the 'labels.txt' file in the root dir. Otherwise, if u review the source code in darkflow-master\darkflow\net\yolo\misc.py, u will find that if you use the original .cfg file, it will auto load cfg/coco.names in place of labels.txt. So u can simply add your own cfg name into the list in misc.py. Read the source code, feel free to change it.
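The lookup described above can be paraphrased roughly as follows. This is a simplified sketch, not the actual misc.py: it leaves out the COCO branch, the function names are made up, and the model list is the voc_models list quoted later in this thread. The point it illustrates is that a relative 'labels.txt' is resolved against the directory the script is started from.

```python
import os

# model names copied from the voc_models list quoted in this thread
VOC_MODELS = ['yolo-full', 'yolo-tiny', 'yolo-small',
              'yolov1', 'tiny-yolov1',
              'tiny-yolo-voc', 'yolo-voc']


def labels_source(cfg_path, labels_path="labels.txt"):
    """Say where the labels would come from for a given cfg (simplified)."""
    model = os.path.splitext(os.path.basename(cfg_path))[0]
    if model in VOC_MODELS:
        return "built-in VOC labels"
    # a relative path is resolved against the current working directory
    return os.path.abspath(labels_path)


def write_labels(labels, path="labels.txt"):
    """Write one class name per line."""
    with open(path, "w") as f:
        f.write("\n".join(labels) + "\n")


print(labels_source("cfg/tiny-yolo-voc.cfg"))  # built-in VOC labels
print(labels_source("cfg/tiny-yolo-voc-1c.cfg"))
```

With a renamed cfg such as tiny-yolo-voc-1c.cfg, the second call prints the absolute path where labels.txt is expected, so the file has to exist in the directory the script is run from, or its path has to be passed explicitly (the --labels flag is used that way earlier in this thread).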

youyuge34 commented 6 years ago

@sharoseali It looks like u r not familiar with this project. The yolo in this project's Darkflow/cfg means yolov2, and Darkflow/cfg/v1 means yolov1. The .cfg files all exist already. The .weights files can be downloaded from the darknet official website: https://pjreddie.com/darknet/yolov2/

In detail, yolo.weights refers to this one: YOLOv2 608x608 | COCO trainval. The tiny-yolo weights I mentioned above mean this one: Tiny YOLO | VOC 2007+2012.

sharoseali commented 6 years ago

youyuge34 you mentioned editing the list. Could u please say which list u mean?

This one (1):

labels20 = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

or this one (2):

voc_models = ['yolo-full', 'yolo-tiny', 'yolo-small', # <- v1
              'yolov1', 'tiny-yolov1', # <- v1.1
              'tiny-yolo-voc', 'yolo-voc']

Thanks again.

sharoseali commented 6 years ago

Okay, I will try the files you mentioned. I thought I had tried this one in the past (YOLOv2 608x608), but it mismatched with the cfg. However, I will try again. Please reply to the question about which list in misc.py you meant.
Thanks youyuge34 for your help; expecting more help from u.

youyuge34 commented 6 years ago

@sharoseali if u r training on your own dataset, u must add your own 'labels.txt'. If training voc, add your cfg name into the 'voc_models' list.

sharoseali commented 6 years ago

Yes, I am training on my own dataset with only 1 class. I have added my label name to the 'labels.txt' file in the darkflow-master folder and mentioned my cfg file in misc.py, but I am getting the same thing. Can I add my label name to this one?

labels20 = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

youyuge34 commented 6 years ago

@sharoseali It feels like I confused u. Just add the labels.txt and leave the source code unmodified.

sharoseali commented 6 years ago

Where can I add labels.txt?