sonalambwani opened this issue 6 years ago
No, you don't need to change your training set. You need to calculate your anchors as you did previously for yolo2, but multiply by 32 (and round). Then split the anchors among the [yolo] layers; if you have 9 anchors you can split them 3 ways, deciding based on size. Each anchor needs (5 + number of classes) outputs, so the conv layer before each [yolo] layer should have anchors_per_layer * (5 + classes) filters. I got OK results with the default anchors, but you could recompute. Remember your anchor calculation should use the same scale as the network's input size.
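The multiply-by-32 step above can be sketched as follows. YOLOv2 anchors are expressed in grid-cell units and the network downsamples by a stride of 32, so multiplying by 32 (and rounding) converts them to the input-pixel units YOLOv3 expects. The YOLOv2 anchor values below are hypothetical examples, not real defaults:

```python
# YOLOv2 anchors are in grid-cell units; the network stride is 32, so
# multiplying by 32 (and rounding) gives the pixel units YOLOv3 uses.
# These YOLOv2 anchor values are hypothetical examples, not defaults.
yolov2_anchors = [(0.57, 0.67), (1.87, 2.06), (3.33, 5.47),
                  (7.88, 3.52), (9.77, 9.17)]

stride = 32
yolov3_anchors = [(round(w * stride), round(h * stride))
                  for w, h in yolov2_anchors]
print(yolov3_anchors)  # [(18, 21), (60, 66), (107, 175), (252, 113), (313, 293)]
```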
@sonalambwani Just wait about 1000 iterations, and the nans will disappear: https://github.com/AlexeyAB/darknet/issues/504#issuecomment-377290060
With width=416 and height=416:

darknet.exe detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -height 416

You can use these anchors in your cfg-file (without multiplying by 32).
@AlexeyAB Hello, but after waiting about 1000 iterations, nans still appear:
Hi, I am trying to train YOLO on VOC.
This is the command I am using: ./darknet detector train cfg/voc.data cfg/yolov3-voc.cfg darknet53.conv.74
But the nans keep increasing. Is this normal, or is something wrong?

Loaded: 0.000063 seconds
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: nan, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: nan, .5R: -nan, .75R: -nan, count: 0
3296: -nan, nan avg, 0.001000 rate, 0.416401 seconds, 3296 images
I have the same issue; the error shows up in the second yolo layer. Did you solve this problem?
Same here.
@ss199302 If only some lines show nan, then training is going well; but if all lines show nan, then training has gone wrong.
@AlexeyAB As you suggested, I am now training my new dataset with the default COCO anchor boxes. I am training from scratch, i.e., with no initialization from the pretrained convolutional weights, as you did in https://github.com/AlexeyAB/darknet/issues/504#issuecomment-377290060
For me, I see nans even after 2500 iterations. The loss (after starting off really high) has dropped to a reasonable range, but it fluctuates more between mini-batches.
Have you, or anyone else here, noticed similar behavior?
For me, I see nans even after 2500 iterations.
All lines have nan values, or only some lines?
How many classes and images in your dataset? And what tool did you use for labeling?
What batch and subdivision do you use?
Do you use random=1?
Do you train using multi-GPU?
It's just a few lines with nans.
Used an in-house tool for labeling.
batch=16, subdivisions = 16
Not sure about random=1. Where do I check/set that??
It's a single GPU.
@AlexeyAB "How many classes and images in your dataset? And what tool did you use for labeling?"
5 classes, ~17k images in the training set.
@sonalambwani Looks like normal output of training.
You have batch and subdivisions of 16, which means one image per sub-batch. Depending on the density of objects in your images, it's possible that no object will be matched in a given layer, which leads to nan. It also depends on whether the ground truths are similar to the anchors: if they are all very small or all very large, the layers for very large or very small objects may never match them.
So I agree with @AlexeyAB that this looks normal. Can you reduce the subdivisions so that more images go into each mini-batch?
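The batch/subdivisions arithmetic behind this advice can be sketched as follows (a simplified view of darknet's behavior, not its actual code):

```python
# darknet loads batch / subdivisions images per forward/backward pass
# (weights are still updated once per full batch).
def images_per_subbatch(batch, subdivisions):
    return batch // subdivisions

# With the poster's settings, each sub-batch holds a single image, so a
# [yolo] layer that matches no object in that image prints nan (count: 0).
print(images_per_subbatch(16, 16))  # 1
# Lowering subdivisions puts more images in each sub-batch:
print(images_per_subbatch(64, 16))  # 4
```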
I have the same issue with batch=64, subdivisions=8. I have followed this instruction, but I didn't really understand whether I should change the anchors in yolo-obj.cfg when I have my own dataset.
@ndg123 Thank you for your suggestions. I am now testing with batch = 64 and subdiv=16. Right off the bat, I see fewer nans. There are a few, but it's looking better.
From my training on a custom dataset: if not all of them are nans, it is fine. Since there are 3 different scales, it just means that at some scale no object was matched. You could try a different input image size, or split the anchors into 2 or 4 different scales instead of 3; the number of nans should then change.
@AlexeyAB Thanks for your reply, but I can't test anything.
@AlexeyAB darknet.exe detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -height 416: how do I write this command in darknet on Ubuntu?
@ss199302 ./darknet detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -height 416
I am trying to run calc_anchors on Linux using what @TheMikeyR says, and it returns to the command line immediately with no output. Is it supposed to print the anchors to stdout? I'm new to C. Where can I find the code this command runs?
Also, I'm training on my own data, and the bounding boxes in my training data are all the exact same size, and they are all squares. Do I still need to specify more than one anchor?
Is it possible to detect signatures (or any handwritten area) in printed receipts using YOLO? Which cfg file would be best for this, and do you have any suggestions before I start?
@brieh Try AlexeyAB's repo https://github.com/AlexeyAB/darknet Here is the code: https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L839
@TheMikeyR Thanks. I was using the pjreddie fork.
@AlexeyAB Hello! I use the command ./darknet detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -height 416 to get anchors, but it doesn't return anything.
@ss199302 Same for me. Have you found the solution?
@ss199302 @spenceryue97 did you create the labels (*.txt) files first?
@ss199302 @spenceryue97 and you're definitely using AlexeyAB's fork?
I never ended up getting it working. I didn't want to switch to AlexeyAB's fork because we've modified our fork of pjreddie's fork. I tried copy/pasting the code that does the clustering from AlexeyAB's detector.c to the one I have and remaking, but still gave no output.
@sonalambwani Yes
@brieh I'm using pjreddie's repo
@spenceryue97 @brieh you can just get AlexeyAB's fork, run the calc_anchors and then take the numbers to your cfg in pjreddie's repo.
@AlexeyAB Can you tell me why, when I run recall, my IOU shows nan and my recall and precision are 0.5%? Thanks!
@UgolUgol Have you compared the results between the default anchors and the anchors calculated with the command from https://github.com/AlexeyAB/darknet?
@AlexeyAB I am training on my own objects, and am weirdly getting values for all Region 106 results and -nan for everything else:
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001404, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000535, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.183575, Class: 0.167765, Obj: 0.002698, No Obj: 0.000716, .5R: 0.000000, .75R: 0.000000, count: 1
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001364, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000517, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.112761, Class: 0.219895, Obj: 0.001320, No Obj: 0.000692, .5R: 0.000000, .75R: 0.000000, count: 4
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001196, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000538, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.518243, Class: 0.616705, Obj: 0.000801, No Obj: 0.000739, .5R: 1.000000, .75R: 0.000000, count: 1
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001336, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000534, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.067241, Class: 0.113757, Obj: 0.002734, No Obj: 0.000756, .5R: 0.000000, .75R: 0.000000, count: 7
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001470, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000540, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.064037, Class: 0.159617, Obj: 0.005763, No Obj: 0.000764, .5R: 0.000000, .75R: 0.000000, count: 5
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001454, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000550, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.092829, Class: 0.161946, Obj: 0.004937, No Obj: 0.000723, .5R: 0.000000, .75R: 0.000000, count: 8
813: 6.122332, 6.256896 avg, 0.000437 rate, 8.432507 seconds, 26016 images
It's the consistency that's worrying me. I've got 16 classes with around 4500 images. The one particularly odd thing about my setup is that I've set the height and width for every identified object to 0.01 (e.g. 2 0.808552 0.933797 0.01 0.01), as I only care about the position, not the bounds, of the object. Hopefully that's not messing things up?
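For context, each line of a darknet label file has the form class x_center y_center width height, with the four coordinates normalized to [0, 1] relative to the image. A minimal sketch (parse_label is a hypothetical helper, not part of darknet):

```python
# Each line of a darknet label file is: class x_center y_center width height,
# with the four coordinates normalized to [0, 1] relative to the image.
# parse_label is a hypothetical helper, not part of darknet.
def parse_label(line):
    parts = line.split()
    cls, coords = int(parts[0]), [float(p) for p in parts[1:]]
    assert len(coords) == 4 and all(0.0 <= v <= 1.0 for v in coords), \
        "coordinates must be normalized to [0, 1]"
    return (cls, *coords)

# The fixed-size label from the message above:
print(parse_label("2 0.808552 0.933797 0.01 0.01"))
```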
@jfries289

every identified object to 0.01

So you will always get nan for Regions 82 and 94, but it isn't a problem; training goes well.
But for slightly better accuracy, even if you only need positions, it's better to set the real width and height of the objects, so Yolo will know which of the 3 [yolo] layers (with a higher receptive field, or with higher resolution and less subsampling) should be used to detect each object.
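A rough sketch of the routing idea described above, using the default yolov3.cfg anchors. The helper names are hypothetical and darknet's actual assignment logic lives in its C source, but the principle is that the anchor with the best width/height overlap "owns" the ground truth, and each anchor belongs to exactly one [yolo] layer:

```python
# Sketch of how a ground truth is routed to one of the three [yolo]
# layers: the anchor with the best width/height overlap "owns" the
# object, and each anchor belongs to exactly one layer via its mask.
# Anchor values are the defaults from yolov3.cfg (416x416 input).
ANCHORS = [(10, 13), (16, 30), (33, 23),       # mask 0-2 -> layer 106
           (30, 61), (62, 45), (59, 119),      # mask 3-5 -> layer 94
           (116, 90), (156, 198), (373, 326)]  # mask 6-8 -> layer 82

def wh_iou(w1, h1, w2, h2):
    # IoU of two boxes compared by shape only (both centered at origin).
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

def best_layer(obj_w, obj_h):
    best = max(range(9), key=lambda i: wh_iou(obj_w, obj_h, *ANCHORS[i]))
    return (["106"] * 3 + ["94"] * 3 + ["82"] * 3)[best]

# A 0.01-wide box at 416x416 input is ~4 px, so it always lands on the
# smallest anchor and only Region 106 ever gets count > 0.
print(best_layer(4, 4))  # 106
```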
@AlexeyAB Thanks for the reply. It's clear I need to improve my understanding of the regions, etc. However, I'm still not sure my training is going successfully:
2612: 3.408263, 1.774134 avg, 0.001000 rate, 12.983424 seconds, 83584 images
My avg seems to be oscillating between 1.2 and 1.7. At this stage, I would have expected my avg to be lower. Is this the system temporarily stuck in a local minimum, or has something possibly gone wrong?
@jfries289
My avg seems to be oscillating between 1.2 and 1.7. At this stage, I would have expected my avg to be lower. Is this the system temporarily stuck in a local minimum, or has something possibly gone wrong?
I think this is because Yolo can't select the optimal [yolo]-layer (1 of 3), so the last [yolo]-layer predicts objects with a large error, which increases the loss; the difference between the size Yolo predicts during training and the size you set is also very large. Something else may be wrong too. I think you will be able to detect objects, but with low accuracy.
I recommend setting the real sizes of objects using Yolo_mark, then recalculating the anchors, and then starting training from the beginning.
In Yolo v3, labels with the correct sizes of objects help to choose the optimal [yolo]-layer, i.e. help to train with higher accuracy.
@AlexeyAB
In the Yolo v3, the labels with correct sizes of objects help to choose the optimal [yolo]-layer, i.e. help to train with higher accuracy.
If full-sized labels are not an option, would it be better for me to use Yolo v2? Or would I have the same issue there?
@jfries289
What is the range of the real sizes of objects in your dataset?
@AlexeyAB I would guess anywhere from 0.1 to 0.8.
@jfries289
In the Yolo v3, the labels with correct sizes of objects help to choose the optimal [yolo]-layer, i.e. help to train with higher accuracy.
If full-sized labels are not an option, would it be better for me to use Yolo v2? Or would I have the same issue there?
But I have never tested training on a dataset like yours, with constant values of width and height.
I have a dataset of 21k face images. I have already checked the labelled data using yolo_mark. I am using yolov3 with batch=64 and subdivisions=16, and I am getting nan everywhere. Shall I wait for 1000 iterations? Here is the output:

73: -nan, -nan avg loss, 0.000000 rate, 649.007621 seconds, 4672 images
Loaded: 0.000000 seconds
Region 82 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 4
Region 94 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 82 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 4
Region 106 Avg IOU: -nan(ind), Class: -nan(ind), Obj: -nan(ind), No Obj: -nan, .5R: -nan(ind), .75R: -nan(ind), count: 0
Region 82 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 3
Region 94 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 4
Region 82 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 4
Region 94 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 2
Region 106 Avg IOU: -nan(ind), Class: -nan(ind), Obj: -nan(ind), No Obj: -nan, .5R: -nan(ind), .75R: -nan(ind), count: 0
Region 82 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 3
Region 94 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 3
Region 106 Avg IOU: -nan(ind), Class: -nan(ind), Obj: -nan(ind), No Obj: -nan, .5R: -nan(ind), .75R: -nan(ind), count: 0
Region 82 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 3
Region 94 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 3
Region 106 Avg IOU: -nan(ind), Class: -nan(ind), Obj: -nan(ind), No Obj: -nan, .5R: -nan(ind), .75R: -nan(ind), count: 0
Region 82 Avg IOU: -nan(ind), Class: -nan(ind), Obj: -nan(ind), No Obj: -nan, .5R: -nan(ind), .75R: -nan(ind), count: 0
Region 94 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 6
Region 106 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 2
Region 82 Avg IOU: -nan, Class: nan, Obj: -nan, No Obj: -nan, .5R: 0.000000, .75R: 0.000000, count: 5
Region 94 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: -nan(ind), Class: -nan(ind), Obj: -nan(ind), No Obj: -nan, .5R: -nan(ind), .75R: -nan(ind), count: 0
@AlexeyAB You mentioned earlier: "Only if nan occurs for avg loss for several dozen consecutive iterations has training gone wrong; otherwise the training goes well." Can you please give some suggestions on how to correct the training process if we are getting all nans? My training loss keeps increasing, and after some steps all values become -nan.
I encountered this phenomenon on my 3-class dataset, but after training it works well. I think it comes from the scale mismatch between the different output layers.
When no object is matched in a given layer, it prints nan. IOU is simply area of intersection / area of union. I think that output looks normal.
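The IoU formula mentioned above can be sketched for axis-aligned boxes (the iou helper is illustrative, not darknet's implementation):

```python
# Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2).
# Illustrative only; darknet computes this in C.
def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~ 0.1429
# With count 0 there are no boxes to average over, so darknet's
# "Avg IOU" is 0/0, which prints as nan.
```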
@AlexeyAB how many images do you think I should get if I want to add a new class to the COCO dataset?
Hello @MizbaMohammed ,
Are you already aware of the default recommendation? I guess this is not very specific for COCO, but in general: https://github.com/AlexeyAB/darknet#how-to-improve-object-detection
it is desirable that your training dataset include images with objects at different scales, rotations, lightings, from different sides, and on different backgrounds; you should preferably have 2000 or more different images for each class, and you should train for 2000*classes iterations or more
I got confused when setting up the anchor boxes: how should I arrange the sequence of the clustered anchors? They are not distributed as evenly as we might wish; I might get [1, 1, 2, 2, 5, 6, 30, 32, 42] instead of [1,2,3,4,5,6,7,8,9], and I hesitated to just split them evenly across the 3 scales in yolov3. My own experiments have shown that the arrangement of the anchor boxes matters. The output lines for Region 82, Region 94, and Region 106 are another confusion: what do they mean?
Region 82 Avg IOU: 0.790874, Class: 0.993619, Obj: 0.970194, No Obj: 0.002241, .5R: 1.000000, .75R: 0.666667, count: 3
Region 94 Avg IOU: 0.665403, Class: 0.775035, Obj: 0.567849, No Obj: 0.000524, .5R: 0.800000, .75R: 0.200000, count: 5
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000002, .5R: -nan, .75R: -nan, count: 0
How does it know how many objects the batch has at each layer? And if each object is assigned to one layer, wouldn't the distribution of the anchor boxes be a big problem?
Could anyone help with this? Thanks a lot.
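On the first question: calc_anchors clusters the label (width, height) pairs, and the resulting anchors are listed in the cfg sorted by size, with the masks taking anchors 0-2 for the highest-resolution [yolo] layer, 3-5 for the middle one, and 6-8 for the coarsest. A naive sketch of that pipeline (plain Euclidean k-means on synthetic data, unlike darknet's own implementation):

```python
# Naive Euclidean k-means over label (w, h) pairs, then sort by area.
# darknet's calc_anchors uses its own clustering (cvKMeans2); this is
# only an illustration of the idea, run here on synthetic data.
import random

def kmeans_anchors(whs, k=9, iters=50, seed=0):
    random.seed(seed)
    centers = random.sample(whs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in whs:
            i = min(range(k), key=lambda c: (w - centers[c][0]) ** 2
                                          + (h - centers[c][1]) ** 2)
            clusters[i].append((w, h))
        centers = [(sum(w for w, _ in cl) / len(cl),
                    sum(h for _, h in cl) / len(cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    # Ascending by area: anchors 0-2 go to the finest layer, 6-8 to the coarsest.
    return sorted(centers, key=lambda c: c[0] * c[1])

# Synthetic stand-in for real label sizes (pixels at 416x416 input).
whs = [(random.uniform(10, 400), random.uniform(10, 400)) for _ in range(500)]
anchors = kmeans_anchors(whs)
print([(round(w), round(h)) for w, h in anchors])
```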
I have successfully trained a 1-object YOLOv2 detection model, but I still don't understand what role the anchors in the cfg file play. I changed them but didn't see any effect. 1) What is the meaning of anchors? 2) @AlexeyAB: Alexey, what do you mean by "real sizes of objects"? 3) Does anybody have a program for choosing the single best weights file from a trained set of weights, based on annotated test images? 4) What is the advantage of YOLOv3 over YOLOv2?
Hi @AlexeyAB, yolov3-spp sounds good; is there any tutorial on how to train it? Thank you.
When I am trying to calculate the anchors, I get the error k-means++ can't be used without OpenCV, because there is used cvKMeans2 implementation. How do I resolve this?
Hi everyone, Has anyone had success with training YOLOv3 for their own datasets? If so, could you help sort out some questions for me:
For me, I have a 5-class object detection problem. In the .cfg file, I have changed the number of classes and the number of filters to 3*(num_classes+5) = 30 in 3 different places. I can start the training, but the loss blows up at the start and I am seeing a bunch of nans in the output message (see snippet).
Here are my questions:
Thanks!
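For reference, the filters arithmetic used in the question above (3 anchors per [yolo] layer, each predicting 4 box coordinates, 1 objectness score, and one score per class) can be checked with a one-liner:

```python
# Filters in the conv layer directly before each [yolo] layer:
# anchors_per_layer * (4 box coords + 1 objectness + num_classes).
def yolo_filters(num_classes, anchors_per_layer=3):
    return anchors_per_layer * (5 + num_classes)

print(yolo_filters(5))   # 30  (the poster's 5-class cfg)
print(yolo_filters(80))  # 255 (default COCO cfg)
```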