core dumped while training yolov3 on open images

mosheliv commented 6 years ago

Hi,

I am trying to train YOLOv3 on a subset of Google Open Images. It has 601 classes. After a little while (sometimes two lines of output, sometimes 20, sometimes 60) it core dumps.

Attached please find the cfg and data. The annotations were of course converted, but it is hard to know whether any are wrong, as I have no idea where it core dumped. A random check suggests the conversion was good.

Does anyone have any idea what could cause it?

Regards, Moshe

config.zip

Loading weights from darknet53.conv.74...Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
480
Loaded: 0.000043 seconds
Region 82 Avg IOU: 0.318240, Class: 0.563201, Obj: 0.492985, No Obj: 0.521314, .5R: 0.000000, .75R: 0.000000,  count: 3
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.502343, .5R: -nan, .75R: -nan,  count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.514389, .5R: -nan, .75R: -nan,  count: 0
Region 82 Avg IOU: 0.317945, Class: 0.470210, Obj: 0.605590, No Obj: 0.518882, .5R: 0.000000, .75R: 0.000000,  count: 4
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.503683, .5R: -nan, .75R: -nan,  count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.515904, .5R: -nan, .75R: -nan,  count: 0
Region 82 Avg IOU: 0.054975, Class: 0.447558, Obj: 0.569529, No Obj: 0.520337, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.032020, Class: 0.183908, Obj: 0.205472, No Obj: 0.504832, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.514239, .5R: -nan, .75R: -nan,  count: 0
Region 82 Avg IOU: 0.324950, Class: 0.285847, Obj: 0.393897, No Obj: 0.520361, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.504448, .5R: -nan, .75R: -nan,  count: 0
Region 106 Avg IOU: 0.149500, Class: 0.609727, Obj: 0.358186, No Obj: 0.515050, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 82 Avg IOU: 0.257631, Class: 0.597304, Obj: 0.839517, No Obj: 0.520460, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.456243, Class: 0.295794, Obj: 0.321855, No Obj: 0.504092, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.512602, .5R: -nan, .75R: -nan,  count: 0
Region 82 Avg IOU: 0.366181, Class: 0.292589, Obj: 0.533888, No Obj: 0.519569, .5R: 0.000000, .75R: 0.000000,  count: 3
Region 94 Avg IOU: 0.042411, Class: 0.414390, Obj: 0.598276, No Obj: 0.503701, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.512066, .5R: -nan, .75R: -nan,  count: 0
Region 82 Avg IOU: 0.150158, Class: 0.845353, Obj: 0.687621, No Obj: 0.521024, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.472186, Class: 0.583142, Obj: 0.457525, No Obj: 0.504576, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.515444, .5R: -nan, .75R: -nan,  count: 0
Segmentation fault (core dumped)

mosheliv commented 6 years ago

Just adding some more information: I compiled with debug symbols and ran under gdb, and got the following:

Thread 1 "darknet" received signal SIGSEGV, Segmentation fault.
0x0000000000488312 in get_yolo_box (x=0x4691f490, biases=0xa13bb0, n=8, 
    index=-1376440676, i=12999987, j=12999987, lw=13, lh=13, w=416, h=416, 
    stride=169) at ./src/yolo_layer.c:86
86          b.x = (i + x[index + 0*stride]) / lw;
(gdb) where
#0  0x0000000000488312 in get_yolo_box (x=0x4691f490, biases=0xa13bb0, n=8, 
    index=-1376440676, i=12999987, j=12999987, lw=13, lh=13, w=416, h=416, 
    stride=169) at ./src/yolo_layer.c:86
#1  0x00000000004884d0 in delta_yolo_box (truth=..., x=0x4691f490, 
    biases=0xa13bb0, n=8, index=-1376440676, i=12999987, j=12999987, lw=13, 
    lh=13, w=416, h=416, delta=0x4646f1e0, scale=-9.9999803e+11, stride=169)
    at ./src/yolo_layer.c:95
#2  0x00000000004893ff in forward_yolo_layer (l=..., net=...)
    at ./src/yolo_layer.c:219
#3  0x000000000048a3c0 in forward_yolo_layer_gpu (l=..., net=...)
    at ./src/yolo_layer.c:365
#4  0x0000000000463479 in forward_network_gpu (netp=0x9ca250)
    at ./src/network.c:778
#5  0x00000000004607fd in forward_network (netp=0x9ca250)
    at ./src/network.c:192
#6  0x0000000000460ee8 in train_network_datum (net=0x9ca250)
    at ./src/network.c:293
#7  0x00000000004610c2 in train_network (net=0x9ca250, d=...)
    at ./src/network.c:324
#8  0x000000000041eef2 in train_detector (datacfg=0x7fffffffe797 "oid.data", 
    cfgfile=0x7fffffffe7a0 "cfg/yolov3-oid.cfg", 
    weightfile=0x7fffffffe7b3 "darknet53.conv.74", gpus=0x7fffffffe324, 
    ngpus=1, clear=0) at ./examples/detector.c:118
#9  0x0000000000422a5d in run_detector (argc=6, argv=0x7fffffffe518)
    at ./examples/detector.c:842
#10 0x0000000000426e66 in main (argc=6, argv=0x7fffffffe518)
    at ./examples/darknet.c:434
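One way to read the backtrace: with a 13x13 grid (lw=13, lh=13), a valid cell index must lie in 0..12, yet i and j are both 12999987, which equals 999999 * 13. The sketch below only reproduces that arithmetic. It assumes, from memory of darknet's data.c and not from anything posted in this thread, that a ground-truth box whose x and y are both 0 gets replaced by the placeholder value 999999, and that the YOLO layer derives the cell as `int i = truth.x * l.w`. The names `SENTINEL` and `grid_cell` are hypothetical.

```python
# Assumption: 999999 is the placeholder darknet assigns to a truth box whose
# x and y are both 0. This is recalled from the source, not shown in the thread.
SENTINEL = 999999

# lw and lh as reported in the gdb backtrace.
GRID_W = GRID_H = 13


def grid_cell(center, grid_size):
    """Mimic the C truncation `int i = truth.x * l.w` for a box centre."""
    return int(center * grid_size)


i = grid_cell(SENTINEL, GRID_W)
j = grid_cell(SENTINEL, GRID_H)
print(i, j)  # 12999987 12999987, the same values gdb reports for i and j

# A valid cell index lies in [0, grid_size); this one is far outside.
in_bounds = 0 <= i < GRID_W and 0 <= j < GRID_H
print(in_bounds)  # False

# For comparison, a normalized centre such as x=0.136 maps to cell 1.
print(grid_cell(0.136, GRID_W))  # 1
```

If that assumption holds, the out-of-range cell would explain the invalid index passed to get_yolo_box, and it points at label lines with x=0 and y=0 instead of at an overflow caused by the number of classes.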

liben2018 commented 6 years ago

Just check your label files; maybe some lines contain 0 for the coordinates.

mosheliv commented 6 years ago

Can you elaborate? The label files use ids given by the position of the label in the names file; the first one is 0, if I am not mistaken. Or do you mean an empty ground-truth file? I have made sure this can't happen in the generation process. From a casual look at the code, it seems that because of the large number of classes I have gone over maxint somewhere... However, this code is not easy to read or follow, so I might be wrong.

liben2018 commented 6 years ago

I guess some of your labels/xxx.txt files have x=0 or y=0, like:

    0 0 0 0.059 0.008

The expected line should look like:

    0 0.136 0.043 0.059 0.008

So maybe you can modify the Python file darknet/scripts/voc_label.py:

    def convert(size, box):
        x = (box[0] + box[1])/2.0 - 1
        y = (box[2] + box[3])/2.0 - 1

to generate your label files without 0 for x and y.
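Rather than spot-checking, the label files can be scanned for the pattern described above. The following is a minimal sketch, not part of darknet: `check_label_text` and `check_label_dir` are hypothetical helpers, the default of 601 classes comes from the original post, and the range checks assume the usual `class x_center y_center width height` line format with values normalized to the image size.

```python
from pathlib import Path


def check_label_text(text, num_classes=601):
    """Return (line_number, reason) pairs for suspicious label lines."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 5:
            problems.append((line_no, "expected 5 fields"))
            continue
        try:
            class_id = int(fields[0])
            x, y, w, h = (float(v) for v in fields[1:])
        except ValueError:
            problems.append((line_no, "non-numeric field"))
            continue
        if not 0 <= class_id < num_classes:
            problems.append((line_no, "class id out of range"))
        if x == 0 and y == 0:
            # The case suggested in this thread as the cause of the crash.
            problems.append((line_no, "x and y are both 0"))
        elif not (0 < x < 1 and 0 < y < 1):
            problems.append((line_no, "centre outside (0, 1)"))
        if not (0 < w <= 1 and 0 < h <= 1):
            problems.append((line_no, "width or height outside (0, 1]"))
    return problems


def check_label_dir(label_dir, num_classes=601):
    """Scan every .txt file in label_dir; map file path to its problems."""
    report = {}
    for path in sorted(Path(label_dir).glob("*.txt")):
        problems = check_label_text(path.read_text(), num_classes)
        if problems:
            report[str(path)] = problems
    return report
```

Calling `check_label_dir("labels")` (the directory name is an assumption) returns an empty dict when nothing looks wrong; otherwise each entry names the file and the offending line numbers.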

mosheliv commented 6 years ago

Oh, I see! It uses the box centre x, y plus w and h. As far as I remember I did convert everything, but I'll recheck. Thank you!
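For reference, the centre-based format can be derived from corner coordinates as sketched below. This assumes the input is already normalized to [0, 1] and ordered as XMin, XMax, YMin, YMax, which is how the Open Images box annotations are laid out; `corners_to_yolo` is a hypothetical helper, not the conversion script used in this thread.

```python
def corners_to_yolo(xmin, xmax, ymin, ymax):
    """Convert normalized corner coordinates to (x_center, y_center, w, h)."""
    # Clamp to the image so rounding noise cannot push a box outside [0, 1].
    xmin, xmax = max(0.0, xmin), min(1.0, xmax)
    ymin, ymax = max(0.0, ymin), min(1.0, ymax)
    x_center = (xmin + xmax) / 2.0
    y_center = (ymin + ymax) / 2.0
    width = xmax - xmin
    height = ymax - ymin
    return x_center, y_center, width, height


def format_label(class_id, box):
    """Render one label line: class id followed by the four box values."""
    return "%d %.6f %.6f %.6f %.6f" % ((class_id,) + tuple(box))


box = corners_to_yolo(0.1, 0.3, 0.2, 0.6)
print(format_label(0, box))  # 0 0.200000 0.400000 0.200000 0.400000
```

Because the centre is the midpoint of two corners, it is 0 only for a degenerate box pinned to the image edge, so a label line with x=0 and y=0 usually means the corner values were written out unconverted or the box has no area.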

pjreddie / darknet

core dumped while training yolov3 on open images #970