[Open] Samleo8 opened this issue 5 years ago
More interesting observations
I compared the image from the downloaded dataset vs the one from the COCO website: I checked both using the identify command, and realised that both "formats" are different. Could this be the source of the problem?
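For reference, such a format check can be done with ImageMagick's identify command. A sketch - the directory names are placeholders and the file name is just an example taken from later in the thread:
identify pjreddie_download/COCO_train2014_000000478643.jpg
identify coco_website/COCO_train2014_000000478643.jpg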
@Samleo8 Hi,
How many corrupted pictures have you found?
@AlexeyAB Hi,
Only one - the one I posted.
However, I checked both using the identify command, and realised that both "formats" are different.
This applies for ALL the images.
Can you help me with this problem? I've tried working on it for 3 days already, and am getting quite frustrated :(
@Samleo8
So what is the problem?
I can successfully train the yolov3-spp.cfg model with these 49 missing labels by using my repository: https://github.com/AlexeyAB/darknet
By using a modified script that downloads images from the MS COCO site instead of pjreddie.com: https://github.com/AlexeyAB/darknet/blob/master/scripts/get_coco_dataset.sh
By using the command:
./darknet detector train cfg/coco.data yolov3-spp.cfg darknet53.conv.74 -map
or by training in a terminal without the X Window system (like Amazon EC2, ...):
./darknet detector train cfg/coco.data yolov3-spp.cfg darknet53.conv.74 -map -dont_show
Content of cfg/coco.data
classes= 80
train = E:/MSCOCO/trainvalno5k.txt
#train = E:/MSCOCO/5k.txt
valid = E:/MSCOCO/5k.txt
names = data/coco.names
backup = backup
eval=coco
Also you can train a COCO XNOR (1-bit) model instead of the 32-bit one (with the same classes=80 and filters=255 for each [yolo] layer):
./darknet detector train cfg/coco.data yolov3-spp_xnor_obj.cfg darknet53_448_xnor.conv.74 -map
And compile with GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile.
If you have a GeForce RTX, then also un-comment this line: https://github.com/AlexeyAB/darknet/blob/d7a95aefb2209275af9145f8849962c46a00b39b/Makefile#L28
As described here: https://github.com/AlexeyAB/darknet/issues/2365#issuecomment-462923756
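Concretely, the relevant Makefile settings would look roughly like this - a sketch only; the ARCH line is the one linked above, and the exact gencode values are an assumption, so verify them against your GPU's compute capability:
GPU=1
CUDNN=1
CUDNN_HALF=1
# For GeForce RTX (compute capability 7.5) cards, un-comment the matching ARCH line, e.g.:
# ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]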
That's what's stumping me - I do not understand where the problem is.
By failing, I mean that the training throws lines like this (note the NaNs):
Region Avg IOU: 0.000000, Class: nan, Obj: nan, No Obj: nan, Avg Recall: 0.000000, count: 63
1) It doesn't seem to be the missing labels, as I was able to run and train it successfully on another server (but with only 1 Tesla GPU).
2) As this new server uses 4x Titan GPUs (I have to check the exact models), I am going to see whether using the Makefile configuration you specified here helps.
3) I'm using YOLOv2, following exactly the instructions by @pjreddie here.
4) I didn't use CUDNN_HALF=1 - what does that do?
What is most interesting is that it only fails at certain iterations, and at certain images. Even more interesting: if I try to run from the backup files (which are OK), it'll run fine for a while before failing at random iterations (but usually more than ~2000 iterations later).
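(For reference, running from a backup here means passing the last saved weights file back to darknet as the weights argument - a sketch with placeholder cfg and weights names:)
./darknet detector train cfg/coco.data cfg/yolov2.cfg backup/yolov2.backup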
Thanks for your help!
Region Avg IOU: 0.000000, Class: nan, Obj: nan, No Obj: nan, Avg Recall: 0.000000, count: 63
This line is normal
848: -nan, -nan avg, 0.002068 rate, 1.386888 seconds, 217088 images
But this line is bad. As noted in the README (https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects):
Note: If during training you see nan values for the avg (loss) field - then training goes wrong, but if nan is in some other lines - then training goes well.
Try to train with this repository: https://github.com/AlexeyAB/darknet with the flag -check_mistakes (and -dont_show if you train on a remote server). It gives more debug messages.
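For example, combining -check_mistakes and -dont_show with the earlier training command (same paths as above):
./darknet detector train cfg/coco.data yolov3-spp.cfg darknet53.conv.74 -map -dont_show -check_mistakes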
Also check your bad.list and bad_label.list files.
Alright I'll give it a try!
Same problem... I just deleted these 46 labels...
The problem persists after deleting the labels. Perplexingly, it works under another user?!? Perhaps there is some issue with the config or something.
I just have the missing labels problem, no NaN. I deleted the missing labels and it worked well.
I checked the image you found corrupt (COCO_train2014_000000478643.jpg). It is not the one you posted. Did I get the wrong dataset?
Nope - thanks for pointing that out! I think I might have posted the wrong image number... I'll update it soon.
Labels not found
As issue #1003 has noted before, there are missing COCO labels in the val2014 dataset.
I used a bash script to find out which of the labels are the problematic ones:
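(The original script isn't reproduced here; below is a minimal sketch of the same check, assuming the usual darknet layout where each images/... .jpg path in the training list has a matching labels/... .txt file.)
#!/bin/bash
# Sketch: for every image path in the training list, check whether the
# corresponding darknet label .txt file exists and report the ones that don't.
missing=0
while read -r img; do
  label="${img/images/labels}"   # images/... -> labels/...
  label="${label%.jpg}.txt"      # .jpg -> .txt
  if [ ! -f "$label" ]; then
    echo "Missing label: $label"
    missing=$((missing + 1))
  fi
done < trainvalno5k.txt
echo "$missing labels not found"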
That's a total of 46 labels not found.
I was wondering why this is the case, and whether it can be fixed or the labels removed by default.
Training throws NaN halfway
However, unlike the issue at #1003, when I try training the COCO dataset on my computer (4 GPUs, if that's a potential problem), I get this problem where it throws NaN halfway. Below is a snapshot of the iterations and their results from when it starts to throw NaN:
Interesting observations
1) This NaN is thrown even AFTER removing the faulty labels from the trainvalno5k.txt file.
2) NaN is thrown in some parts of the 834th to 848th iterations, and afterwards all other iterations gloriously fail with NaN for everything.
3) It fails on different iterations each time?!?
Can anyone help me fix this problem? It's very annoying because I can only really check whether any workarounds work after the 800th+ iteration...