[Open] Samleo8 opened this issue 5 years ago
More interesting observations
I compared the image from the downloaded dataset vs the one from the COCO website: I checked both using the identify command, and realised that both "formats" are different. Could this be the source of the problem?
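For reference, such a format check can be done with ImageMagick's identify command. A sketch - the directory names are placeholders and the file name is just an example taken from later in the thread:
identify pjreddie_download/COCO_train2014_000000478643.jpg
identify coco_website/COCO_train2014_000000478643.jpg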
@Samleo8 Hi,
How many corrupted pictures have you found?
@AlexeyAB Hi,
Only one - the one I posted.
However, I checked both using the identify command, and realised that both "formats" are different.
This applies for ALL the images.
Can you help me with this problem? I've tried working on it for 3 days already, and am getting quite frustrated :(
@Samleo8
So what is the problem?
I can successfully train the yolov3-spp.cfg model with these 49 missing labels by using my repository: https://github.com/AlexeyAB/darknet
By using a modified script that downloads images from the MS COCO site instead of pjreddie.com: https://github.com/AlexeyAB/darknet/blob/master/scripts/get_coco_dataset.sh
By using the command:
./darknet detector train cfg/coco.data yolov3-spp.cfg darknet53.conv.74 -map
or by training in a terminal without the X Window system (like Amazon EC2, ...):
./darknet detector train cfg/coco.data yolov3-spp.cfg darknet53.conv.74 -map -dont_show
Content of cfg/coco.data
classes= 80
train = E:/MSCOCO/trainvalno5k.txt
#train = E:/MSCOCO/5k.txt
valid = E:/MSCOCO/5k.txt
names = data/coco.names
backup = backup
eval=coco
Also you can train a COCO XNOR (1-bit) model instead of the 32-bit one (with the same classes=80 and filters=255 for each [yolo] layer):
./darknet detector train cfg/coco.data yolov3-spp_xnor_obj.cfg darknet53_448_xnor.conv.74 -map
And compile with GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile.
If you have a GeForce RTX, then also un-comment this line: https://github.com/AlexeyAB/darknet/blob/d7a95aefb2209275af9145f8849962c46a00b39b/Makefile#L28
As described here: https://github.com/AlexeyAB/darknet/issues/2365#issuecomment-462923756
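Concretely, the relevant Makefile settings would look roughly like this - a sketch only; the ARCH line is the one linked above, and the exact gencode values are an assumption, so verify them against your GPU's compute capability:
GPU=1
CUDNN=1
CUDNN_HALF=1
# For GeForce RTX (compute capability 7.5) cards, un-comment the matching ARCH line, e.g.:
# ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]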
That's what's stumping me - I do not understand where the problem is.
By failing, I mean that the training throws lines like this (note the NaNs):
Region Avg IOU: 0.000000, Class: nan, Obj: nan, No Obj: nan, Avg Recall: 0.000000, count: 63
1) It doesn't seem to be the missing labels, as I was able to run and train it successfully on another server (but with only 1 Tesla GPU).
2) As this new server uses 4x Titan GPUs (I have to check the exact models), I am going to see whether using the Makefile configuration you specified here helps.
3) I'm using YOLOv2, following exactly the instructions by @pjreddie here.
4) I didn't use CUDNN_HALF=1 - what does that do?
What is most interesting is that it only fails at certain iterations, and at certain images. Even more interesting: if I try to run from the backup files (which are OK), it'll run fine for a while before failing at random iterations (but usually more than ~2000 iterations later).
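(For reference, running from a backup here means passing the last saved weights file back to darknet as the weights argument - a sketch with placeholder cfg and weights names:)
./darknet detector train cfg/coco.data cfg/yolov2.cfg backup/yolov2.backup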
Thanks for your help!
Region Avg IOU: 0.000000, Class: nan, Obj: nan, No Obj: nan, Avg Recall: 0.000000, count: 63
This line is normal
848: -nan, -nan avg, 0.002068 rate, 1.386888 seconds, 217088 images
But this line is bad. As noted in the README (https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects):
Note: If during training you see nan values for the avg (loss) field - then training goes wrong, but if nan is in some other lines - then training goes well.
Try to train with this repository: https://github.com/AlexeyAB/darknet with the flag -check_mistakes (and -dont_show if you train on a remote server). It gives more debug messages.
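For example, combining -check_mistakes and -dont_show with the earlier training command (same paths as above):
./darknet detector train cfg/coco.data yolov3-spp.cfg darknet53.conv.74 -map -dont_show -check_mistakes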
Also check your bad.list and bad_label.list files.
Alright I'll give it a try!
Same problem... I just deleted these 46 labels...
The problem persists after deleting the labels. Perplexingly, it works under another user?!? Perhaps there is some issue with the config or something.
I just have the missing labels problem, no NaN. I deleted the missing labels and it worked well.
I checked the image you found corrupt (COCO_train2014_000000478643.jpg). It is not the one you posted. Did I get the wrong dataset?
Nope - thanks for pointing that out! I think I might have posted the wrong image number... I'll update it soon.
Labels not found
As issue #1003 has noted before, there are missing COCO labels in the val2014 dataset.
I used a bash script to find out which of the labels are the problematic ones:
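(The original script isn't reproduced here; below is a minimal sketch of the same check, assuming the usual darknet layout where each images/... .jpg path in the training list has a matching labels/... .txt file.)
#!/bin/bash
# Sketch: for every image path in the training list, check whether the
# corresponding darknet label .txt file exists and report the ones that don't.
missing=0
while read -r img; do
  label="${img/images/labels}"   # images/... -> labels/...
  label="${label%.jpg}.txt"      # .jpg -> .txt
  if [ ! -f "$label" ]; then
    echo "Missing label: $label"
    missing=$((missing + 1))
  fi
done < trainvalno5k.txt
echo "$missing labels not found"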
That's a total of 46 labels not found.
I was wondering why this is the case, and whether it can be fixed or the labels removed by default.
Training throws NaN halfway
However, unlike the issue at #1003, when I try training the COCO dataset on my computer (4 GPUs, if that's a potential problem), I get this problem where it throws NaN halfway. Below is a snapshot of the iterations and their results from when it starts to throw NaN:
Interesting observations
1) This NaN is thrown even AFTER removing the faulty labels from the trainvalno5k.txt file.
2) NaN is thrown in some parts of the 834th to 848th iterations, and afterwards all other iterations gloriously fail with NaN for everything.
3) It fails on different iterations each time?!?
Can anyone help me fix this problem? It's very annoying because I can only really check whether any workarounds work after the 800th+ iteration...