Training on custom object fails

danieldimit commented 5 years ago

Hi there, I've been trying to train my model on a custom object for a while now. After failing many times, I decided to do it incrementally:

1. I trained a model on the ape from the LINEMOD dataset, and it worked well.
2. I generated a dataset with the ape object, using a picture from LINEMOD as the background. It worked well (link to dataset).
3. I then took the same setup as in step 2 and changed only the 3D model used in the dataset generation tool, swapping in a 3D model of a PSP (and updating the diameter in the new psp.data). (link to dataset) This failed, and I have no idea why. I guessed it could be because the PSP is a lot thinner and its bounding box is not a cube, so I stretched the 3D model to be more cube-like, but that didn't work. I also tried different initial weights and learning rates, but that didn't help either.

Here is a comparison between the 2 datasets that I've created: github_info

And the worst part is that the PSP training stops with an exception. The proposals count goes to 0, most of the other values go to nan, and when training reaches a test phase it crashes:

13090: nGT 8, recall 0, proposals 0, loss: x 150.228271, y 342.138519, conf 0.150508, total 492.517303
13098: nGT 8, recall 0, proposals 0, loss: x 243.437836, y 445.992981, conf 0.248933, total 689.679749
13106: nGT 8, recall 0, proposals 0, loss: x nan, y nan, conf 0.065831, total nan
13114: nGT 8, recall 0, proposals 0, loss: x nan, y nan, conf nan, total nan
13122: nGT 8, recall 0, proposals 0, loss: x nan, y nan, conf nan, total nan
13130: nGT 8, recall 0, proposals 0, loss: x nan, y nan, conf nan, total nan
13138: nGT 8, recall 0, proposals 0, loss: x nan, y nan, conf nan, total nan 
....
xxxxx nGT 8, recall 0, proposals 0, loss: x nan, y nan, conf nan, total nan
xxxxx: nGT 6, recall 0, proposals 0, loss: x nan, y nan, conf nan, total nan
2019-07-03 00:22:07    Testing...
2019-07-03 00:22:07    Number of test samples: 360
Traceback (most recent call last):
  File "train.py", line 399, in <module>
    test(epoch, niter)
  File "train.py", line 168, in test
    all_boxes = get_region_boxes(output, conf_thresh, num_classes, anchors, num_anchors)
  File "/home/daniel/singleshotpose/utils.py", line 446, in get_region_boxes
    bcx0 = xs0[max_ind]
UnboundLocalError: local variable 'max_ind' referenced before assignment
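
The `UnboundLocalError` looks like a downstream symptom of the nan values rather than a separate bug: every comparison against nan is false, so a running-maximum search over nan confidences never assigns its index variable. Below is a minimal sketch of that mechanism, assuming the search inside `get_region_boxes` is a plain loop of roughly this shape. Both function names are hypothetical and are not the repository's code.

```python
import math

def argmax_conf_unsafe(confs):
    """Running-max search that leaves max_ind unassigned if no comparison succeeds."""
    max_conf = -float("inf")
    for i, c in enumerate(confs):
        if c > max_conf:  # always False when c is nan
            max_conf = c
            max_ind = i
    return max_ind  # raises UnboundLocalError if the branch above never ran

def argmax_conf_safe(confs):
    """Same search, but returns None when no finite maximum can be found."""
    max_ind = None
    max_conf = -float("inf")
    for i, c in enumerate(confs):
        if not math.isnan(c) and c > max_conf:
            max_conf = c
            max_ind = i
    return max_ind
```

A guard like the second version would only hide the crash; the nan losses earlier in the log are still the thing to track down.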

Has anyone else had this problem? What could be any potential causes in your opinion? I would be glad to hear from anyone. Thanks in advance!

P.S. I've used the correct diameter in the *.data file, so that shouldn't be the problem. My label files should also be correct; otherwise the dataset from step 2 wouldn't have worked. You can see the numbering in the picture.

danieldimit commented 5 years ago

Update: I figured out when the nan problem appears. It happens when the object in the dataset is too close to the camera. Notice that in the right picture the object is much closer. I am not sure why this happens, but I guess it has something to do with the data augmentation. My guess is that the augmentation scales the image, and when the object already takes up most of the image, zooming in pushes parts of it out of bounds, which leads to nan.
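
One way to probe this hypothesis is to check which labels would leave the image under a zoom. The sketch below assumes that each label stores keypoints as normalized (x, y) pairs in [0, 1]. The helper names, the `scale` factor standing in for the augmentation zoom, and the example coordinates are all illustrative and not taken from the repository.

```python
def keypoints_after_zoom(points, scale):
    """Scale normalized (x, y) points about the image center (0.5, 0.5)."""
    return [(0.5 + (x - 0.5) * scale, 0.5 + (y - 0.5) * scale) for x, y in points]

def out_of_bounds(points):
    """Return the points that fall outside the normalized image area."""
    return [(x, y) for x, y in points if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0)]

# Two made-up objects: one filling most of the frame, one small and centered.
near_object = [(0.05, 0.10), (0.95, 0.90)]
far_object = [(0.40, 0.45), (0.60, 0.55)]

for name, pts in [("near", near_object), ("far", far_object)]:
    escaped = out_of_bounds(keypoints_after_zoom(pts, scale=1.2))
    print(name, "object:", len(escaped), "keypoint(s) out of bounds after zoom")
```

This only flags candidate images; it does not by itself show that out-of-bounds keypoints are what produces the nan losses.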

I've now created datasets where the object is further away: one with the PSP (1200 images) and one with a GUITAR (1200 images). Both of them learn, but not nearly as fast as the synthetic APE dataset. The APE dataset reached a 5px accuracy of 86% at epoch 100. The PSP dataset is at 13% at epoch 590, and the GUITAR dataset is at 8% at epoch 350. I will leave the PSP and GUITAR datasets training over the weekend to see how much the accuracy improves.

btekin commented 5 years ago

Hi @danieldimit, thanks for your question and for your finding about the convergence issue when the object is so large that it takes up most of the image.

About the accuracy numbers:

Are you pretraining the network with the confidence regularization parameter set to 0?
You might also want to consider changing the learning schedule. In the current form of the code, the learning rate is changed according to the current step during iteration (see here). When you have a different number of training examples, consider adjusting the number of steps in that line so that the learning rate schedule fits the particular dataset you are working with.
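
To illustrate the second point, here is a minimal sketch of a step-based schedule and of rescaling its boundaries for a dataset of a different size. The function names, boundary values, and scale factors are assumptions made for the example; they are not the values or the code used in the repository.

```python
def learning_rate_at(step, base_lr, steps, scales):
    """Multiply base_lr by each scale whose step boundary has been reached."""
    lr = base_lr
    for boundary, scale in zip(steps, scales):
        if step >= boundary:
            lr *= scale
        else:
            break
    return lr

def rescale_steps(steps, old_num_samples, new_num_samples):
    """Move boundaries in proportion to dataset size so they land at the same epochs."""
    ratio = new_num_samples / old_num_samples
    return [max(1, round(s * ratio)) for s in steps]

# Illustrative numbers only: a schedule tuned for 200 images, reused for 1200.
steps, scales = [1000, 2000], [0.1, 0.1]
new_steps = rescale_steps(steps, old_num_samples=200, new_num_samples=1200)
print(new_steps)
print(learning_rate_at(1500, 0.001, steps, scales))
print(learning_rate_at(1500, 0.001, new_steps, scales))
```

With a fixed batch size, six times as many images means six times as many iterations per epoch, so an unscaled schedule would decay the learning rate after far fewer passes over the data.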

danieldimit commented 5 years ago

Hi @btekin, I've solved the problem. It wasn't connected to your method. It turned out that something was wrong with the GPU/CUDA setup on the servers at work (probably a mismatch between the GPU, the CUDA version and the driver version). After trying for a month and getting inconsistent results (the problem could occur after 10 or after 10000 iterations), I decided to try it on a Google Cloud GPU and bingo - with the same code, the same dataset and the same training configuration it worked like a charm on the first try.
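
For anyone who suspects a similar mismatch, a quick report of the versions involved can help when comparing two machines. This is a rough sketch that assumes the training code runs on PyTorch; `cuda_environment_report` is a hypothetical helper, and the driver version itself is not covered here and has to be read separately (for example from `nvidia-smi`).

```python
def cuda_environment_report():
    """Collect version information that has to agree for GPU training to be reliable."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["cudnn_version"] = torch.backends.cudnn.version()
    return report

if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print(f"{key}: {value}")
```

Matching output on two machines does not prove the setups are equivalent, but a difference is a good place to start looking.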

I tried setting the confidence regularization parameter to 0, but it didn't work. So for my experiments I trained it without pretraining. This is the confidence regularization parameter, right?

btekin commented 5 years ago

> I tried setting the confidence regularization parameter to 0, but it didn't work. So for my experiments I trained it without pretraining. This is the confidence regularization parameter, right?

Yes, that is the confidence regularization parameter. With pre-training, the convergence might be a bit faster, but without it, it should also be fine.

Great to hear that you resolved the initial problem by switching to another compute platform!

microsoft / singleshotpose

Training on custom object fails #103