microsoft / singleshotpose

This research project implements a real-time object detection and pose estimation method as described in the paper, Tekin et al. "Real-Time Seamless Single Shot 6D Object Pose Prediction", CVPR 2018. (https://arxiv.org/abs/1711.08848).
MIT License

Training on custom object fails #103

Closed danieldimit closed 5 years ago

danieldimit commented 5 years ago

Hi there, I've been trying to train a model on a custom object for a while now. After many failed attempts, I decided to approach it incrementally.

  1. I trained a model on the ape from the LINEMOD dataset, and it worked well.
  2. I generated a dataset of the ape object with a picture from LINEMOD as the background. It also worked well (link to dataset).
  3. I then took the same setup from step 2 and changed only the 3D model in the dataset generation tool to a 3D model of a PSP (and updated the diameter in the new psp.data) (link to dataset). This failed, and I have no idea why. I guessed it could be because the PSP is much thinner and its bounding box is not a cube, so I stretched the 3D model to be closer to a cube, but that didn't help. I also tried different initial weights and learning rates, without success.

Here is a comparison of the two datasets I created: github_info

And the worst part is that the PSP training stops because of this exception:

Has anyone else had this problem? What could the potential causes be? I would be glad to hear from anyone. Thanks in advance!

P.S. I've used the correct diameter in the *.data file, so that shouldn't be the problem. My label files should also be correct; otherwise the dataset from step 2 wouldn't have worked. You can see the keypoint numbering in the picture.

danieldimit commented 5 years ago

Update: I figured out when the NaN problem appears. It happens when the object in the dataset is too close to the camera. Notice that in the right picture the object is much closer. I am not sure why this happens, but I suspect it has to do with the data augmentation. My guess is that the augmentation scales the image, and when the object already fills most of the frame, zooming in pushes parts of it out of bounds, which leads to NaN.
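To test this hypothesis on a dataset, one option is to scan the label files for keypoints that already sit outside the image before any augmentation is applied. The snippet below is only a minimal sketch: it assumes each label line holds a class id followed by 9 normalized (x, y) keypoints (centroid plus 8 bounding-box corners) and then the x and y range, and the function name `out_of_bounds_keypoints` is made up for illustration.

```python
def out_of_bounds_keypoints(label_line, num_keypoints=9):
    """Return (index, x, y) for every keypoint outside the [0, 1] image range.

    Assumed layout: class id, then num_keypoints normalized (x, y) pairs,
    then any trailing values (e.g. x range, y range), which are ignored.
    """
    values = [float(v) for v in label_line.split()]
    coords = values[1:1 + 2 * num_keypoints]
    if len(coords) < 2 * num_keypoints:
        raise ValueError("label line has too few values")

    bad = []
    for i in range(num_keypoints):
        x, y = coords[2 * i], coords[2 * i + 1]
        # NaN fails both comparisons, so it is flagged as well.
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            bad.append((i, x, y))
    return bad


if __name__ == "__main__":
    example = ("0 0.5 0.5 0.1 0.1 0.9 0.1 0.1 0.9 0.9 0.9 "
               "0.2 0.2 0.8 0.2 0.2 0.8 1.2 -0.1 0.7 0.7")
    print(out_of_bounds_keypoints(example))
```

An empty result for every label does not rule out the augmentation as the cause, since scaling or translation can still move keypoints out of the frame afterwards.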

I've now created datasets where the object is further away: one with the PSP (1200 images) and one with a GUITAR (1200 images). Both of them learn, but not nearly as fast as the synthetic APE dataset. The APE dataset reached a 5px accuracy of 86% at epoch 100. The PSP dataset has a 5px accuracy of 13% at epoch 590. The GUITAR dataset has a 5px accuracy of 8% at epoch 350. I will leave the PSP and GUITAR training running over the weekend to see how much the accuracy improves.
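For reference, the "5px acc" numbers above can be read as the fraction of test images whose mean 2D reprojection error is below 5 pixels. The following is a minimal sketch under that assumption, not the repository's evaluation code; `pixel_accuracy` and its arguments are hypothetical names, and the inputs are assumed to be already-projected 2D points in pixel coordinates.

```python
import math


def pixel_accuracy(pred_points, gt_points, threshold_px=5.0):
    """Fraction of images whose mean 2D point error is below threshold_px.

    pred_points and gt_points are lists with one entry per image; each entry
    is a list of (x, y) pixel coordinates in matching order.
    """
    if len(pred_points) != len(gt_points) or not pred_points:
        raise ValueError("need the same non-zero number of images in both lists")

    correct = 0
    for pred, gt in zip(pred_points, gt_points):
        # Euclidean distance for each predicted/ground-truth point pair.
        errors = [math.hypot(px - gx, py - gy)
                  for (px, py), (gx, gy) in zip(pred, gt)]
        if sum(errors) / len(errors) < threshold_px:
            correct += 1
    return correct / len(pred_points)


if __name__ == "__main__":
    pred = [[(0.0, 0.0), (3.0, 4.0)], [(10.0, 0.0)]]
    gt = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0)]]
    print(pixel_accuracy(pred, gt))  # first image passes, second does not
```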

btekin commented 5 years ago

Hi @danieldimit, thanks for your question and for reporting the convergence issue that occurs when the object is so large that it takes up most of the image.

About the accuracy numbers:

danieldimit commented 5 years ago

Hi @btekin, I've solved the problem. It wasn't connected to your method. It turned out that something was wrong with the GPU/CUDA setup on the servers at work (probably a mismatch between the GPU, the CUDA version, and the driver version). After trying for a month and getting inconsistent results (the problem could occur after 10 or after 10000 iterations), I decided to try a Google Cloud GPU and bingo: with the same code, the same dataset, and the same training configuration, it worked like a charm on the first try.
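For anyone who suspects a similar setup problem, a quick way to compare two machines is to print the versions PyTorch reports on each. This is just a rough sketch assuming a PyTorch-based training environment; `describe_cuda_setup` is a made-up helper name, and the driver version itself is not covered here (it would have to be checked separately, e.g. with `nvidia-smi`).

```python
import importlib.util
import platform


def describe_cuda_setup():
    """Collect version info that is useful when comparing GPU environments."""
    info = {"python": platform.python_version(), "torch": None}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment

    import torch

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda          # CUDA version PyTorch was built with
    info["cudnn"] = torch.backends.cudnn.version()   # None if cuDNN is unavailable
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_setup().items():
        print(f"{key}: {value}")
```

Running it on both the failing and the working machine and diffing the output gives a first hint about where the environments differ.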

I tried setting the confidence regularization parameter to 0, but it didn't work. So for my experiments I trained it without pretraining. This is the confidence regularization parameter, right?

btekin commented 5 years ago

> I tried setting the confidence regularization parameter to 0, but it didn't work. So for my experiments I trained it without pretraining. This is the confidence regularization parameter, right?

Yes, that is the confidence regularization parameter. With pretraining, convergence might be a bit faster, but training without it should also work fine.

Great to hear that you resolved the initial problem by switching to another compute platform!