microsoft / singleshotpose

This research project implements a real-time object detection and pose estimation method as described in the paper, Tekin et al. "Real-Time Seamless Single Shot 6D Object Pose Prediction", CVPR 2018. (https://arxiv.org/abs/1711.08848).
MIT License

Low 5px 2D accuracy, but high 3D Transformation accuracy #93

Open jgcbrouns opened 5 years ago

jgcbrouns commented 5 years ago

Hi everyone,

I created a synthetic dataset of a household product and trained on it for 1400 epochs (not 700). Interesting to note is that the cases where predictions fail are often at specific angles where some corner points are not visible because they are occluded by the object itself. Could this have something to do with the overall poor accuracy?

The dataset consists of 7450 images, of which 80% is used for training. As you can see, the 5px 2D accuracy is low, which contrasts with my earlier tests (with different models).
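For clarity on what the two numbers in the issue title measure, the two standard metrics can be sketched as below. Array names and thresholds are illustrative (5 px reprojection error, ADD below 10% of the model diameter are the common conventions); this is not the repo's exact evaluation code:

```python
import numpy as np

def reproj_accuracy(pred_2d, gt_2d, thresh_px=5.0):
    """2D projection metric: a pose counts as correct when the mean
    pixel distance between predicted and ground-truth projections of
    the model points is below thresh_px (commonly 5 px)."""
    # pred_2d, gt_2d: (N, M, 2) projections of M model points in N images
    err = np.linalg.norm(pred_2d - gt_2d, axis=-1).mean(axis=-1)
    return (err < thresh_px).mean()

def add_accuracy(model_pts, pred_RT, gt_RT, diameter, frac=0.1):
    """ADD metric (3D transformation accuracy): mean 3D distance between
    model points transformed by the predicted and ground-truth poses;
    correct when below frac * model diameter."""
    correct = 0
    for (Rp, tp), (Rg, tg) in zip(pred_RT, gt_RT):
        p = model_pts @ Rp.T + tp
        g = model_pts @ Rg.T + tg
        if np.linalg.norm(p - g, axis=-1).mean() < frac * diameter:
            correct += 1
    return correct / len(pred_RT)
```

With a symmetric, occluded object it is quite possible for many poses to pass the ADD threshold while failing the tighter 5 px reprojection test, which matches the pattern in the title.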

What I don't understand, @btekin, is how you managed to reach such high accuracies on Linemod with only 200 training images. The images in Linemod are of low quality and the objects are not exactly rich in features... When I train a model with only 200 training images, I often do not reach accuracies higher than 15%! You speak about pretraining for coordinate prediction by setting the confidence to 0, but then the model would never save its weights, because the accuracy never goes above 0%. Or should I just save the model every 10 epochs or so?
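One pragmatic answer to the checkpointing question is to save unconditionally every N epochs during the confidence-0 pretraining stage, since the accuracy criterion never triggers there. A minimal sketch of that scheduling (the names and the 10-epoch interval are placeholders, not the repo's actual training loop; `save_fn` would be `torch.save` in practice):

```python
import os

def maybe_checkpoint(state_dict, epoch, out_dir, save_fn, every=10):
    """Save a checkpoint every `every` epochs regardless of the
    evaluation accuracy (useful when confidence pretraining keeps
    the accuracy metric pinned at 0%). Returns the saved path, or
    None when this epoch is skipped."""
    if epoch % every == 0:
        path = os.path.join(out_dir, f"epoch_{epoch:03d}.pth")
        save_fn(state_dict, path)  # e.g. torch.save in a real loop
        return path
    return None
```

The best-accuracy checkpoint logic can then take over once the confidence loss is switched back on.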

What could be the cause of this?

One possible explanation could be that this object's corners all lie outside of the object. But technically the network should regress these point estimates just as well as coordinates that lie on the object itself, shouldn't it?

Some side notes:

@btekin as always I look forward to hearing from you!

jgcbrouns commented 5 years ago

Note that this 3D model has reflective properties: its black surface is highly reflective. I use a picture wall filled with random pictures from the VOC dataset, and a cubemap + shader to generate random reflections on my 3D model. This could influence the model's ability to learn properly. However, when I turn reflections off and replace them with matte surface properties, the results are the same.

danieldimit commented 5 years ago

Hi @jgcbrouns, you know more about this method than me, so I can't help you. I'm interested in what learning rate and initial weights you used. It would also be very kind of you if you could provide a link to your dataset and .ply object, so I could test with it too.

btekin commented 5 years ago

I wonder how you chose your training data. Was it by random sampling? I think it would be important to sample the viewpoints so that you get large coverage of the different viewpoints and distances to your object. We used the same training split as the BB8 paper (Rad & Lepetit, ICCV'17), which was, I think, sampled such that the objects are seen from various different angles (this allows using a smaller number of training images while still reaching reasonably high accuracy). They explain their selection of training images in their paper as follows; it might be helpful in your case as well:

The training images are selected as in [2], such that the relative orientation between them should be larger than a threshold.
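That selection rule can be sketched as a greedy filter over the ground-truth rotations: keep a view only if it is sufficiently far, in relative orientation, from every view already kept. The 15° threshold and function names below are illustrative assumptions, not BB8's exact procedure:

```python
import numpy as np

def relative_angle(R1, R2):
    """Angle (radians) of the relative rotation between two poses,
    via the trace identity cos(theta) = (tr(R1^T R2) - 1) / 2."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def select_views(rotations, min_angle_deg=15.0):
    """Greedily keep a view only if its relative orientation to every
    already-selected view exceeds the threshold. Returns indices."""
    thresh = np.deg2rad(min_angle_deg)
    selected = []
    for i, R in enumerate(rotations):
        if all(relative_angle(R, rotations[j]) > thresh for j in selected):
            selected.append(i)
    return selected
```

Applied to a randomly rendered synthetic set, this discards near-duplicate viewpoints and spreads the training budget over the view sphere.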

The visualizations also show that the object does not contain much texture and is somewhat symmetrical. Given this, I would consider the qualitative results and the overall accuracy reasonable. Of course, it would be possible to get better convergence by tweaking the parameters of the network; however, I would not expect batch size to have a big influence on the accuracy. There are a few relevant studies that predict semantic keypoints on the object, which are then used for pose estimation. I think predicting points on the object, rather than bounding box corners, would be worth a try.
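As an illustration of that keypoint idea: surface-keypoint methods typically pick points spread over the object mesh, for example with farthest point sampling, instead of the 3D bounding-box corners (which, as noted above, can lie far outside the object). A minimal sketch of farthest point sampling over mesh vertices, not this repo's code:

```python
import numpy as np

def farthest_point_sampling(verts, k, seed=0):
    """Pick k well-spread keypoints from a mesh vertex array (V, 3):
    start from a random vertex, then repeatedly add the vertex farthest
    from the current keypoint set."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(verts)))]
    # dist[i] = distance from vertex i to the nearest selected keypoint
    dist = np.linalg.norm(verts - verts[idx[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())
        idx.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(verts - verts[nxt], axis=1))
    return np.array(idx)
```

Keypoints chosen this way always lie on the object surface, so their 2D projections stay inside the silhouette and may be easier to regress for a textureless, partly symmetric object.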