microsoft / singleshotpose

This research project implements a real-time object detection and pose estimation method as described in the paper by Tekin et al., "Real-Time Seamless Single Shot 6D Object Pose Prediction", CVPR 2018 (https://arxiv.org/abs/1711.08848).
MIT License

final tensor for multiple objects in one cell #129

belorenz closed 4 years ago

belorenz commented 4 years ago

First of all, thanks @btekin for your work and for publishing it here. My question is about the relation between a cell and a box. In the paper it reads:

Figure 1. ... The 3D output tensor from our network, which represents for each cell a vector consisting of the 2D corner locations, the class probabilities and a confidence value associated with the prediction.

Overall, our output 3D tensor depicted in Figure 1(e) has dimension S × S × D, where the 2D spatial grid corresponding to the image dimensions has S × S cells and each such cell has a D dimensional vector. Here, D = 9×2 + C + 1, because we have 9 (x_i, y_i) control points, C class probabilities and one confidence value.

When multiple objects are located close to each other in the 3D scene, they are more likely to appear close together in the images or be occluded by each other. In these cases, certain cells might contain multiple objects. To be able to predict the pose of such multiple objects that lie in the same cell, we allow up to 5 candidates per cell and therefore predict five sets of control points per cell.

Maybe my question is easier to understand when I use an example.

Let's say we have M objects so close together that they all lie in one cell. What does the final vector for this cell look like?

  1. (9×2 + C + 1)×M: M full vectors, each with a box, a confidence value, and its own set of class probabilities (YOLO-v2-like).
  2. (9×2 + 1)×M + C: M boxes with confidence scores but only one set of class probabilities per cell (YOLO-v1-like).
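
To make the two options concrete, here is a minimal sketch that computes the per-cell vector length for each. The function name `per_cell_length` and its parameters are hypothetical, for illustration only; M = 5 comes from the paper's "up to 5 candidates", and C = 13 is the OCCLUSION class count mentioned in this thread.

```python
def per_cell_length(num_candidates, num_classes, shared_class_probs=False, num_points=9):
    """Length of the output vector for one grid cell (illustrative helper)."""
    # 9 (x, y) control points plus 1 confidence value per candidate
    coords_and_conf = num_points * 2 + 1
    if shared_class_probs:
        # Option 2: M sets of points + confidence, one class distribution per cell
        return coords_and_conf * num_candidates + num_classes
    # Option 1: every candidate carries its own class probabilities
    return (coords_and_conf + num_classes) * num_candidates


M, C = 5, 13
print(per_cell_length(M, C))                           # option 1
print(per_cell_length(M, C, shared_class_probs=True))  # option 2
```

The two layouts give different channel counts, so the depth of the network's last layer is enough to tell them apart.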
belorenz commented 4 years ago

I just saw that the final layer for the OCCLUSION dataset is:

30 conv 160 1 x 1 / 1 13 x 13 x 1024 -> 13 x 13 x 160

Since there are 13 classes in OCCLUSION, each cell in the final layer has 5 boxes: (9 x 2 + 1 + 13) x 5 = 160. So I guess the first assumption is correct: every candidate has its own control points, confidence value, and class probabilities.
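
As a rough illustration of what that implies, the sketch below splits one cell's 160 values into 5 candidates of 32 values each. The ordering used here (candidate after candidate; within each, 18 coordinates, then 1 confidence, then 13 class scores) is an assumption for the example, not something confirmed in this thread, and `split_cell` is a hypothetical helper; the repository's loss code is the place to check the real layout.

```python
NUM_POINTS, NUM_CLASSES, NUM_CANDIDATES = 9, 13, 5
PER_CANDIDATE = NUM_POINTS * 2 + 1 + NUM_CLASSES  # 18 + 1 + 13 = 32


def split_cell(cell_vector):
    """Split one cell's flat vector into per-candidate parts (assumed ordering)."""
    if len(cell_vector) != PER_CANDIDATE * NUM_CANDIDATES:
        raise ValueError("unexpected cell vector length")
    candidates = []
    for k in range(NUM_CANDIDATES):
        chunk = cell_vector[k * PER_CANDIDATE:(k + 1) * PER_CANDIDATE]
        candidates.append({
            "points": chunk[:NUM_POINTS * 2],             # 9 (x, y) pairs, flattened
            "confidence": chunk[NUM_POINTS * 2],          # single confidence value
            "class_scores": chunk[NUM_POINTS * 2 + 1:],   # one score per class
        })
    return candidates


cell = list(range(160))  # dummy values standing in for one cell of network output
parts = split_cell(cell)
print(len(parts), len(parts[0]["points"]), len(parts[0]["class_scores"]))
```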