final tensor for multiple objects in one cell

First of all, thanks @btekin for your work and for publishing it here. My question is about the relation between a cell and a box. In the paper it reads:

Figure 1. ... The 3D output tensor from our network, which represents for each cell a vector consisting of the 2D corner locations, the class probabilities and a confidence value associated with the prediction.

Overall, our output 3D tensor depicted in Figure 1(e) has dimension S × S × D, where the 2D spatial grid corresponding to the image dimensions has S × S cells and each such cell has a D dimensional vector. Here, D = 9×2 + C + 1, because we have 9 (x_i, y_i) control points, C class probabilities and one confidence value.

When multiple objects are located close to each other in the 3D scene, they are more likely to appear close together in the images or be occluded by each other. In these cases, certain cells might contain multiple objects. To be able to predict the pose of such multiple objects that lie in the same cell, we allow up to 5 candidates per cell and therefore predict five sets of control points per cell.

Maybe my question is easier to understand with an example.

Let's say we have M objects that are very close together and all lie in one cell. What does the final vector for this cell look like? I can think of two options:

(9×2 + C + 1) × M: M full vectors, each having a box, a confidence value and its own set of class probabilities (YOLO-v2-like).
(9×2 + 1) × M + C: M boxes with confidence scores, but only one set of class probabilities per cell (YOLO-v1-like).
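
To make the difference concrete, here is a small sketch that just computes the per-cell vector length under each reading. The function names and the value C = 13 are my own placeholders for illustration, not taken from the paper or the repository; M = 5 matches the "up to 5 candidates per cell" in the quote above.

```python
NUM_POINTS = 9  # control points per box, as stated in the paper


def per_box_classes(num_classes: int, num_boxes: int) -> int:
    """YOLO-v2-like reading: every box carries its own class probabilities."""
    return (NUM_POINTS * 2 + num_classes + 1) * num_boxes


def shared_classes(num_classes: int, num_boxes: int) -> int:
    """YOLO-v1-like reading: one set of class probabilities shared per cell."""
    return (NUM_POINTS * 2 + 1) * num_boxes + num_classes


C, M = 13, 5  # C is an assumed class count; M = 5 candidates per cell
print(per_box_classes(C, M))  # 160
print(shared_classes(C, M))   # 108
```

For M = 1 both readings reduce to D = 9×2 + C + 1, so the two only differ once a cell holds more than one candidate.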
