j96w / DenseFusion

"DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion" code repository
https://sites.google.com/view/densefusion
MIT License

pred = torch.add(torch.bmm(model_points, base), points + pred_t), why? #119

Open fabiopoiesi opened 4 years ago

fabiopoiesi commented 4 years ago

Hi,

I am trying to figure out the reasoning behind this prediction: pred = torch.add(torch.bmm(model_points, base), points + pred_t)

- model_points (1x500x3) is the sub-sampled ground-truth point cloud of the object (e.g. 500 points) in its original coordinate system.
- base (500x3x3) is the pixel-wise rotation prediction.
- points (500x1x3) is the point cloud obtained from the depth image through the camera intrinsics; it is a partial view of the object, i.e. 500 "transformed depth values".
- pred_t (500x1x3) is the pixel-wise predicted translation, i.e. 500 predictions.
- torch.bmm(model_points, base) (500x500x3) produces 500 rotated versions of model_points based on the pixel-wise predicted rotations.
- pred (500x500x3) is the pixel-wise rotated point cloud prediction.

Why is each element of points summed with a predicted translation? In the paper it is written: "we will train this network to predict one pose from each densely-fused feature". Here a predicted translation is summed with a single transformed depth value, that is, a point of the point cloud inferred from the depth image.

Why is 'points + pred_t' added to the whole point cloud of the object? Here a single transformed depth value is used as the translation for the rotated ground-truth point cloud.
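
For concreteness, this is a minimal shape check of that line with random tensors (names and shapes as above; I am assuming model_points has already been repeated once per selected pixel, as I believe the loss code does before the bmm):

```python
import torch

num_p = 500           # selected pixels / depth points
num_point_mesh = 500  # sub-sampled model points

model_points = torch.rand(num_p, num_point_mesh, 3)  # model cloud, repeated per pixel
base = torch.rand(num_p, 3, 3)                       # per-pixel predicted rotation matrix
points = torch.rand(num_p, 1, 3)                     # per-pixel point from the depth image
pred_t = torch.rand(num_p, 1, 3)                     # per-pixel predicted translation

rotated = torch.bmm(model_points, base)              # (500, 500, 3): 500 rotated model clouds
pred = torch.add(rotated, points + pred_t)           # (500, 1, 3) broadcasts over the model points

print(rotated.shape, (points + pred_t).shape, pred.shape)
# torch.Size([500, 500, 3]) torch.Size([500, 1, 3]) torch.Size([500, 500, 3])
```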

roywithfiringblade commented 4 years ago

check #113 and #7

fabiopoiesi commented 4 years ago

Sorry, but the comments you pointed to do not answer my doubts. I had already guessed that it is more robust to predict an offset relative to the depth measurements rather than to estimate the absolute translation.

EDIT: I am not sure I expressed myself properly. points are the 500 points of the point cloud extracted/chosen from the depth image; pred_t are 500 predicted translations, but as far as I understood from the paper they should represent 500 object pose candidates. In the expression points + pred_t, why is each single predicted translation in pred_t summed with each single point of the point cloud in points? This means that each point of the point cloud is translated by a different value. I instead understood that pred_t should contain 500 different translations of the (whole) object pose, so I was expecting the points in points to be translated by each single predicted translation in pred_t.

Is it just a workaround to have some depth values from which to predict an extra shift? Like an anchor?
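
To make the doubt concrete, here is a sketch with made-up tensors of the two readings: what the code does (every point gets its own shift) versus what I expected from the paper (each of the 500 candidate translations applied to the whole cloud):

```python
import torch

num_p = 500
points = torch.rand(num_p, 1, 3)   # depth points, one per selected pixel
pred_t = torch.rand(num_p, 1, 3)   # per-pixel predicted translation

# What the code does: point i is shifted by its own pred_t[i]
per_point_shift = points + pred_t                            # (500, 1, 3)

# What I expected: each candidate translation applied to the whole cloud,
# i.e. 500 translated copies of the 500 depth points
whole_cloud_candidates = points.view(1, num_p, 3) + pred_t   # (500, 500, 3)

print(per_point_shift.shape, whole_cloud_candidates.shape)
```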

greatwallet commented 4 years ago

I think that's because points itself can be viewed not only as a form of the model's point cloud in the camera's coordinate frame, but also as a sort of absolute translation.

zuoligang1997 commented 4 years ago

Do you understand the problem now? If so, could you explain it?

JiChun-Wang commented 3 years ago

Here is how I think about it:

First, the true translation is a vector from the origin of the camera frame to the origin of the object frame;

Second, each element of points is a vector from the camera origin to an observed point, expressed in the camera frame, while each element of pred_t is a vector from that observed point to the origin of the object frame;

So, (points + pred_t) is the sum of the above two vectors, and the result is exactly the real translation vector.
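
A tiny numerical sketch of this idea (the rotation, translation and model point are made up; pred_t_ideal stands for the offset the network would ideally predict, i.e. the vector from the observed point back to the object origin):

```python
import torch

# Made-up ground-truth pose of the object in the camera frame
R = torch.tensor([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])        # 90 degree rotation about z
t = torch.tensor([0.1, 0.2, 0.8])        # true translation: camera origin -> object origin

m = torch.tensor([0.03, 0.00, 0.01])     # one model point in the object frame
p = R @ m + t                            # the same point as observed in the camera frame

# Ideal per-point prediction: the offset from the observed point back to the object origin
pred_t_ideal = t - p

# points + pred_t recovers the true translation for this pixel
print(torch.allclose(p + pred_t_ideal, t))  # True
```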