Hello, I have some questions about the sequential variant (annotated_prtr.ipynb):

1. In the `forward()` function in cell 4, the dimension of `hs` is `[B, person_per_image, f]`. Is `f` here the transformer's dimension (similar to `d_model` in DETR's transformer)?
2. For these two lines of code in preparation for STN feature cropping in cell 4: I am a little bit confused by `person_per_image`, since the number of persons is likely different in each image. Is `hs` here similar to the `hs` in DETR, whose dimension is `[batch_size, num_queries, d_model]`?
3. If I only need the cropped features and don't use the Transformer (`transformer_kpt`) for further keypoint detection, do I have to build positional embeddings (as in cell 3) and apply them to STN feature cropping?

Looking forward to your answers. Thanks in advance!
Thanks for your interest in our paper!
`person_per_image` is constant across images; you can understand it as the maximum number of persons PRTR takes in for each image. Your understanding is correct: it is just like the `hs` in DETR, but it considers only a single class of objects, person.
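A minimal sketch of what such a fixed query budget looks like; `B`, `person_per_image`, and `f` follow the notation in this thread, while the concrete numbers and the `person_queries` embedding are assumptions for illustration, not the notebook's actual code:

```python
import torch
import torch.nn as nn

B, person_per_image, f = 2, 100, 256  # batch size, fixed query budget, model dim

# A fixed set of learned person queries, analogous to DETR's object queries
# but for the single class "person". Surplus queries are expected to predict
# "no person"; the tensor is never shrunk per image.
person_queries = nn.Embedding(person_per_image, f)

# The decoder output hs then has shape [B, person_per_image, f] regardless
# of how many persons each image actually contains.
hs = person_queries.weight.unsqueeze(0).expand(B, -1, -1)
print(hs.shape)  # torch.Size([2, 100, 256])
```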
@likenneth Thanks for your quick response!

Regarding the third question, I am going to first crop the feature maps with STN based on the detected person bounding boxes and then apply some deconv layers for keypoint detection. So I guess I don't need PE, right?
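A rough sketch of the pipeline described above (crop, then a deconv head producing heatmaps); the layer configuration and all sizes are assumptions for illustration, not PRTR's actual head:

```python
import torch
import torch.nn as nn

num_keypoints, feat_dim = 17, 256  # COCO keypoints; feature channels (assumed)

# Hypothetical keypoint head: two stride-2 deconvs upsample the cropped
# patch 4x, then a 1x1 conv emits one heatmap per keypoint.
head = nn.Sequential(
    nn.ConvTranspose2d(feat_dim, 256, kernel_size=4, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, num_keypoints, kernel_size=1),
)

# cropped: [person_per_image * B, feat_dim, y_res, x_res], e.g. the output
# of STN-style cropping; random here for illustration.
cropped = torch.randn(8, feat_dim, 16, 12)
heatmaps = head(cropped)
print(heatmaps.shape)  # torch.Size([8, 17, 64, 48])
```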
Moreover, I have some questions about the multi-layer cropping with STN:
Is `b` here the detection bounding box? What do `i` and `j` stand for?

```python
# STN cropping
y_grid = (h.unsqueeze(-1) @ self.y_interpolate + cy.unsqueeze(-1) * 2 - 1).unsqueeze(-1).unsqueeze(-1)  # [person per image * B, y_res, 1, 1]
x_grid = (w.unsqueeze(-1) @ self.x_interpolate + cx.unsqueeze(-1) * 2 - 1).unsqueeze(-1).unsqueeze(1)  # [person per image * B, 1, x_res, 1]
grid = torch.cat([x_grid.expand(-1, self.y_res, -1, -1), y_grid.expand(-1, -1, self.x_res, -1)], dim=-1)
```
It seems that you do not adopt the localisation net of the Spatial Transformer Network (STN), but only the grid generator and the sampler. I am not clear about the code for grid generation above. It would be very helpful if you could explain the code a little bit more.
Many thanks in advance!
Hi, `b` is the discussed bounding box; `i` and `j` are indices.
Thanks a lot!
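For readers puzzling over the same lines, a self-contained sketch of the grid generation and sampling, assuming `cx, cy, w, h` are box centers and sizes normalized to [0, 1] and `x_interpolate`/`y_interpolate` are fixed linspace row vectors in [-1, 1]; this mirrors the snippet above but is not the notebook's exact code:

```python
import torch
import torch.nn.functional as F

N, C, H, W = 4, 256, 32, 24   # N = person_per_image * B in the notebook
y_res, x_res = 16, 12         # resolution of each cropped patch
feature_map = torch.randn(N, C, H, W)

# Boxes as normalized (cx, cy, w, h) in [0, 1], one per cropped person.
cx, cy, w, h = (torch.rand(N) for _ in range(4))

# Fixed interpolation templates in [-1, 1], grid_sample's coordinate range.
x_interpolate = torch.linspace(-1, 1, x_res).unsqueeze(0)  # [1, x_res]
y_interpolate = torch.linspace(-1, 1, y_res).unsqueeze(0)  # [1, y_res]

# The grid generator: scale the [-1, 1] template by the box size and shift it
# to the box center; 2*c - 1 maps a [0, 1] center into [-1, 1] coordinates,
# and a box of normalized height h spans 2*h there, exactly its image share.
y_grid = (h.unsqueeze(-1) @ y_interpolate + cy.unsqueeze(-1) * 2 - 1).unsqueeze(-1).unsqueeze(-1)  # [N, y_res, 1, 1]
x_grid = (w.unsqueeze(-1) @ x_interpolate + cx.unsqueeze(-1) * 2 - 1).unsqueeze(-1).unsqueeze(1)   # [N, 1, x_res, 1]
grid = torch.cat([x_grid.expand(-1, y_res, -1, -1), y_grid.expand(-1, -1, x_res, -1)], dim=-1)     # [N, y_res, x_res, 2]

# The sampler: bilinearly read the feature map at the grid locations, i.e.
# crop and resize each person's box; the localisation net is unnecessary
# because the boxes come from the detector.
cropped = F.grid_sample(feature_map, grid, align_corners=False)
print(cropped.shape)  # torch.Size([4, 256, 16, 12])
```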