mlpc-ucsd / PRTR

(CVPR 2021) PRTR: Pose Recognition with Cascade Transformers
Apache License 2.0

Question about the sequential variant #6

Closed EckoTan0804 closed 3 years ago

EckoTan0804 commented 3 years ago

Hello, I have some questions about the sequential variant (annotated_prtr.ipynb):

  1. In the forward() function in cell 4, the dimension of hs is [B, person_per_image, f]. Is f here the transformer's dimension (similar to d_model in DETR's transformer)?

  2. For these two lines of code in preparation for STN feature cropping in cell 4:

    person_per_image = hs.size(1)
    num_person = person_per_image * hs.size(0)

    I am a little confused by person_per_image, since the number of persons likely differs from image to image. Is hs here similar to hs in DETR, whose dimension is [batch_size, num_queries, d_model]?

  3. If I only need the cropped features and don't use the Transformer (transformer_kpt) for further keypoint detection, do I have to build positional embeddings (as in cell 3) and apply them during STN feature cropping?

Looking forward to your answers. Thanks in advance!

likenneth commented 3 years ago

Thanks for your interest in our paper!

  1. Yes.
  2. person_per_image is constant across images; you can understand it as the maximum number of persons PRTR takes in for each image. Your understanding is correct: it is just like hs in DETR, but considering only a single class of objects, person.
  3. It depends. If you want features of the same spatial size, you need to do STN; if you are going to feed the features into some space/order-insensitive architecture like a Transformer, you want to add PE (see the sketch below).
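
For illustration, a minimal sketch (not taken from the PRTR code) of adding a fixed 2D sine positional embedding to STN-cropped features before flattening them into Transformer tokens; the helper sine_pos_embed_2d and all shapes below are illustrative assumptions:

```python
import math
import torch

def sine_pos_embed_2d(h, w, dim):
    # Illustrative helper: fixed 2D sine/cosine positional embedding of shape [dim, h, w],
    # half of the channels encoding the y coordinate and half the x coordinate.
    assert dim % 4 == 0
    half = dim // 2
    freq = torch.exp(torch.arange(0, half, 2).float() * (-math.log(10000.0) / half))  # [half/2]
    y = torch.arange(h).float().unsqueeze(1) * freq              # [h, half/2]
    x = torch.arange(w).float().unsqueeze(1) * freq              # [w, half/2]
    pe_y = torch.cat([y.sin(), y.cos()], dim=1)                  # [h, half]
    pe_x = torch.cat([x.sin(), x.cos()], dim=1)                  # [w, half]
    pe = torch.cat([pe_y.unsqueeze(1).expand(h, w, half),
                    pe_x.unsqueeze(0).expand(h, w, half)], dim=-1)  # [h, w, dim]
    return pe.permute(2, 0, 1)                                   # [dim, h, w]

# cropped: STN-cropped features, one entry per detected person (toy shapes)
num_person, C, y_res, x_res = 2, 256, 16, 12
cropped = torch.randn(num_person, C, y_res, x_res)
pe = sine_pos_embed_2d(y_res, x_res, C)                          # [C, y_res, x_res]
tokens = (cropped + pe).flatten(2).permute(0, 2, 1)              # [num_person, y_res*x_res, C]
# tokens can now go to a Transformer; without PE it cannot tell which
# spatial location each token came from.
```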
EckoTan0804 commented 3 years ago

@likenneth Thanks for your quick response!

Regarding the third question, I am going to first crop the feature maps with STN based on the detected person bounding boxes and then apply some deconvolution (DECONV) layers for keypoint detection. So I guess I don't need PE, right?
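
A minimal sketch of such a deconvolution keypoint head (SimpleBaseline-style; the layer widths, num_keypoints = 17, and input shapes below are illustrative assumptions, not from the PRTR code):

```python
import torch
import torch.nn as nn

# Upsamples cropped [num_person, C, y_res, x_res] features and predicts one heatmap per keypoint.
num_keypoints, C = 17, 256
head = nn.Sequential(
    nn.ConvTranspose2d(C, 256, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, num_keypoints, kernel_size=1),
)

cropped = torch.randn(2, C, 16, 12)   # e.g. STN-cropped features for 2 persons
heatmaps = head(cropped)              # [2, num_keypoints, 64, 48]
```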

Moreover, I have some questions about the multi-layer cropping with STN:

  1. In the paper, Sec. 3.3, does b in the cropping equations denote the detected bounding box, and are i and j simply indices?
  2. Regarding the code of STN cropping in cell 4 of annotated_prtr.ipynb:
# STN cropping
y_grid = (h.unsqueeze(-1) @ self.y_interpolate + cy.unsqueeze(-1) * 2 - 1).unsqueeze(-1).unsqueeze(-1)  # [person per image * B, y_res, 1, 1]
x_grid = (w.unsqueeze(-1) @ self.x_interpolate + cx.unsqueeze(-1) * 2 - 1).unsqueeze(-1).unsqueeze(1)  # [person per image * B, 1, x_res, 1]
grid = torch.cat([x_grid.expand(-1, self.y_res, -1, -1), y_grid.expand(-1, -1, self.x_res, -1)], dim=-1)

It seems that you do not adopt the localisation net of the Spatial Transformer Network (STN), but only the grid generator and the sampler. I am not clear about the code for grid generation above. It would be very helpful if you could explain the code a little more.

Many thanks in advance!

likenneth commented 3 years ago

Hi,

  1. Yes, b is the discussed bounding box, and i and j are indices.
  2. The localisation net in the STN paper is used to predict the (possibly skewed) bounding box, but here we already have the detected bounding boxes, so it is not needed; only the grid generator and the sampler are used. See the sketch below for how the grid generation works.
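
For illustration, a minimal, self-contained sketch of this kind of grid generation plus F.grid_sample cropping; the boxes, the linspace offsets standing in for x_interpolate/y_interpolate, and all shapes are illustrative assumptions, not the exact notebook values:

```python
import torch
import torch.nn.functional as F

# Toy setup: one backbone feature map, two detected persons.
# Boxes are (cx, cy, w, h), normalized to [0, 1] of the image size.
B, C, H, W = 1, 256, 32, 24
feat = torch.randn(B, C, H, W)
boxes = torch.tensor([[0.5, 0.5, 0.4, 0.6],
                      [0.3, 0.7, 0.2, 0.3]])                          # [num_person, 4]
cx, cy, w, h = boxes.unbind(-1)                                       # each [num_person]
y_res, x_res = 16, 12

# Fixed, evenly spaced offsets; scaled by the box height/width they sweep the box extent
# in grid_sample's [-1, 1] coordinate frame (a box of width w covers a range of 2*w there).
y_interpolate = torch.linspace(-1, 1, y_res).unsqueeze(0)             # [1, y_res]
x_interpolate = torch.linspace(-1, 1, x_res).unsqueeze(0)             # [1, x_res]

# Box centers in [0, 1] are mapped to [-1, 1] with c * 2 - 1, then the offsets are added.
y_grid = h.unsqueeze(-1) @ y_interpolate + cy.unsqueeze(-1) * 2 - 1   # [num_person, y_res]
x_grid = w.unsqueeze(-1) @ x_interpolate + cx.unsqueeze(-1) * 2 - 1   # [num_person, x_res]

# Broadcast to a full sampling grid; grid_sample expects [N, H_out, W_out, 2] with (x, y) last.
grid = torch.stack(torch.broadcast_tensors(x_grid[:, None, :],        # varies along x
                                           y_grid[:, :, None]),       # varies along y
                   dim=-1)                                            # [num_person, y_res, x_res, 2]

cropped = F.grid_sample(feat.expand(len(boxes), -1, -1, -1), grid, align_corners=False)
print(cropped.shape)                                                  # [num_person, C, y_res, x_res]
```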
EckoTan0804 commented 3 years ago

Thanks a lot!