megvii-research / PETR

[ECCV2022] PETR: Position Embedding Transformation for Multi-View 3D Object Detection & [ICCV2023] PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images

3D Position Encoder implementation question in code. #114

Open Capchenxi opened 1 year ago

Capchenxi commented 1 year ago

To whom it may concern,

I really appreciate your work on PETR! I have a question about the implementation of the 3D Position Encoder. In Section 3.3 of the paper, F_3d is described as the output of a function of F_2d and P_3d. However, when I look at the code in petr_head.py, the final position embedding is obtained from the sin_position_encoding of the mask and the 3D positional embedding. To my understanding, the mask corresponds to the 2D position rather than the 2D feature, which means no 2D feature is used in the final 3D position encoder module. If I want to be consistent with the paper, the mask could be substituted by mlvl_feats to reproduce the position-aware features. Is this correct? Thanks.

yingfei1016 commented 1 year ago

Hi, thank you for your interest. The 3D positional embedding is not an independent module in the code. In fact, the 2D features are added to the 3D PE inside the multi-head attention layer (https://github.com/open-mmlab/mmcv/blob/v1.4.0/mmcv/cnn/bricks/transformer.py#L178).

_The final position embedding is obtained from the sin_position_encoding of the mask and the 3D positional embedding._ In the paper we ran an ablation on this: adding sin_position_encoding further increases performance by 0.5%.
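For readers following along, here is a minimal sketch of the data flow described above. It is not the PETR or mmcv code: it is a dependency-free, single-head toy that assumes the attention layer takes `query_pos` and `key_pos` arguments and adds them to the query and key before computing attention weights, while the value stays as the raw 2D feature. The toy tensors `feats_2d`, `pe_3d`, and `sin_pe` and the helpers `add`, `softmax`, and `attend` are hypothetical names made up for illustration.

```python
import math


def add(a, b):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, key, value, query_pos=None, key_pos=None):
    """Single-head attention; positional terms are added to query/key only."""
    if query_pos is not None:
        query = add(query, query_pos)
    if key_pos is not None:
        key = add(key, key_pos)  # 2D features + position embedding
    scale = 1.0 / math.sqrt(len(key[0]))
    out = []
    for q in query:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        weights = softmax(scores)
        # The value is the raw 2D feature, without any positional term.
        out.append([sum(w * v[d] for w, v in zip(weights, value))
                    for d in range(len(value[0]))])
    return out


# Toy data (assumed): 3 image tokens with 2 channels each.
feats_2d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # flattened 2D image features
pe_3d = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]]       # 3D positional embedding
sin_pe = [[0.1, 0.1], [0.1, -0.1], [-0.1, 0.1]]    # sine encoding of the mask

# Final position embedding = 3D PE + sine encoding.
pos_embed = add(pe_3d, sin_pe)

queries = [[1.0, 0.0]]                             # one object query
out = attend(queries, feats_2d, feats_2d, key_pos=pos_embed)
```

In this sketch the position embedding only changes which image tokens the query attends to. The aggregated output is still a weighted mix of the 2D features, which is one way to read the statement that the 2D features and the 3D PE are combined inside the attention layer rather than in a separate module.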

1yuechu1 commented 4 months ago

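Hello, I noticed that in the paper the 3D coordinates are passed through two MLP structures, but in the code the 2D features and the 3D position information appear to be added together directly. So is the MLP structure not used in the code? Thanks.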