nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Why 'base_xyz=cluster_xyz' in 'self.prediction_heads[i](...)' #35

Closed linhaojia13 closed 1 year ago

linhaojia13 commented 1 year ago

In bdetr.py, you set base_xyz=cluster_xyz (the original coordinates of the queries) in the call to self.prediction_heads[i](...). The code is:

            # step box Prediction head
            base_xyz, base_size = self.prediction_heads[i](
                query.transpose(1, 2).contiguous(),     # ([B, F=288, V=256])
                base_xyz=cluster_xyz,                   # ([B, 256, 3])
                end_points=end_points,
                prefix=prefix
            )
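For context, here is a minimal sketch of how such a prediction head plausibly uses base_xyz: regressing an offset from the query features and adding it to the anchor coordinates. This is my assumption for illustration (names, shapes, and internals are hypothetical), not the repo's actual implementation:

```python
import torch
import torch.nn as nn

class ToyPredictionHead(nn.Module):
    """Hypothetical prediction head: regresses a center offset from the
    query features and adds it to the anchor coordinates base_xyz."""
    def __init__(self, feat_dim=288):
        super().__init__()
        self.offset_mlp = nn.Linear(feat_dim, 3)  # center offset (x, y, z)
        self.size_mlp = nn.Linear(feat_dim, 3)    # box size (w, l, h)

    def forward(self, query_feats, base_xyz):
        # query_feats: (B, F, V) as in the snippet above
        feats = query_feats.transpose(1, 2)       # (B, V, F)
        new_xyz = base_xyz + self.offset_mlp(feats)  # refine around the anchor
        new_size = self.size_mlp(feats)
        return new_xyz, new_size

B, F, V = 2, 288, 256
head = ToyPredictionHead(F)
query = torch.randn(B, F, V)
cluster_xyz = torch.randn(B, V, 3)
xyz, size = head(query, base_xyz=cluster_xyz)
print(xyz.shape, size.shape)  # torch.Size([2, 256, 3]) torch.Size([2, 256, 3])
```

Under this reading, whatever is passed as base_xyz is the anchor that every layer's offsets are measured from.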

However, for self.decoder[i](...), query_pos is generated from the base_xyz output by the previous self.prediction_heads[i] layer. The code is:

        for i in range(self.num_decoder_layers):
            prefix = 'last_' if i == self.num_decoder_layers-1 else f'{i}head_'

            # Position Embedding for Self-Attention
            if self.self_position_embedding == 'none':
                query_pos = None
            elif self.self_position_embedding == 'xyz_learned':
                query_pos = base_xyz
            elif self.self_position_embedding == 'loc_learned':
                query_pos = torch.cat([base_xyz, base_size], -1)
            else:
                raise NotImplementedError

            # step Transformer Decoder Layer
            query = self.decoder[i](
                query, points_features.transpose(1, 2).contiguous(),
                text_feats, query_pos,
                query_mask,
                text_padding_mask,
                detected_feats=(
                    detected_feats if self.butd
                    else None
                ),
                detected_mask=detected_mask if self.butd else None
            )  # (B, V, F)

The code for self.prediction_heads[i] implies that the prediction head at every layer refines the original query coordinates, whereas the code for self.decoder[i](...) indicates that each decoder layer refines the coordinates output by the previous layer. The two modeling processes are therefore inconsistent. I think the prediction head's argument should be changed, specifically to base_xyz=base_xyz in self.prediction_heads[i](...). What are your thoughts on this suggestion?
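To make the contrast concrete, here is a toy numerical sketch of the two update schemes (offsets[i] stands in for the offset regressed by layer i's head; the values are arbitrary assumptions for illustration):

```python
import torch

# Three layers, each regressing an offset of 0.1 per coordinate (arbitrary).
cluster_xyz = torch.zeros(1, 4, 3)          # original query coordinates
offsets = [torch.full((1, 4, 3), 0.1) for _ in range(3)]

# Scheme A (current code): every head is re-anchored to cluster_xyz,
# so each layer predicts an absolute correction to the original coordinates.
a_xyz = cluster_xyz
for o in offsets:
    a_xyz = cluster_xyz + o                 # always relative to cluster_xyz

# Scheme B (the suggestion): each head refines the previous prediction,
# so corrections accumulate across layers.
b_xyz = cluster_xyz
for o in offsets:
    b_xyz = b_xyz + o                       # iterative refinement

print(a_xyz[0, 0], b_xyz[0, 0])             # A stays at 0.1, B accumulates to 0.3
```

In scheme A the final coordinates depend only on the last layer's offset, while in scheme B they are the sum of all layers' offsets, which is the usual iterative-refinement formulation.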

ayushjain1144 commented 1 year ago

You're right; this seemingly unintuitive design comes from Group-Free (see a related issue). We did try changing it to what you suggest, but I believe it resulted in a ~1% drop in performance.

linhaojia13 commented 1 year ago

Thank you for providing the relevant experimental results! I may explore this issue in the near future, so let's keep it open for now.