megvii-research / MOTRv2

[CVPR2023] MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors

attn_mask not working #6

Closed fengxiuyaun closed 1 year ago

fengxiuyaun commented 1 year ago

https://github.com/megvii-research/MOTRv2/blob/be49b7336218e470c9ebcd34be54fe7eec702675/models/motr.py#L672

When gtboxes is None versus when it contains noised GT boxes, the values of frame_res['pred_logits'] differ a lot.

fengxiuyaun commented 1 year ago


That line applies an attn_mask, so the confidences output for the first few queries in frame_res['pred_logits'] should be roughly the same in both cases, but right now they differ a lot. It looks like the noised GT boxes are affecting the first few queries.

fengxiuyaun commented 1 year ago


This is self-attention here; can an attn_mask even be used? Did the authors get the code wrong?

zyayoung commented 1 year ago


Thanks for the question. Could you describe the reproduction steps and code in more detail? We just checked that:

  1. Whether gtboxes is None or contains noised GT boxes does not affect the hidden-state values produced by the track/proposal queries.
  2. Changing a denoising query changes the hidden states of the denoising queries, but still does not affect the hidden states produced by the track/proposal queries.

Both points are as expected. Applying attn_mask on self-attention follows DN-DETR.
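
For reference, a minimal sketch of such a DN-DETR-style mask (the function name and the [track/proposal | denoising] query layout are illustrative assumptions, not the exact MOTRv2 code). True entries mean "query i may not attend to key j", so the noised-GT queries stay invisible to the track/proposal queries:

import torch

def build_dn_attn_mask(num_matching: int, num_dn: int) -> torch.Tensor:
    # True = "query i may NOT attend to key j" (nn.MultiheadAttention convention)
    # assumed layout: [track/proposal queries | denoising (noised-GT) queries]
    total = num_matching + num_dn
    mask = torch.zeros(total, total, dtype=torch.bool)
    # block the matching part from seeing the denoising part,
    # so GT information cannot leak into the track/proposal hidden states
    mask[:num_matching, num_matching:] = True
    return mask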


Code

# first forward pass with the original query embeddings
hs, init_reference, inter_references, enc_outputs_class, enc_outputs_coord_unact = \
    self.transformer(srcs, masks, pos, query_embed, ref_pts=ref_pts,
                     mem_bank=track_instances.mem_bank, mem_bank_pad_mask=track_instances.mem_padding_mask, attn_mask=attn_mask)
# heavily perturb the last query embedding (a denoising query)
query_embed[-1].add_(-10000)
print(hs[-1])  # hidden states from the unperturbed pass
breakpoint()
# second forward pass with the perturbed denoising query
hs, init_reference, inter_references, enc_outputs_class, enc_outputs_coord_unact = \
    self.transformer(srcs, masks, pos, query_embed, ref_pts=ref_pts,
                     mem_bank=track_instances.mem_bank, mem_bank_pad_mask=track_instances.mem_padding_mask, attn_mask=attn_mask)
print(hs[-1])  # only the trailing (denoising) rows change; the track/proposal rows are identical
breakpoint()

Result

-> self.transformer(srcs, masks, pos, query_embed, ref_pts=ref_pts,
(Pdb) c
tensor([[[-0.0473, -0.1809, -0.1649,  ..., -0.3847,  0.4224, -0.1286],
         [-0.0995, -0.0424,  0.1443,  ..., -0.4425,  0.0478, -0.1900],
         [ 0.0369,  0.0160,  0.1610,  ..., -0.4706, -0.9238, -0.2897],
         ...,
         [ 0.2081, -0.1846, -0.3163,  ..., -0.5717,  0.3413, -0.0820],
         [-0.1503, -0.0813, -0.4625,  ..., -0.0906,  0.6473, -0.1895],
         [-0.1096, -0.1370, -0.3655,  ..., -0.2530,  1.3197, -0.1997]]],
       device='cuda:0')
> /data/projects/motr_eccvw_to_publish/models/motr.py(544)_forward_single_image()
-> self.transformer(srcs, masks, pos, query_embed, ref_pts=ref_pts,
(Pdb) c
tensor([[[-0.0473, -0.1809, -0.1649,  ..., -0.3847,  0.4224, -0.1286],
         [-0.0995, -0.0424,  0.1443,  ..., -0.4425,  0.0478, -0.1900],
         [ 0.0369,  0.0160,  0.1610,  ..., -0.4706, -0.9238, -0.2897],
         ...,
         [ 0.2383, -0.1272, -0.1161,  ..., -0.6548, -0.1152, -0.1674],
         [-0.0791, -0.0713, -0.2048,  ..., -0.2818,  0.4432, -0.0805],
         [-0.1542, -0.1183, -0.5157,  ..., -0.2627,  1.2313, -0.2040]]],
       device='cuda:0')
> /data/projects/motr_eccvw_to_publish/models/motr.py(549)_forward_single_image()
fengxiuyaun commented 1 year ago

@zyayoung Thanks for the reply.

Before modifying the code:

self._forward_single_image(frame, tmp, None)['pred_logits']
tensor([[[-3.2966], [-4.3049], [-4.5905], [-3.0246], [-3.3853], [-3.2107], [-3.1802], [-2.6938], [-2.6116], [-3.3340], [-2.4464], [-2.4135], [-2.3779]]], device='cuda:0')

self._forward_single_image(frame, tmp, gtboxes)['pred_logits']
tensor([[[-6.0443], [-6.4851], [-6.7420], [-6.3519], [-5.3412], [-6.6307], [-6.4254], [-6.2463], [-6.3488], [-5.3368], [ 3.7604], [ 3.5694], [ 3.9735], [-5.1342], [-5.3880], [-5.1033]]], device='cuda:0')

After I modified the code at https://github.com/megvii-research/MOTRv2/blob/be49b7336218e470c9ebcd34be54fe7eec702675/models/deformable_transformer_plus.py#L356 as follows:

def _forward_self_cross(self, tgt, query_pos, reference_points, src, src_spatial_shapes, level_start_index, src_padding_mask=None, attn_mask=None):

    # self attention
    if attn_mask is not None:
        # number of non-denoising queries, i.e. the keys the first query may still attend to
        len_n_dt = sum(attn_mask[0] == False)
        # run self-attention separately on the two groups instead of masking
        tgt = torch.cat([
            self._forward_self_attn(tgt[:, :len_n_dt], query_pos[:, :len_n_dt]),
            self._forward_self_attn(tgt[:, len_n_dt:], query_pos[:, len_n_dt:]),
        ], dim=1)
    else:
        tgt = self._forward_self_attn(tgt, query_pos, attn_mask)
    # cross attention

self._forward_single_image(frame, tmp, gtboxes)['pred_logits']
tensor([[[-3.2966], [-4.3049], [-4.5905], [-3.0246], [-3.3853], [-3.2107], [-3.1802], [-2.6938], [-2.6116], [-3.3340], [-2.4464], [-2.4135], [-2.3779], [-1.8376], [-2.1647], [-1.9602]]], device='cuda:0')

self._forward_single_image(frame, tmp, None)['pred_logits']
tensor([[[-3.2966], [-4.3049], [-4.5905], [-3.0246], [-3.3853], [-3.2107], [-3.1802], [-2.6938], [-2.6116], [-3.3340], [-2.4464], [-2.4135], [-2.3779]]], device='cuda:0')

fengxiuyaun commented 1 year ago

I think the self-attention module cannot use attn_mask. Reason: in self-attention, each query's output depends on all K and V (attn_mask can only mask out the other Q values, but it cannot mask out the other K and V values).

fengxiuyaun commented 1 year ago

Sorry, I found the bug. It was because I was using PyTorch 1.5: in PyTorch 1.5 the attn_mask input must be a float tensor, not a bool tensor.
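
For anyone hitting the same issue, a minimal sketch of the workaround (assuming attn_mask is the bool mask where True means "do not attend"): on a PyTorch version that only accepts an additive float attn_mask, convert it before handing it to the transformer.

import torch

# hypothetical conversion, applied before the mask is passed to self.transformer:
# blocked (True) positions become -inf so they are zeroed out by the softmax
if attn_mask is not None and attn_mask.dtype == torch.bool:
    float_mask = torch.zeros_like(attn_mask, dtype=torch.float)
    float_mask.masked_fill_(attn_mask, float('-inf'))
    attn_mask = float_mask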