stillwaterman commented 2 years ago

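Hello, I noticed that your code borrows some code from the DETR project. I'm a bit confused about DETR's feature shape and hope you can help. DETR reshapes the image feature from (batch, 256, h, w) to (hw, batch, 256). Why is batch placed in the second dimension? Does the subsequent attention computation then depend on the batch size? The transformer attention implementations I have seen all put batch in the first dimension and compute attention between tokens, with a shape like (batch, hw, 256).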

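In #2 it was mentioned that when the training batch size is 1 and the test batch size is 2, the model's predictions get worse. The reason given was: "The reason is that it will never have encountered padding at train time, and thus will be confused when encountering it at test time." I don't understand what padding means here. Does it mean the batch size getting larger? If the batch size affects predictions at test time, does that mean the (hw, batch, 256) shape has some kind of problem?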

tangjiuqi097 commented 2 years ago

You can take a look at the attention source code; it puts batch in the second dimension.
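As a rough illustration of the point above, the sketch below feeds a (hw, batch, 256)-style tensor into PyTorch's `nn.MultiheadAttention` (whose default is `batch_first=False`) and checks that one sample's output does not change when another sample is present in the batch. The sizes are toy values chosen for the example, not the ones used in DETR or AnchorDETR.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hw, batch, dim = 6, 2, 8   # toy sizes (assumption), much smaller than a real model
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=2)  # batch_first=False by default
attn.eval()

x = torch.randn(hw, batch, dim)   # (sequence, batch, embedding) layout

with torch.no_grad():
    out_batched, _ = attn(x, x, x)        # both samples together
    x0 = x[:, :1]                         # sample 0 on its own, shape (hw, 1, dim)
    out_single, _ = attn(x0, x0, x0)

# Tokens only attend to tokens of the same sample, so sample 0 should get the
# same result with or without sample 1 in the batch (up to float rounding).
same = torch.allclose(out_batched[:, :1], out_single, atol=1e-5)
print(out_batched.shape, same)
```

In other words, the position of the batch dimension is a layout choice; attention is still computed over the hw tokens of each sample separately.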

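When the batch size is greater than 1 and the images have different sizes, they need to be padded to the same size before being fed into the network.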
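A minimal sketch of what such padding can look like, assuming two images of different spatial size that are zero-padded to a common size, plus a boolean mask marking the padded pixels. The variable names and sizes here are made up for illustration and are not taken from the AnchorDETR code.

```python
import torch

# Two images with different spatial sizes, each (channels, height, width).
images = [torch.randn(3, 4, 6), torch.randn(3, 5, 3)]

max_h = max(img.shape[1] for img in images)
max_w = max(img.shape[2] for img in images)

batch = torch.zeros(len(images), 3, max_h, max_w)
# mask is True where a pixel is padding and False where it is real image content.
mask = torch.ones(len(images), max_h, max_w, dtype=torch.bool)

for i, img in enumerate(images):
    _, h, w = img.shape
    batch[i, :, :h, :w] = img     # copy the image into the top-left corner
    mask[i, :h, :w] = False

# Flattened to (batch, hw), a mask of this kind has the form expected by the
# key_padding_mask argument of nn.MultiheadAttention (True = ignore this key).
key_padding_mask = mask.flatten(1)
print(batch.shape, key_padding_mask.sum(dim=1))
```

With a batch size of 1 there is nothing to pad against, so a model trained that way never sees padded regions, which is the situation the quoted explanation describes.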

stillwaterman commented 2 years ago

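I looked at the source code. In attention the input is shaped like (hw, batch, 256), but when q, k, v are computed the shape is changed to something like (batch, hw, 256). The transformation code is as follows: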

# reshape q, k, v for multihead attention and make em batch first
q = q.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)

The output is then reshaped back to something like (hw, batch, 256), with the following code:

attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len * bsz, embed_dim)
attn_output = linear(attn_output, out_proj_weight, out_proj_bias)
attn_output = attn_output.view(tgt_len, bsz, attn_output.size(1))

Isn't this somewhat redundant?

As for padding, thank you very much for your answer. I understand it completely now, thanks! @tangjiuqi097
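For reference, the reshape steps quoted above can be traced on a toy tensor. This is only a sketch under assumed sizes: it leaves out masks, dropout and the input and output projections, and `to_heads` is a hypothetical helper name, not a PyTorch function. It suggests the transpose is doing real work: the `view` splits the embedding into heads, and the batch-first layout is what `torch.bmm` needs.

```python
import torch

torch.manual_seed(0)
tgt_len, bsz, num_heads, head_dim = 5, 2, 2, 4   # toy sizes (assumption)
embed_dim = num_heads * head_dim

q = torch.randn(tgt_len, bsz, embed_dim)   # (hw, batch, 256)-style layout
k = torch.randn(tgt_len, bsz, embed_dim)
v = torch.randn(tgt_len, bsz, embed_dim)

def to_heads(t):
    # (L, N, E) -> (N * heads, L, head_dim): every (sample, head) pair
    # becomes one entry of the leading batch dimension.
    return t.contiguous().view(t.shape[0], bsz * num_heads, head_dim).transpose(0, 1)

qh, kh, vh = to_heads(q), to_heads(k), to_heads(v)

# torch.bmm expects (batch, n, m) inputs, hence the batch-first layout here.
scores = torch.bmm(qh, kh.transpose(1, 2)) / head_dim ** 0.5
weights = torch.softmax(scores, dim=-1)    # (N * heads, L, L)
out = torch.bmm(weights, vh)               # (N * heads, L, head_dim)

# Merge the heads again and restore the (L, N, E) layout of the input.
out = out.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
print(qh.shape, weights.shape, out.shape)
```

The attention weights have shape (bsz * num_heads, tgt_len, tgt_len), so tokens are compared only within one sample and one head, whichever layout the caller passes in.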

megvii-research / AnchorDETR

Question about feature shape #40
