microsoft / CSWin-Transformer

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped, CVPR 2022
MIT License

The results of downstream task by my realization are poor #17

Closed Sunting78 closed 2 years ago

Sunting78 commented 2 years ago

There are some questions:

  1. Is the split size still [1 2 7 7]?
  2. Is branch_num in the last stage 2 or 1? For downstream tasks, the feature-map resolution in the last stage does not equal 7 (the split size). If branch_num is not 1, the shapes of the pretrained weights do not match.
  3. Is the padding in my implementation correct?

         pad_l = pad_t = 0
         pad_r = (W_sp - W % W_sp) % W_sp
         pad_b = (H_sp - H % H_sp) % H_sp
         q = q.transpose(-2,-1).contiguous().view(B, H, W, C)
         k = q.transpose(-2,-1).contiguous().view(B, H, W, C)
         v = q.transpose(-2,-1).contiguous().view(B, H, W, C)
         if pad_r > 0 or pad_b > 0:
             q = F.pad(q, (0, 0, pad_l, pad_r, pad_t, pad_b))
             k = F.pad(k, (0, 0, pad_l, pad_r, pad_t, pad_b))
             v = F.pad(v, (0, 0, pad_l, pad_r, pad_t, pad_b))
         _, Hp, Wp, _ = q.shape

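
The padding arithmetic in question 3 can be checked in isolation. The following is a minimal sketch in plain Python, not the repository's code: the helper name `pad_to_multiple` is hypothetical, and it assumes `H_sp` and `W_sp` are the stripe height and width that the padded feature map must be divisible by. Note also that the snippet as posted builds `k` and `v` from `q`, which looks like a copy-paste slip rather than intended behavior.

```python
def pad_to_multiple(H, W, H_sp, W_sp):
    """Return (pad_r, pad_b, Hp, Wp) for right/bottom padding.

    Hypothetical helper: pads only on the right and bottom
    (pad_l = pad_t = 0), mirroring the formulas in the question.
    """
    pad_r = (W_sp - W % W_sp) % W_sp  # 0 when W is already a multiple of W_sp
    pad_b = (H_sp - H % H_sp) % H_sp  # 0 when H is already a multiple of H_sp
    return pad_r, pad_b, H + pad_b, W + pad_r


if __name__ == "__main__":
    # Illustrative sizes only: a 25 x 38 feature map with 7 x 7 splits.
    pad_r, pad_b, Hp, Wp = pad_to_multiple(25, 38, 7, 7)
    assert Hp % 7 == 0 and Wp % 7 == 0
    print(pad_r, pad_b, Hp, Wp)
```

For a tensor laid out as (B, H, W, C), these amounts correspond to `F.pad(x, (0, 0, 0, pad_r, 0, pad_b))`, because `F.pad` lists padding pairs starting from the last dimension.
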
LightDXY commented 2 years ago

The code and models have been released. Please create a new issue for any new problems.

LUO77123 commented 2 years ago

(Quoting the three questions from the original post above: split size, last-stage branch_num, and padding.)

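I am in the same situation as you: my downstream results are only passable. From reading the source code, if the input image is small enough that the final feature map (downsampled 32x) is the same size as the window, branch_num in the last stage is 1. If the downstream task uses a larger input image, so that the final feature map (downsampled 32x) is larger than the window, branch_num in the last stage is 2. However, the author has not answered this.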
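
The rule described in this comment can be written out as a short sketch. It follows the comment's reading of the source, not a confirmed statement from the authors; the function name `last_stage_branch_num` and its parameters are hypothetical, and the total stride of 32 and split size of 7 are taken from the discussion above.

```python
def last_stage_branch_num(img_size, split_size=7, total_stride=32):
    """Sketch of the rule from the comment: one branch when the final
    feature map is exactly one window, two branches otherwise.

    Hypothetical helper; assumes a square input and integer downsampling.
    """
    resolution = img_size // total_stride  # side length of the last-stage feature map
    return 1 if resolution == split_size else 2


if __name__ == "__main__":
    for size in (224, 512):
        print(size, size // 32, last_stage_branch_num(size))
```

Under this reading, a 224 x 224 input gives a 7 x 7 final feature map and a single branch, while a 512 x 512 input gives 16 x 16 and two branches, which would be consistent with the weight-shape mismatch raised in question 2.
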

wujiang0156 commented 1 year ago

@LUO77123 @LightDXY @Sunting78 I have the same problem as you.