lhoyer / DAFormer

[CVPR22] Official Implementation of DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation

The code in the channel alignment section seems inconsistent with the paper. #33

Closed Levantespot closed 2 years ago

Levantespot commented 2 years ago

In 3.2 of the paper:

Before the feature fusion, we embed each $F_i$ to the same number of channels $C_e$ by a 1×1 convolution, bilinearly upsample the features to the size of $F_1$, and concatenate them.

However, the model built from configs/daformer/gta2cs_uda_warm_fdthings_rcs_croppl_a999_daformer_mitb5_s0.py uses MLP layers instead of 1×1 convolutions:

...
(decode_head): DAFormerHead(
      input_transform=multiple_select, ignore_index=255, align_corners=False
      (loss_decode): CrossEntropyLoss()
      (conv_seg): Conv2d(256, 19, kernel_size=(1, 1), stride=(1, 1))
      (dropout): Dropout2d(p=0.1, inplace=False)
      (embed_layers): ModuleDict(
        (0): MLP(
          (proj): Linear(in_features=64, out_features=256, bias=True)
        )
        (1): MLP(
          (proj): Linear(in_features=128, out_features=256, bias=True)
        )
        (2): MLP(
          (proj): Linear(in_features=320, out_features=256, bias=True)
        )
        (3): MLP(
          (proj): Linear(in_features=512, out_features=256, bias=True)
        )
      )
      (fuse_layer) ...

I'd be grateful for your help.

lhoyer commented 2 years ago

The MLP layer, as defined here https://github.com/lhoyer/DAFormer/blob/8d6e710700ff5e6a053c77bfe384ba44d4672cbe/mmseg/models/decode_heads/segformer_head.py#L18, flattens its input over the spatial dimensions so that the same linear layer is applied to every pixel independently. It therefore has exactly the same behavior as a 1×1 convolution. You can also see this from the number of in_features, which match the backbone's channel counts.
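The equivalence can be checked numerically. The sketch below (assumed shapes; the flatten/transpose pattern mirrors the linked MLP, but this is an illustration, not the repo's code) applies a `nn.Linear` to flattened pixels and a `nn.Conv2d` with kernel size 1 sharing the same weights, then compares the outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical shapes mirroring the first embed layer (in=64, out=256).
B, C_in, C_out, H, W = 2, 64, 256, 8, 8
x = torch.randn(B, C_in, H, W)

# Linear applied per pixel: flatten spatial dims, project, restore shape.
linear = nn.Linear(C_in, C_out)
y_mlp = linear(x.flatten(2).transpose(1, 2))          # (B, H*W, C_out)
y_mlp = y_mlp.transpose(1, 2).reshape(B, C_out, H, W)  # (B, C_out, H, W)

# 1x1 convolution carrying the identical weights and bias.
conv = nn.Conv2d(C_in, C_out, kernel_size=1)
conv.weight.data = linear.weight.data.view(C_out, C_in, 1, 1).clone()
conv.bias.data = linear.bias.data.clone()
y_conv = conv(x)

# The two outputs agree up to floating-point tolerance.
assert torch.allclose(y_mlp, y_conv, atol=1e-5)
```

Both operations compute the same per-pixel matrix multiply; only the memory layout of the computation differs.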

Levantespot commented 2 years ago

Thanks for your response!