valeoai / rangevit


Questions about details #5

Closed: huixiancheng closed this issue 7 months ago

huixiancheng commented 10 months ago
  1. Since PixelShuffle doesn't work on a patch_embed like 2x8, Rearrange is used instead. Although the two operations work in a very similar way, what is the effect on performance? E.g., if Rearrange were also used on an 8x8 downsample, would it be better or worse than PixelShuffle? By the way, it looks similar to PatchExpand in SwinUnet. https://github.com/valeoai/rangevit/blob/4484d35b656f8213d887754780a4aa1e23303ea3/models/decoders.py#L72-L79
  2. I'm not sure about the usage of this F.interpolate op. Does it mean that the feature size of the encoder-decoder output will sometimes differ from the input? https://github.com/valeoai/rangevit/blob/4484d35b656f8213d887754780a4aa1e23303ea3/models/rangevit.py#L229-L231
  3. Since RangeViT only upsamples the final-layer feature map and concatenates it with the skip connection, I'm not sure whether the original idea was to keep the model simple and efficient. Have you ever tried PUP or MLA structures similar to SETR? I mention this because only Segmenter is mentioned in the paper.
huixiancheng commented 10 months ago

By the way, is the latency in Table 10 measured with sliding windows or on whole range images?

angelika1108 commented 8 months ago

Hi @huixiancheng! We are sorry for the late reply and we thank you very much for your questions!

  1. As PixelShuffle doesn't work for a non-square scale_factor, we use einops Rearrange instead. Since it is equivalent to PixelShuffle, it shouldn't affect performance. Indeed, we could also have used Rearrange with a square scale_factor, and it would have been equivalent to PixelShuffle (see the first sketch after this list). Yes, it's exactly like PatchExpand in SwinUnet (except that in SwinUnet the channel dimension is the last dimension).

  2. We needed the F.interpolate() in the case of the "Linear decoder", which outputs a feature map with spatial dimensions [H/P_H, W/P_W], where the patch size is [P_H, P_W]. So F.interpolate() bilinearly upsamples the feature map to the same spatial dimensions as the input image, [H, W] (see the second sketch after this list). You are right that we don't need F.interpolate() in the case of the "UpConv" decoder, as the spatial dimensions of the decoder output are already [H, W]; this is because we use the same scale_factor as the patch_size. However, if someone wants to use a different (smaller) scale_factor than the patch_size, then F.interpolate() is necessary to produce a feature map with spatial dimensions [H, W].

  3. Yes, we wanted to keep the model architecture as simple as possible. The "Linear decoder" is inspired by the linear decoder of Segmenter (Sec. 3.2), but we use the "UpConv decoder" in our final model, which is still simple yet more efficient thanks to the PixelShuffle layer and the skip connection (a rough sketch follows the list). We haven't tried other decoder architectures, but it's a good direction for future work!

  4. The inference time in Table 10 refers to whole range images.
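
Regarding point 1, here is a minimal sketch (not code from the repo; tensor sizes are illustrative) showing that an einops Rearrange reproduces nn.PixelShuffle for a square scale factor, and also covers the non-square 2x8 case that PixelShuffle cannot handle:

```python
import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

x = torch.randn(1, 64, 4, 16)  # (B, C * r_h * r_w, H, W), sizes made up

# Square case: Rearrange matches nn.PixelShuffle(2) exactly.
pixel_shuffle = nn.PixelShuffle(2)
rearrange_sq = Rearrange('b (c h2 w2) h w -> b c (h h2) (w w2)', h2=2, w2=2)
assert torch.equal(pixel_shuffle(x), rearrange_sq(x))

# Non-square case (2x8 patches): PixelShuffle has no equivalent here.
rearrange_rect = Rearrange('b (c h2 w2) h w -> b c (h h2) (w w2)', h2=2, w2=8)
print(rearrange_rect(x).shape)  # torch.Size([1, 4, 8, 128])
```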
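
Regarding point 2, a minimal sketch of the F.interpolate() step described above (all sizes are made-up examples): the "Linear decoder" emits logits at the patch-grid resolution [H/P_H, W/P_W], and bilinear interpolation brings them back to [H, W].

```python
import torch
import torch.nn.functional as F

H, W = 64, 2048    # illustrative range-image size
P_H, P_W = 2, 8    # patch size
num_classes = 20

# Decoder output at the patch grid: (B, num_classes, H/P_H, W/P_W)
logits = torch.randn(1, num_classes, H // P_H, W // P_W)

# Bilinearly upsample to the input spatial resolution.
logits_full = F.interpolate(logits, size=(H, W), mode='bilinear', align_corners=False)
print(logits_full.shape)  # torch.Size([1, 20, 64, 2048])
```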
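
Regarding point 3, a rough sketch of the UpConv-decoder idea: expand the ViT feature map back to [H, W] with a Rearrange-based pixel shuffle, concatenate the skip features, and fuse with a small conv head. This is not the repo's exact module; the class name UpConvDecoderSketch and all channel sizes are hypothetical.

```python
import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

class UpConvDecoderSketch(nn.Module):
    def __init__(self, embed_dim=384, skip_dim=64, num_classes=20, patch_size=(2, 8)):
        super().__init__()
        p_h, p_w = patch_size
        out_dim = embed_dim // (p_h * p_w)
        # Pixel-shuffle-style expansion from the patch grid to full resolution.
        self.expand = Rearrange('b (c h2 w2) h w -> b c (h h2) (w w2)', h2=p_h, w2=p_w)
        self.fuse = nn.Sequential(
            nn.Conv2d(out_dim + skip_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, kernel_size=1),
        )

    def forward(self, feat, skip):
        # feat: (B, embed_dim, H/p_h, W/p_w); skip: (B, skip_dim, H, W)
        x = self.expand(feat)
        x = torch.cat([x, skip], dim=1)  # skip connection from the stem
        return self.fuse(x)

decoder = UpConvDecoderSketch()
feat = torch.randn(1, 384, 32, 256)   # ViT features at the patch grid
skip = torch.randn(1, 64, 64, 2048)   # full-resolution skip features
print(decoder(feat, skip).shape)      # torch.Size([1, 20, 64, 2048])
```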

huixiancheng commented 7 months ago

Thanks for your kind reply!