huixiancheng closed this issue 7 months ago
By the way, is the latency in Table 10 measured on sliding-window or whole range images?
Hi @huixiancheng! We are sorry for the late reply and we thank you very much for your questions!
As `PixelShuffle` doesn't work for a non-square `scale_factor`, we use einops `Rearrange` instead. Since `Rearrange` is equivalent to `PixelShuffle` (for a square `scale_factor` they produce identical outputs), it shouldn't affect the performance. Yes, it's exactly like PatchExpand in SwinUnet (just that the channel dimension is the last dimension in SwinUnet).
We needed the `F.interpolate()` in the case of the "Linear decoder", which outputs a feature map with spatial dimensions `[H/P_H, W/P_W]`, where the patch size is `[P_H, P_W]`. So, `F.interpolate()` bilinearly upsamples the feature map to the same spatial dimensions as the input image, `[H, W]`. You are right that we don't need `F.interpolate()` in the case of the "UpConv" decoder, as the spatial dimensions of the decoder's output are already `[H, W]`. This is because we use the same `scale_factor` as the `patch_size`. However, if someone wants to use a different (smaller) `scale_factor` than the `patch_size`, then `F.interpolate()` is necessary to produce a feature map with spatial dimensions `[H, W]`.
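A short sketch of that upsampling step, with hypothetical sizes (patch size `[2, 8]`, image `[64, 512]`, 19 classes) chosen only for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions: patch size [P_H, P_W] = [2, 8], input image [H, W] = [64, 512].
H, W = 64, 512
P_H, P_W = 2, 8
num_classes = 19

# A "Linear decoder"-style output: per-patch logits at resolution [H/P_H, W/P_W].
logits = torch.randn(1, num_classes, H // P_H, W // P_W)  # (1, 19, 32, 64)

# Bilinearly upsample to the input image resolution [H, W].
full_res = F.interpolate(logits, size=(H, W), mode='bilinear', align_corners=False)
print(full_res.shape)  # torch.Size([1, 19, 64, 512])
```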
Yes, we wanted to keep the model architecture as simple as possible. The "Linear decoder" is inspired by the "Linear decoder" of Segmenter (Sec. 3.2) but we use the "UpConv decoder" in our final model, which is still simple, but more efficient thanks to the PixelShuffle layer and skip connection. We haven't tried different decoder architectures, but it's a good idea for a future direction!
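For readers who want the general shape of such a decoder, here is a minimal sketch of an "UpConv"-style head combining a pixel-shuffle upsample with a skip connection. All sizes, channel counts, and layer choices here are hypothetical illustrations, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class UpConvDecoder(nn.Module):
    """Hypothetical sketch: upsample via PixelShuffle, fuse a skip connection, classify."""
    def __init__(self, embed_dim=256, skip_dim=64, num_classes=19, scale=4):
        super().__init__()
        # Expand channels so PixelShuffle can trade them for spatial resolution.
        self.expand = nn.Conv2d(embed_dim, skip_dim * scale * scale, kernel_size=1)
        self.shuffle = nn.PixelShuffle(scale)
        # Fuse upsampled features with the skip connection, then predict per-pixel classes.
        self.fuse = nn.Conv2d(skip_dim * 2, skip_dim, kernel_size=3, padding=1)
        self.head = nn.Conv2d(skip_dim, num_classes, kernel_size=1)

    def forward(self, x, skip):
        x = self.shuffle(self.expand(x))   # (N, skip_dim, H*scale, W*scale)
        x = torch.cat([x, skip], dim=1)    # concatenate skip features channel-wise
        return self.head(self.fuse(x))

# Hypothetical feature maps: low-res encoder output and a 4x-resolution skip.
x = torch.randn(1, 256, 8, 32)
skip = torch.randn(1, 64, 32, 128)
out = UpConvDecoder()(x, skip)
print(out.shape)  # torch.Size([1, 19, 32, 128])
```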
The inference time in Table 10 refers to whole range images.
Thanks for your kind reply~!