Hi Yueyi @66Kevin,
Thanks for your interest in our work. As shown in https://github.com/microsoft/Cream/blob/19751fe6033f2522f300cd929cbdf0c8df3ef1cd/EfficientViT/classification/model/efficientvit.py#L303C1-L306, four conv-bn layers with stride 2 are applied, so the input is downsampled by 2×2×2×2 = 16 in each spatial dimension. Hope this clarifies.
Thanks, Xinyu
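The downsampling arithmetic in the reply above can be checked with a small sketch. It uses the standard convolution output-size formula; the 3×3 kernel, the padding of 1, and the 224-pixel input are assumptions chosen for illustration, and the helper names are hypothetical, not taken from the repository.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def stacked_out_size(size, layers):
    """Apply conv_out_size for each (kernel, stride, padding) layer in turn."""
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
    return size


# Assumed configuration: four layers, each 3x3 kernel, stride 2, padding 1.
patch_embed = [(3, 2, 1)] * 4

# Track the spatial size after every layer for an assumed 224-pixel input.
sizes = [224]
for layer in patch_embed:
    sizes.append(conv_out_size(sizes[-1], *layer))

print(sizes)  # [224, 112, 56, 28, 14]
```

Under these assumptions each layer halves the resolution, so four layers give 224 / 16 = 14 tokens per side.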
Hi Xinyu @xinyuliu-jeffrey, Thanks for your reply! Oops, my mistake: four conv2d_bn modules downsample the input by a factor of 16. So would four consecutive conv2d_bn modules be equivalent to a single conv2d_bn module with kernel size = stride = 16? But that wouldn't produce overlapping patches. So at which step is the overlap mentioned in the paper achieved? Thank you for your patience! I appreciate it.
Best wishes, Yueyi Wang
@66Kevin A 3×3 conv with stride 2 has overlapping windows during downsampling, whereas a 16×16 conv with stride 16 does not. The following is a simple illustration.
Best, Xinyu
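The overlap can also be quantified with a rough receptive-field calculation. This is only a sketch, not the authors' illustration: it ignores padding at the image borders, and the function names are hypothetical. It compares four stacked 3×3 stride-2 convs with a single 16×16 stride-16 conv.

```python
def receptive_field(layers):
    """Return (receptive field, effective stride) in input pixels
    for a stack of convs given as (kernel, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the window by (k - 1) * current jump
        jump *= stride             # spacing between adjacent outputs, in input pixels
    return rf, jump


def overlap(layers):
    """Input pixels shared by two horizontally adjacent output tokens."""
    rf, jump = receptive_field(layers)
    return max(rf - jump, 0)


stacked = [(3, 2)] * 4  # four 3x3 convs with stride 2
single = [(16, 16)]     # one 16x16 conv with stride 16

print(receptive_field(stacked))  # (31, 16)
print(receptive_field(single))   # (16, 16)
print(overlap(stacked), overlap(single))  # 15 0
```

Both variants produce one token every 16 pixels, but each token of the stacked version sees a 31-pixel-wide window, so neighbouring tokens share 15 pixels along each axis. The single 16×16 conv tiles the image with no shared pixels, which is why the two are not equivalent.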
@xinyuliu-jeffrey That makes sense. Thank you so much!
Best, Yueyi
Hi, Thanks for your fantastic work! The paper's description of the EfficientViT network architecture says: "We introduce overlapping patch embedding to embed 16×16 patches into tokens with C1 dimension." That is, the image is transformed into a token map of shape (H/16, W/16, C1). Generally speaking, this is done with a convolutional layer whose kernel size and stride both equal the patch size. However, in the source code, the patch embedding part seems to downsample the image only to 1/8 of the original resolution, which does not match the paper. Why is this? Is there something wrong with my understanding?
Best wishes, Yueyi Wang