microsoft / Cream

This is a collection of our NAS and Vision Transformer work.
MIT License

Question about Patch Embedding in EfficientViT #168

Closed 66Kevin closed 1 year ago

66Kevin commented 1 year ago

Hi, thanks for your fantastic work! In describing the network architecture, the EfficientViT paper says: "We introduce overlapping patch embedding to embed 16×16 patches into tokens with C1 dimension." That is, the image is transformed into an (H/16, W/16, C1) grid of tokens. Typically this is done with a single convolutional layer whose kernel size and stride both equal the patch size. However, in the source code, the patch embedding seems to downsample the image only to 1/8 of its original resolution, which does not match the paper. Why is this? Am I misunderstanding something?

Best wishes, Yueyi Wang

xinyuliu-jeffrey commented 1 year ago

Hi Yueyi @66Kevin ,

Thanks for your interest in our work. As in https://github.com/microsoft/Cream/blob/19751fe6033f2522f300cd929cbdf0c8df3ef1cd/EfficientViT/classification/model/efficientvit.py#L303C1-L306, four conv-bn layers with stride 2 are applied, which means the input is downsampled by 2×2×2×2 = 16. Hope this clarifies.

Thanks, Xinyu
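
For readers following along, the arithmetic above can be checked with the standard convolution output-size formula. This is only a sketch: the 3×3 kernel matches what is stated later in this thread, but the padding of 1 and the 224-pixel input are assumptions made for illustration, and `conv_out_size` and `stacked_out_size` are hypothetical helper names, not functions from the repository.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of one conv layer: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def stacked_out_size(size, layers):
    """Apply several conv layers in sequence; each layer is (kernel, stride, padding)."""
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
    return size


# Assumed configuration: four 3x3 convs, stride 2, padding 1 (padding is an assumption).
patch_embed = [(3, 2, 1)] * 4

input_size = 224  # assumed input resolution, for illustration only
output_size = stacked_out_size(input_size, patch_embed)
print(input_size, "->", output_size, "(factor", input_size // output_size, ")")
```

Under these assumptions each layer halves the resolution (224, 112, 56, 28, 14), so the overall reduction is 16×, the same as a single 16×16 patch embedding.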

66Kevin commented 1 year ago

Hi Xinyu @xinyuliu-jeffrey, thanks for your reply! Oops, my mistake: four conv2d_bn modules do downsample by a factor of 16. So would four consecutive conv2d_bn modules be equivalent to a single conv2d_bn module with kernel size = stride = 16? But that wouldn't produce any overlap. At which step is the overlapping mentioned in the paper achieved? Thank you for your patience! I appreciate it.

Best wishes, Yueyi Wang

xinyuliu-jeffrey commented 1 year ago

@66Kevin A 3×3 conv with stride 2 will have overlap during downsampling, while a 16×16 conv with stride 16 does not. The following is a simple illustration.

[image: illustration of overlapping 3×3 stride-2 windows versus non-overlapping 16×16 stride-16 windows]

Best, Xinyu
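
The overlap can also be checked numerically with the usual receptive-field recurrence. This is a rough sketch rather than code from the repository: `receptive_field` and `overlap` are hypothetical helpers, and padding is ignored because it does not change the size of the window each token sees.

```python
def receptive_field(layers):
    """Return (receptive field, step between adjacent outputs), both in input pixels.

    Each layer is (kernel, stride). Standard recurrence:
    rf += (kernel - 1) * jump, then jump *= stride.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


def overlap(layers):
    """Input pixels shared by two neighbouring output tokens along one axis."""
    rf, jump = receptive_field(layers)
    return max(rf - jump, 0)


stacked = [(3, 2)] * 4   # four 3x3 convs with stride 2
single = [(16, 16)]      # one 16x16 conv with stride 16

print("stacked:", receptive_field(stacked), "overlap:", overlap(stacked))
print("single: ", receptive_field(single), "overlap:", overlap(single))
```

Both designs place one token every 16 input pixels, but each token from the stacked 3×3 convs sees a 31-pixel window, so neighbouring tokens share 15 pixels along each axis, whereas the single 16×16 stride-16 conv sees exactly 16 pixels and shares none. So the two are not equivalent, even though the downsampling factor is the same.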

66Kevin commented 1 year ago

@xinyuliu-jeffrey That makes sense. Thank you so much!

Best, Yueyi