leondgarse / keras_cv_attention_models

Keras beit,caformer,CMT,CoAtNet,convnext,davit,dino,efficientdet,edgenext,efficientformer,efficientnet,eva,fasternet,fastervit,fastvit,flexivit,gcvit,ghostnet,gpvit,hornet,hiera,iformer,inceptionnext,lcnet,levit,maxvit,mobilevit,moganet,nat,nfnets,pvt,swin,tinynet,tinyvit,uniformer,volo,vanillanet,yolor,yolov7,yolov8,yolox,gpt2,llama2, alias kecam
MIT License

[CoAtNet] Strides should be used in the first conv layer for down-sampling in MBConv #31

Closed. xskxzr closed this issue 2 years ago.

xskxzr commented 2 years ago

According to eq. (5) in the paper, strides=2 is used in the first conv layer for down-sampling in MBConv.

However, in lines 78-80 of coatnet.py, strides=1 is used in the first conv while strides=strides is used in the depthwise conv.
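
To make the two options concrete, here is a minimal Keras sketch of an MBConv-style block (not the repository code, and with shortcut/residual handling omitted) showing the two possible placements of the strides: on the first expansion Conv2D, as in eq. (5) of the paper, or on the DepthwiseConv2D, as currently done in coatnet.py. The `dw_strides` flag and the exact layer ordering here are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow import keras

def mbconv_block(inputs, out_channels, strides=2, expansion=4, dw_strides=True):
    """Sketch of an MBConv block with configurable down-sampling placement."""
    hidden_dim = inputs.shape[-1] * expansion
    conv_strides = 1 if dw_strides else strides   # strides on the expansion conv when dw_strides=False
    depth_strides = strides if dw_strides else 1  # strides on the depthwise conv when dw_strides=True

    nn = keras.layers.BatchNormalization()(inputs)
    nn = keras.layers.Conv2D(hidden_dim, 1, strides=conv_strides, padding="same", use_bias=False)(nn)
    nn = keras.layers.BatchNormalization()(nn)
    nn = keras.layers.Activation("gelu")(nn)
    nn = keras.layers.DepthwiseConv2D(3, strides=depth_strides, padding="same", use_bias=False)(nn)
    nn = keras.layers.BatchNormalization()(nn)
    nn = keras.layers.Activation("gelu")(nn)
    nn = keras.layers.Conv2D(out_channels, 1, strides=1, padding="same", use_bias=False)(nn)
    return nn  # shortcut / residual branch omitted for brevity

inputs = keras.layers.Input([32, 32, 64])
print(mbconv_block(inputs, 128, strides=2, dw_strides=True).shape)   # (None, 16, 16, 128)
print(mbconv_block(inputs, 128, strides=2, dw_strides=False).shape)  # (None, 16, 16, 128)
```

Both placements yield the same output shape; the difference is only where the spatial reduction happens inside the block.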

leondgarse commented 2 years ago

Ya, I know that. It's stated on page 14 of the paper:

This is different from the standard MBConv where the down-sampling is done by applying stride-2 depthwise convolution to the inverted bottleneck hidden states. We later found using stride-2 depthwise convolution is helpful but slower when model is small but not so much when model scales

So it's another test-and-tell case. Both should work; stride-2 DepthwiseConv may be slower and a bit better, but I haven't tested applying strides=2 on the first conv.

leondgarse commented 2 years ago

A parameter use_dw_strides has been added for CoAtNet, which can be set to False to apply the strides on the Conv2D in MBConv instead. The default is True.
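
A minimal usage sketch, assuming the new parameter is forwarded through the model constructors (CoAtNet0 is a model name from the package; passing use_dw_strides this way is an assumption based on the comment above):

```python
from keras_cv_attention_models import coatnet

# Default behavior: down-sampling via stride-2 DepthwiseConv2D inside MBConv.
model = coatnet.CoAtNet0()

# Assumed usage: move the strides to the first Conv2D of MBConv instead,
# matching eq. (5) in the paper.
model = coatnet.CoAtNet0(use_dw_strides=False)
```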