All convolutions have kernels of size (3, 3) and are not dilated.
Normalized keys and queries are used in the attention layers.
All attention windows have size 8 by 8.
All attention layers use 32 heads (see the attention sketch after this list).
Skip connections run from the output of each MaxViT block to the output of MaxViT. More precisely, the final output of MaxViT is a linear transformation of the outputs after each sub-block (after summing with the residual branch); see the aggregation sketch after this list.
MBConv in MaxViT uses an expansion rate of 4.
Squeeze-and-excitation modules in the MBConv blocks have a bottleneck ratio of 0.25.
Pre-activation layer normalization is used throughout the network.
Layer normalization is applied after each convolution that is not the last convolution in a given sub-block.
GELU is used inside MBConv (in MaxViT) and ReLU in all other places (see the MBConv sketch after this list).
Stochastic depth is used in MaxViT, with the probability of dropping a given sub-module (i.e. MBConv, local attention, or gridded attention) increasing linearly through the network from 0 to 0.2 (see the stochastic-depth sketch after this list).
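
For concreteness, the attention settings above (8-by-8 windows, 32 heads, normalized keys and queries) might look roughly as follows in PyTorch. This is a minimal sketch: the `WindowAttention` name, the L2 normalization of queries and keys, and the learnable per-head temperature are illustrative assumptions rather than requirements stated in the spec.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttention(nn.Module):
    """Multi-head attention over non-overlapping 8x8 windows with
    L2-normalized keys and queries and 32 heads (illustrative sketch)."""

    def __init__(self, dim, num_heads=32, window=8):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by the number of heads"
        self.num_heads = num_heads
        self.window = window
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head temperature: an assumption, in the spirit of cosine attention.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x):  # x: (B, H, W, C) with H and W divisible by the window size
        B, H, W, C = x.shape
        w, h = self.window, self.num_heads
        # Partition the feature map into non-overlapping w x w windows.
        x = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)                      # (B * num_windows, w*w, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.reshape(-1, w * w, h, C // h).transpose(1, 2)
        k = k.reshape(-1, w * w, h, C // h).transpose(1, 2)
        v = v.reshape(-1, w * w, h, C // h).transpose(1, 2)
        # Normalized keys and queries.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        out = attn.softmax(dim=-1) @ v                   # (B * num_windows, heads, w*w, C // h)
        out = out.transpose(1, 2).reshape(-1, w * w, C)
        out = self.proj(out)
        # Undo the window partition.
        out = out.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)
```

In MaxViT the grid-attention variant uses the same attention mechanism; only the way the feature map is partitioned into windows differs.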
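One plausible reading of the extra skip connections is sketched below: the output of every sub-block (after its residual sum) is pooled, the pooled vectors are concatenated, and a single linear layer produces the final output. The global average pooling and the concatenation are assumptions made for illustration; only the "linear transformation of the sub-block outputs" part comes from the spec.

```python
import torch
import torch.nn as nn

class AggregatedOutput(nn.Module):
    """Combines the outputs of all sub-blocks into the final MaxViT output.
    Pooling and concatenation are illustrative assumptions."""

    def __init__(self, sub_block_dims, out_dim):
        super().__init__()
        self.final = nn.Linear(sum(sub_block_dims), out_dim)

    def forward(self, sub_block_outputs):  # list of (B, C_i, H_i, W_i) tensors
        pooled = [t.mean(dim=(2, 3)) for t in sub_block_outputs]   # (B, C_i) each
        return self.final(torch.cat(pooled, dim=1))                # (B, out_dim)
```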
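A minimal MBConv sketch combining the points above (expansion rate 4, SE bottleneck ratio 0.25, pre-activation layer normalization, layer normalization after every convolution except the last one, GELU inside the block). The exact ordering of the depthwise convolution and the SE module, and the use of GELU inside the SE bottleneck, are assumptions.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dimension of (B, C, H, W) tensors."""
    def forward(self, x):
        return super().forward(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation with a bottleneck ratio of 0.25. GELU in the
    bottleneck is an assumption (the block lives inside MBConv); the sigmoid gate is standard."""
    def __init__(self, channels, ratio=0.25):
        super().__init__()
        hidden = max(1, int(channels * ratio))
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)                    # squeeze
        s = torch.sigmoid(self.fc2(self.act(self.fc1(s))))      # excite
        return x * s

class MBConv(nn.Module):
    """MBConv sub-block: pre-activation LayerNorm, 1x1 expansion (rate 4),
    3x3 depthwise conv, SE (ratio 0.25), 1x1 projection, residual add.
    LayerNorm follows every conv except the last; GELU is used inside.
    Stride 1 and equal in/out channels are assumed so the residual add works."""

    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.pre_norm = LayerNorm2d(channels)
        self.expand = nn.Conv2d(channels, hidden, kernel_size=1)
        self.norm1 = LayerNorm2d(hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.norm2 = LayerNorm2d(hidden)
        self.se = SqueezeExcite(hidden)
        self.project = nn.Conv2d(hidden, channels, kernel_size=1)  # no norm after the last conv
        self.act = nn.GELU()

    def forward(self, x):
        y = self.pre_norm(x)
        y = self.act(self.norm1(self.expand(y)))
        y = self.act(self.norm2(self.dwconv(y)))
        y = self.se(y)
        y = self.project(y)
        return x + y
```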
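The stochastic-depth schedule can be sketched as follows: each sub-module's residual branch is dropped with a probability that grows linearly with the sub-module's index, from 0 for the first to 0.2 for the last. The `drop_probabilities` helper is hypothetical; only the linear 0-to-0.2 schedule comes from the spec.

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Drops the whole residual branch of a sub-module with probability p during
    training, scaling the surviving branch by 1 / (1 - p); identity at eval time."""
    def __init__(self, p):
        super().__init__()
        self.p = p

    def forward(self, branch_output):  # branch_output: (B, ...) output of MBConv / attention
        if not self.training or self.p == 0.0:
            return branch_output
        shape = (branch_output.shape[0],) + (1,) * (branch_output.ndim - 1)
        keep = (torch.rand(shape, device=branch_output.device) >= self.p).to(branch_output.dtype)
        return branch_output * keep / (1.0 - self.p)

def drop_probabilities(num_sub_modules, max_p=0.2):
    """Linear schedule: the i-th of N sub-modules (MBConv, local attention,
    gridded attention) gets drop probability max_p * i / (N - 1)."""
    if num_sub_modules == 1:
        return [0.0]
    return [max_p * i / (num_sub_modules - 1) for i in range(num_sub_modules)]

# Usage: wrap every sub-module's residual branch, e.g.
# probs = drop_probabilities(total_sub_modules)
# x = x + StochasticDepth(probs[i])(sub_module(x))
```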
For reference:
From the paper:
Tasks: