Closed lucasjinreal closed 2 years ago
Hi @jinfagang , PoolFormer here is just a tool to demonstrate the MetaFormer concept; the implementation may not be efficient enough for industrial use. For example, nn.AvgPool2d may not be well optimized in CUDA. It can be replaced with a depthwise conv `self.token_mixer = nn.Conv2d(in_channels=dim, out_channels=dim, kernel_size=3, stride=1, padding=1, groups=dim)`
to speed things up. For GroupNorm, I don't yet know how to speed it up.
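For readers following along, here is a minimal, hypothetical sketch of that swap (the `dim` value and tensor shape are made up for illustration); it only checks that the depthwise conv is shape-compatible with the pooling mixer it replaces:

```python
import torch
import torch.nn as nn

dim = 64  # example channel count, not a value from the repo

# Token mixer as used in PoolFormer's Pooling module
pool_mixer = nn.AvgPool2d(kernel_size=3, stride=1, padding=1,
                          count_include_pad=False)

# Suggested drop-in replacement: a 3x3 depthwise conv, which tends to map to
# better-optimized CUDA kernels than nn.AvgPool2d
dw_mixer = nn.Conv2d(in_channels=dim, out_channels=dim, kernel_size=3,
                     stride=1, padding=1, groups=dim)

x = torch.randn(1, dim, 14, 14)
# Same output shape, so it slots into the block without other changes
assert dw_mixer(x).shape == pool_mixer(x).shape
```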
@yuweihao Hi, does that mean I have to retrain the whole model if I change nn.AvgPool2d to a DW conv?
@jinfagang You don't have to. In our experiments, replacing GN with BN and then reimplementing the pooling layer as a fixed, predefined depthwise conv gave us about a 30% speedup, with an accuracy drop of about 1% on ImageNet. If you use BN, you can fuse the Conv-BN pair at inference time to speed it up further.
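The Conv-BN fusion mentioned here can be done by folding the BatchNorm statistics into the conv weights for inference. A minimal sketch of the standard folding (not the authors' code; the helper name `fuse_conv_bn` is made up):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BN running statistics into the preceding conv (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    # BN(y) = scale * y + (beta - scale * mean), with scale = gamma / sqrt(var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

conv = nn.Conv2d(16, 16, 3, padding=1, groups=16, bias=False)
bn = nn.BatchNorm2d(16)
bn.train()
bn(torch.randn(8, 16, 8, 8))  # populate running stats with something nontrivial
bn.eval()
conv.eval()

fused = fuse_conv_bn(conv, bn)
x = torch.randn(2, 16, 8, 8)
# One conv replaces the conv + BN pair, with identical outputs
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
```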
@chuong98 Can you show your pretrained fixed depthwise conv? How do you set its weights?
Hi @jinfagang , @chuong98 , I just found that CUDA strongly prefers the NHWC layout over NCHW [1]. However, PyTorch uses NCHW by default, and PoolFormer also uses this layout. Switching to channels-last may be another way to speed it up further [2].
The figure is from [1].
[1] https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#tensor-layout
[2] https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html
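The channels-last switch from [2] is a one-line change on both the model and the input. A small CPU sketch (the model here is made up; on Ampere-class GPUs with AMP this is where the NHWC Tensor Core kernels kick in):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
x = torch.randn(1, 3, 32, 32)
out_nchw = model(x)  # baseline result in the default NCHW layout

# Convert parameters and input to NHWC ("channels_last"); this changes only
# the memory layout, not the logical tensor shape
model.to(memory_format=torch.channels_last)
x_cl = x.to(memory_format=torch.channels_last)
out_cl = model(x_cl)

# The layout change is a pure performance optimization: results are unchanged
assert torch.allclose(out_nchw, out_cl, atol=1e-5)
```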
Recently I trained a transformer-based instance segmentation model and tested it with different backbones; here are the results and the speed test:
(batchsize is the training batch size.) Why is PoolFormer the slowest one? Is that normal?
It is slower than PVTv2-b1 and its precision is lower...
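One way to check whether the pooling mixer itself is the bottleneck is to time the two token mixers in isolation. A rough (hypothetical) CPU micro-benchmark sketch; absolute numbers depend entirely on hardware, and GPU timing would additionally need torch.cuda.synchronize() around the timed region:

```python
import time
import torch
import torch.nn as nn

def bench(module, x, iters=50):
    """Average forward time per iteration, in seconds (CPU timing sketch)."""
    with torch.no_grad():
        for _ in range(5):  # warmup so lazy allocations don't skew the timing
            module(x)
        t0 = time.perf_counter()
        for _ in range(iters):
            module(x)
        return (time.perf_counter() - t0) / iters

dim = 64  # made-up stage width and input size for illustration
x = torch.randn(8, dim, 56, 56)
pool = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
dw = nn.Conv2d(dim, dim, 3, stride=1, padding=1, groups=dim)

t_pool = bench(pool, x)
t_dw = bench(dw, x)
print(f"avgpool: {t_pool * 1e3:.2f} ms  dwconv: {t_dw * 1e3:.2f} ms")
```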