Open zjykzj opened 3 years ago
The code is exactly the same as the paper (channel shuffle and ReLU). Please note that relu+avgpool+relu = relu+avgpool: the ReLU output is non-negative, and averaging non-negative values keeps them non-negative, so the trailing ReLU is a no-op.
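A quick numerical check of that identity (a minimal sketch; the shapes and pooling parameters are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 8, 8)
pooled = F.avg_pool2d(F.relu(x), kernel_size=3, stride=2, padding=1)

# ReLU output is non-negative and averaging keeps it non-negative,
# so a second ReLU after the pooling is a no-op:
assert torch.equal(F.relu(pooled), pooled)
```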
nice work!!! Let me briefly describe your implementation:
there are three stages in ShuffleNetV1, and the resolution downsampling is performed in the first block of each stage (using AvgPool on the shortcut)
for the first stage, the upstream operations are Conv2d -> BN -> ReLU -> MaxPool, so there is no need to apply an extra ReLU to the identity branch
for the following stages, the upstream operation is a block that ends like this:
```python
if self.stride == 1:
    return F.relu(x + x_proj)
```
so ReLU + AvgPool = ReLU + AvgPool + ReLU, and again no extra activation is needed for the identity branch; see the sketch below
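To make this concrete, here is a minimal sketch of the stride=2 equivalence (the shapes and the branch output are made up for illustration): because the shortcut input already passed through an upstream ReLU, pooling keeps it non-negative, so the paper's post-concat ReLU collapses into a ReLU on the main branch only.

```python
import torch
import torch.nn.functional as F

# x stands in for the stage input, which already passed an upstream ReLU.
x = F.relu(torch.randn(1, 4, 8, 8))
# branch_out stands in for the main-branch output before any activation.
branch_out = torch.randn(1, 12, 4, 4)

shortcut = F.avg_pool2d(x, kernel_size=3, stride=2, padding=1)

# Paper: one ReLU after the concat. Repo style: ReLU on the main branch only.
paper_style = F.relu(torch.cat([shortcut, branch_out], dim=1))
repo_style = torch.cat([shortcut, F.relu(branch_out)], dim=1)
assert torch.equal(paper_style, repo_style)
```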
There is a similar trick in the ShuffleNetV2 implementation here. The paper describes the block like this:
> when stride=1, separating feature maps using channel split operations; this can be done by channel shuffle operation
so the last ShuffleV2Block's channel shuffle operation can be omitted, right? @nmaac
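For reference, a minimal sketch of the channel shuffle being discussed (the standard reshape/transpose trick; the function name is my own): with groups=2 it interleaves the two halves, which is what lets the next block's channel split see features from both branches.

```python
import torch

def channel_shuffle(x, groups=2):
    # (N, C, H, W) -> (N, g, C/g, H, W) -> swap the two channel axes -> flatten
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous()
             .view(n, c, h, w))

# The two halves end up interleaved, so the next block's channel split
# (taking the first C/2 channels) receives features from both branches:
x = torch.arange(8.0).view(1, 8, 1, 1)
print(channel_shuffle(x).flatten())  # tensor([0., 4., 1., 5., 2., 6., 3., 7.])
```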
The last block is followed by a fully connected layer; therefore the additional channel shuffle has no effect and can be omitted.
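A minimal sketch of why (my own illustration, not the repo's code): the FC layer can absorb any fixed channel permutation into its weight columns, so shuffling right before it changes nothing the network can express.

```python
import torch

torch.manual_seed(0)
c, n_classes = 8, 10
feat = torch.randn(1, c)      # globally pooled features
perm = torch.randperm(c)      # a channel shuffle is just a fixed permutation

fc = torch.nn.Linear(c, n_classes)
fc_absorbed = torch.nn.Linear(c, n_classes)
with torch.no_grad():
    # Reorder the weight columns by the inverse permutation.
    fc_absorbed.weight.copy_(fc.weight[:, torch.argsort(perm)])
    fc_absorbed.bias.copy_(fc.bias)

# Shuffle-then-FC gives the same logits as the reweighted FC alone.
assert torch.allclose(fc(feat[:, perm]), fc_absorbed(feat))
```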
nice!!!
hi @nmaac @megvii-model, in ShuffleNetV1 there are two modifications in this implementation compared with the paper
one is the channel shuffle before/after the dwconv, which was fixed in #16 and #40
the other is the usage of ReLU in shufflenetv1_unit.py here
when stride=2, it differs from the paper's description and also from common usage. I want to know whether this is a better design or just a mistake. Looking forward to your reply