megvii-model / ShuffleNet-Series


A mismatch in shufflenetv1 about ReLU #53

Open zjykzj opened 3 years ago

zjykzj commented 3 years ago

hi @nmaac @megvii-model, in ShuffleNetV1 there are two places where this implementation differs from the paper:

one is the position of the channel shuffle before/after the dwconv; this was fixed in #16 #40

two is the usage of ReLU in shufflenetv1_unit.py here:

        if self.stride == 1:
            # residual case: ReLU is applied after the addition
            return F.relu(x + x_proj)
        elif self.stride == 2:
            # downsampling case: ReLU is applied only to the main branch;
            # the AvgPool projection branch is concatenated without an
            # activation, and no ReLU follows the concat
            return torch.cat((self.branch_proj(x_proj), F.relu(x)), 1)

When stride=2, this differs from the paper's description (the paper applies the ReLU after the concat, to both branches), and it also differs from common usage. I want to know whether this is a better design or just a mistake. Looking forward to your reply.

nmaac commented 3 years ago

The code is exactly the same as the paper (for both channel shuffle and ReLU). Please note that relu + avgpool + relu = relu + avgpool.
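A quick way to see this identity: the input to the pooling is already a ReLU output, hence non-negative, and averaging non-negative values stays non-negative, so a trailing ReLU is a no-op. A minimal check in PyTorch (the pooling parameters here are arbitrary):

    import torch
    import torch.nn.functional as F

    y = F.relu(torch.randn(1, 4, 8, 8))  # already non-negative
    pooled = F.avg_pool2d(y, kernel_size=3, stride=2, padding=1)
    # averaging non-negative values keeps them non-negative,
    # so a second ReLU after the pooling changes nothing
    assert torch.equal(F.relu(pooled), pooled)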

zjykzj commented 3 years ago

> The code is exactly the same as the paper (for both channel shuffle and ReLU). Please note that relu + avgpool + relu = relu + avgpool.

nice work !!! Let me briefly describe your implementation:

There are three stages in ShuffleNetV1, and the resolution downsampling is performed by the first block of each stage (using AvgPool on the projection branch).

For the first stage, the upstream path is Conv2d -> BN -> ReLU -> MaxPool, so the input is already non-negative (MaxPool preserves non-negativity) and there is no need to apply an extra ReLU to the identity map.

For the following stages, the upstream operation is a block ending like this:

        if self.stride == 1:
            # every stride-1 block ends with a ReLU, so its output is non-negative
            return F.relu(x + x_proj)

so ReLU + AvgPool = ReLU + AvgPool + ReLU, and again there is no need to apply an extra activation to the identity map
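Putting the two facts together: ReLU is elementwise, so it distributes over concatenation, and the pooled branch is already non-negative. A minimal sketch (shapes and pooling parameters are arbitrary, not the repo's exact code) checking that the repo's stride=2 output matches the paper's ReLU-after-concat form:

    import torch
    import torch.nn.functional as F

    x_proj = F.relu(torch.randn(1, 4, 8, 8))  # shortcut input, already non-negative
    main = torch.randn(1, 4, 4, 4)            # main-branch output, pre-ReLU
    shortcut = F.avg_pool2d(x_proj, 3, stride=2, padding=1)

    repo_out = torch.cat((shortcut, F.relu(main)), 1)   # ReLU on main branch only
    paper_out = F.relu(torch.cat((shortcut, main), 1))  # ReLU after the concat
    assert torch.equal(repo_out, paper_out)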

zjykzj commented 3 years ago

There is a similar trick in the ShuffleNetV2 implementation here. The paper describes the block like this:

[Figure 3 of the ShuffleNetV2 paper: the building blocks, including channel split (stride=1) and the stride=2 unit]

When stride=1, the paper separates the feature maps with a channel split operation; in this implementation that split is folded into the channel shuffle operation, which returns the two halves directly.
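For illustration, a sketch of such a fused shuffle-and-split (the function name and exact reshape scheme follow common ShuffleNetV2 implementations, not necessarily this repo's exact code):

    import torch

    def channel_shuffle_and_split(x):
        # fused channel shuffle + channel split: interleave pairs of
        # channels, then peel them apart into two halves
        batchsize, num_channels, height, width = x.shape
        assert num_channels % 4 == 0
        x = x.reshape(batchsize * num_channels // 2, 2, height * width)
        x = x.permute(1, 0, 2)
        x = x.reshape(2, -1, num_channels // 2, height, width)
        return x[0], x[1]  # "identity" branch and "main" branch

    x = torch.randn(1, 8, 4, 4)
    a, b = channel_shuffle_and_split(x)
    # even-indexed channels form one half, odd-indexed the other
    assert torch.equal(a, x[:, 0::2]) and torch.equal(b, x[:, 1::2])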

zjykzj commented 3 years ago

so the last ShuffleV2Block's channel shuffle operation can be ignored, right?? @nmaac

nmaac commented 3 years ago

The last block is followed by a fully connected layer, therefore the additional channel shuffle has no effect and can be omitted.
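To see why a fixed permutation right before the classifier adds nothing: the fully connected layer can absorb any channel permutation into its weight columns. A small check (sizes are arbitrary; this operates on already-pooled features for simplicity):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(2, 8)     # globally pooled features
    perm = torch.randperm(8)  # what a channel shuffle does: a fixed permutation

    fc = nn.Linear(8, 10)
    fc_equiv = nn.Linear(8, 10)
    # absorb the permutation into the weight columns
    fc_equiv.weight.data.copy_(fc.weight.data[:, torch.argsort(perm)])
    fc_equiv.bias.data.copy_(fc.bias.data)

    # shuffle-then-fc computes the same function as an equivalent fc alone
    assert torch.allclose(fc(x[:, perm]), fc_equiv(x))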

zjykzj commented 3 years ago

> The last block is followed by a fully connected layer, therefore the additional channel shuffle has no effect and can be omitted.

nice!!!