The neural network slowed down after only replacing the Conv2d layers in my net with GhostModule.

JacobLau84 commented 4 years ago

The neural network slowed down after I replaced only the Conv2d layers in my net with GhostModule. Did you get the same results?

iamhankai commented 4 years ago

Which hardware did you deploy on? ARM, CPU, or GPU?

On ARM and CPU, GhostNet is faster and more friendly.

JacobLau84 commented 4 years ago

Thank you for your reply.

I deploy on an ARM CPU in TFLite format, and I wrote a Keras version of GhostModule. Is there anything wrong with this module?

import math

import tensorflow as tf

class GhostConv2D(tf.keras.layers.Layer):
    def __init__(self, oup, k_size=3, ratio=2, dw_size=3, s=1, ub=True):
        super(GhostConv2D, self).__init__()

        # The primary conv produces ceil(oup / ratio) intrinsic feature maps.
        conv_out_channel = math.ceil(oup * 1.0 / ratio)

        self.conv_1 = tf.keras.layers.Conv2D(conv_out_channel, kernel_size=k_size,
                                             strides=s, padding='same', use_bias=ub)
        self.bn_1 = tf.keras.layers.BatchNormalization(epsilon=1e-05, momentum=0.9, fused=False)
        self.relu_1 = tf.keras.layers.ReLU()
        # The depthwise conv derives (ratio - 1) ghost maps from each intrinsic map.
        self.conv_2 = tf.keras.layers.DepthwiseConv2D(dw_size, depth_multiplier=(ratio - 1),
                                                      padding='same', use_bias=ub)
        self.bn_2 = tf.keras.layers.BatchNormalization(epsilon=1e-05, momentum=0.9, fused=False)
        self.relu_2 = tf.keras.layers.ReLU()
        self.concat = tf.keras.layers.Concatenate(axis=-1)

    def call(self, x):
        # Primary branch: Conv -> BN -> ReLU.
        x1 = self.conv_1(x)
        x1 = self.bn_1(x1)
        x1 = self.relu_1(x1)
        # Cheap branch: depthwise Conv -> BN -> ReLU on the primary output.
        x2 = self.conv_2(x1)
        x2 = self.bn_2(x2)
        x2 = self.relu_2(x2)
        # Concatenate intrinsic and ghost maps along the channel axis.
        x = self.concat([x1, x2])

        return x

iamhankai commented 4 years ago

Which baseline did you compare against? Ghost module vs. Conv2d?

iamhankai commented 4 years ago

One question: why not merge BN&ReLU into Conv?
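
As background for this suggestion: at inference time a batch-normalization layer is a fixed per-channel affine transform, so it can be folded into the weights and bias of the preceding convolution. The following is a minimal pure-Python sketch for a 1x1 convolution evaluated at a single pixel. The function names (`conv1x1`, `batch_norm`, `fold_bn`) and all numeric values are illustrative assumptions, not part of the GhostNet code.

```python
import math

def conv1x1(weights, bias, x):
    """Apply a 1x1 convolution at one pixel: y[o] = sum_i w[o][i] * x[i] + b[o]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch norm using stored running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weights, bias) of one conv equivalent to conv followed by BN."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    folded_w = [[w * sc for w in row] for row, sc in zip(weights, scale)]
    folded_b = [(b - m) * sc + be
                for b, m, sc, be in zip(bias, mean, scale, beta)]
    return folded_w, folded_b

# Hypothetical values: 2 output channels, 3 input channels.
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [2.0, 0.5]
x = [1.0, 2.0, 3.0]

reference = batch_norm(conv1x1(weights, bias, x), gamma, beta, mean, var)
folded_w, folded_b = fold_bn(weights, bias, gamma, beta, mean, var)
folded = conv1x1(folded_w, folded_b, x)
print(reference, folded)  # the two results should agree up to rounding
```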

JacobLau84 commented 4 years ago

GhostConv2D(oup, k_size=k_size, s=s, ub=False)

Compared to the baseline module:

tf.keras.layers.Conv2D(oup, k_size, s, use_bias=False)
tf.keras.layers.BatchNormalization(epsilon=1e-05, momentum=0.9,fused=False)
tf.keras.layers.ReLU()

I did not merge BN and ReLU into the baseline module, yet the Ghost module is still slower than the baseline.
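
For reference, the theoretical cost of the two modules can be compared by counting multiply-accumulates (MACs). The sketch below assumes stride 1 and 'same' padding, ignores BN, ReLU, and the concatenation, and uses a hypothetical layer shape; `conv_macs` and `ghost_macs` are illustrative helper names.

```python
import math

def conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

def ghost_macs(h, w, c_in, c_out, k, ratio=2, dw_size=3):
    """MACs of a Ghost module: primary conv plus depthwise 'cheap' conv."""
    init_channels = math.ceil(c_out / ratio)
    new_channels = init_channels * (ratio - 1)
    primary = h * w * c_in * init_channels * k * k
    # Depthwise conv: each output channel reads a single input channel.
    cheap = h * w * new_channels * dw_size * dw_size
    return primary + cheap

# Hypothetical layer: 56x56 feature map, 64 -> 128 channels, 3x3 kernel.
standard = conv_macs(56, 56, 64, 128, 3)
ghost = ghost_macs(56, 56, 64, 128, 3)
print(f"standard: {standard:,} MACs, ghost: {ghost:,} MACs, "
      f"reduction: {standard / ghost:.2f}x")
```

A lower MAC count does not by itself guarantee lower latency: the depthwise convolution and the concatenation can be limited by memory traffic and by how well the runtime implements them, which is one plausible reason a measurement can disagree with this estimate.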

iamhankai commented 4 years ago

We test it in single-threaded mode. What's your setting?

Did you test one single layer or an entire network?

JacobLau84 commented 4 years ago

Hi, I tested it in PyTorch today and got the same result. I used the following neural network for the test. https://github.com/PingoLH/Pytorch-HarDNet/blob/master/hardnet.py#L38

I only changed the ConvLayer module at line 38 to the Ghost module.

# Ghost module
import math

import torch
import torch.nn as nn

class ConvLayer(nn.Module):
    # `dropout` is unused; it is kept so the signature matches the original ConvLayer.
    def __init__(self, in_channels, out_channels, kernel=1, stride=1, dropout=0.1,
                 bias=False, ratio=2, dw_size=3, relu=True):
        super(ConvLayer, self).__init__()
        inp = in_channels
        oup = out_channels
        kernel_size = kernel
        self.oup = oup
        # Intrinsic maps from the primary conv, ghost maps from the cheap operation.
        init_channels = math.ceil(oup / ratio)
        new_channels = init_channels * (ratio - 1)

        self.primary_conv = nn.Sequential(
            nn.Conv2d(inp, init_channels, kernel_size, stride, kernel_size // 2, bias=bias),
            nn.BatchNorm2d(init_channels),
            nn.ReLU6(inplace=True) if relu else nn.Sequential(),
        )

        # Depthwise conv: groups=init_channels, so each filter sees one input channel.
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size // 2,
                      groups=init_channels, bias=bias),
            nn.BatchNorm2d(new_channels),
            nn.ReLU6(inplace=True) if relu else nn.Sequential(),
        )

    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        out = torch.cat([x1, x2], dim=1)
        # Drop surplus channels when oup is not a multiple of ratio.
        return out[:, :self.oup, :, :]
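
The channel arithmetic in this module can be checked in isolation. The sketch below reproduces the `init_channels` and `new_channels` computation and reports how many concatenated channels the final slice discards; `ghost_channels` is an illustrative helper name and the output widths are hypothetical. The Keras version posted earlier in the thread has no such slice, so under the same arithmetic it would emit the surplus channels as well.

```python
import math

def ghost_channels(oup, ratio=2):
    """Return (init_channels, new_channels, surplus) for a Ghost module.

    surplus is the number of concatenated channels that the final slice
    out[:, :oup] discards when oup is not a multiple of ratio.
    """
    init_channels = math.ceil(oup / ratio)
    new_channels = init_channels * (ratio - 1)
    surplus = init_channels + new_channels - oup
    return init_channels, new_channels, surplus

for oup in (64, 39, 100):  # hypothetical output widths
    print(oup, ghost_channels(oup), ghost_channels(oup, ratio=3))
```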

I then tested it with the following code:

import time

import torch
# from hardnet_ghost import *  # Ghost-module variant
from hardnet import *  # original HarDNet

model = HarDNet(True, 39, pretrained=False)
model.eval()
model = model.cpu()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():  # inference only, no autograd bookkeeping
    y = model(x)  # warm-up run, excluded from the timing

    t2 = time.perf_counter()
    for i in range(100):
        y = model(x)
    t3 = time.perf_counter()

print('FPS: ' + str(100 / (t3 - t2)))

I tested it on the CPU and GPU.

CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (8 cores, 16 threads)
Result: 60 FPS (original), 46 FPS (Ghost version)
GPU: RTX 2080 Ti
Result: 160 FPS (original), 100 FPS (Ghost version)

I didn't test it in single-threaded mode. Does the Ghost module only perform better in single-threaded mode?
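
A framework-independent timing helper makes such comparisons easier to repeat under identical conditions. The sketch below is a hedged example, not the setup used by either participant: `measure_fps` is a hypothetical helper, and the workload is a stand-in so the code runs without PyTorch. For a PyTorch CPU run, calling `torch.set_num_threads(1)` beforehand would restrict intra-op parallelism to one thread, and for a CUDA run `torch.cuda.synchronize` would be passed as `sync`, because GPU kernels are launched asynchronously.

```python
import time

def measure_fps(fn, warmup=10, iters=100, sync=None):
    """Time repeated calls of fn and return calls per second.

    sync is an optional callable run before each clock reading, so that
    asynchronous work is finished before it is timed.
    """
    for _ in range(warmup):  # warm-up calls are excluded from the timing
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Stand-in workload; replace with e.g. lambda: model(x) in a real test.
fps = measure_fps(lambda: sum(i * i for i in range(1000)), warmup=5, iters=50)
print(f"{fps:.1f} calls/s")
```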

iamhankai commented 4 years ago

The dynamic graph mechanism of PyTorch is unfriendly for deployment. TFLite/NCNN/Caffe/Darknet may be better.

Reference: https://github.com/AlexeyAB/darknet/issues/4406

JacobLau84 commented 4 years ago

I used the Keras code from my earlier comment to convert the model to TFLite format and tested it on a Raspberry Pi 4, but it was still slower than the original version.

JacobLau84 commented 4 years ago

@iamhankai I found the problem. It is caused by the TFLite interpreter version: the depthwise convolution in the v1 interpreter is much slower than in v2.

iamhankai commented 4 years ago

Good observation! This may help other researchers.

iamhankai / ghostnet.pytorch

The neural network slowed down after only replacing the Conv2d layers in my net with GhostModule. #32