JacobLau84 closed this 4 years ago
Which hardware did you deploy on? ARM, CPU, or GPU?
On ARM and CPU, GhostNet is faster and friendlier.
Thank you for your reply.
I deploy on an ARM CPU in TFLite format, and I wrote a Keras version of GhostModule. Is there anything wrong with this module?
import math
import tensorflow as tf

class GhostConv2D(tf.keras.layers.Layer):
    def __init__(self, oup, k_size=3, ratio=2, dw_size=3, s=1, ub=True):
        super(GhostConv2D, self).__init__()
        # The primary conv produces ceil(oup / ratio) feature maps.
        conv_out_channel = math.ceil(oup * 1.0 / ratio)
        # Channels produced by the depthwise conv (set via depth_multiplier below).
        depthconv_out_channel = conv_out_channel * (ratio - 1)
        self.conv_1 = tf.keras.layers.Conv2D(conv_out_channel, kernel_size=k_size,
                                             strides=s, padding='same', use_bias=ub)
        self.bn_1 = tf.keras.layers.BatchNormalization(epsilon=1e-05, momentum=0.9, fused=False)
        self.relu_1 = tf.keras.layers.ReLU()
        self.conv_2 = tf.keras.layers.DepthwiseConv2D(dw_size, depth_multiplier=(ratio - 1),
                                                      padding='same', use_bias=ub)
        self.bn_2 = tf.keras.layers.BatchNormalization(epsilon=1e-05, momentum=0.9, fused=False)
        self.relu_2 = tf.keras.layers.ReLU()
        self.concat = tf.keras.layers.Concatenate(axis=-1)

    def call(self, x):
        # Primary branch: ordinary convolution.
        x1 = self.conv_1(x)
        x1 = self.bn_1(x1)
        x1 = self.relu_1(x1)
        # Cheap branch: depthwise convolution on the primary features.
        x2 = self.conv_2(x1)
        x2 = self.bn_2(x2)
        x2 = self.relu_2(x2)
        x = self.concat([x1, x2])
        return x
What baseline did you compare to? Ghost module vs. Conv2d?
One question: why not merge BN&ReLU into Conv?
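For readers unfamiliar with the term: "merging" BN into Conv usually means folding the inference-time BatchNorm scale and shift into the convolution's weights and bias, so the BN layer disappears from the deployed graph. Below is a minimal sketch of that arithmetic for a 1x1 convolution at a single pixel (a plain linear map). The helper names `fold_bn`, `linear`, and `batchnorm` and all numeric values are made up for illustration; this is not code from the thread or from any library.

```python
import math

def linear(weights, bias, x):
    """1x1 conv at one pixel: weights is [out][in], x is [in]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm using fixed running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weights, bias) of a single layer equivalent to conv followed by BN."""
    folded_w, folded_b = [], []
    for o, row in enumerate(weights):
        scale = gamma[o] / math.sqrt(var[o] + eps)
        folded_w.append([w * scale for w in row])
        folded_b.append(beta[o] + (bias[o] - mean[o]) * scale)
    return folded_w, folded_b

# Illustrative values only (2 output channels, 3 input channels).
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0, -1.0]

# Conv -> BN -> ReLU as three separate steps.
unfused = [max(0.0, v) for v in batchnorm(linear(weights, bias, x), gamma, beta, mean, var)]

# Folded conv -> ReLU: one fewer pass over the activations.
fw, fb = fold_bn(weights, bias, gamma, beta, mean, var)
fused = [max(0.0, v) for v in linear(fw, fb, x)]

print(unfused, fused)
```

The two results should agree up to floating-point rounding. Whether folding changes the relative speed of the Ghost module and the baseline depends on the runtime, so it is worth measuring both ways.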
GhostConv2D(oup, k_size, s=s, ub=False)
Compare to the baseline module:
tf.keras.layers.Conv2D(oup, k_size, s, use_bias=False)
tf.keras.layers.BatchNormalization(epsilon=1e-05, momentum=0.9,fused=False)
tf.keras.layers.ReLU()
I did not merge BN & ReLU into the baseline module, but the Ghost module is still slower than the baseline module.
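For context on what the comparison above would be expected to show on paper, here is a rough sketch of the multiply-accumulate (MAC) counts for the two modules. It assumes stride 1 and 'same' padding, and ignores BN, ReLU, and the concatenation; the function names and the example layer shape are hypothetical, chosen only for illustration.

```python
import math

def conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard k x k convolution with an h x w output."""
    return h * w * c_in * c_out * k * k

def ghost_macs(h, w, c_in, c_out, k, ratio=2, dw_size=3):
    """MACs of a Ghost module: primary conv plus depthwise 'cheap' conv."""
    init_channels = math.ceil(c_out / ratio)
    new_channels = init_channels * (ratio - 1)
    primary = h * w * c_in * init_channels * k * k
    cheap = h * w * new_channels * dw_size * dw_size  # one filter per output channel
    return primary + cheap

# Hypothetical layer: 56x56 feature map, 64 -> 128 channels, 3x3 kernel.
baseline = conv_macs(56, 56, 64, 128, 3)
ghost = ghost_macs(56, 56, 64, 128, 3, ratio=2, dw_size=3)
print(f"baseline MACs: {baseline}")
print(f"ghost MACs:    {ghost}")
print(f"theoretical reduction: {baseline / ghost:.2f}x")
```

A MAC reduction close to `ratio` does not guarantee a matching latency reduction. Depthwise convolution and concatenation tend to be limited by memory traffic and by how well the runtime implements them, which is consistent with the interpreter-version finding reported later in this thread.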
We test it in single-threaded mode. What's your setting?
Did you test one single layer or an entire network?
Hi, I tested it in PyTorch today and got the same result. I used the following neural network for the test: https://github.com/PingoLH/Pytorch-HarDNet/blob/master/hardnet.py#L38
I only replaced the ConvLayer module at line 38 with the Ghost module.
# Ghost module
import math
import torch
import torch.nn as nn

class ConvLayer(nn.Module):
    def __init__(self, in_channels, out_channels, kernel=1, stride=1, dropout=0.1,
                 bias=False, ratio=2, dw_size=3, relu=True):
        super(ConvLayer, self).__init__()
        inp = in_channels
        oup = out_channels
        kernel_size = kernel
        self.oup = oup
        init_channels = math.ceil(oup / ratio)
        new_channels = init_channels * (ratio - 1)

        # Primary branch: ordinary convolution producing init_channels maps.
        self.primary_conv = nn.Sequential(
            nn.Conv2d(inp, init_channels, kernel_size, stride, kernel_size // 2, bias=bias),
            nn.BatchNorm2d(init_channels),
            nn.ReLU6(inplace=True) if relu else nn.Sequential(),
        )

        # Cheap branch: depthwise convolution (groups=init_channels).
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size // 2,
                      groups=init_channels, bias=bias),
            nn.BatchNorm2d(new_channels),
            nn.ReLU6(inplace=True) if relu else nn.Sequential(),
        )

    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        out = torch.cat([x1, x2], dim=1)
        # Drop any surplus channels when oup is not divisible by ratio.
        return out[:, :self.oup, :, :]
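A side note on the final slice in `forward`: because `init_channels` is rounded up, the concatenated tensor can have more channels than `oup`, and the slice trims the surplus. The Keras version earlier in the thread returns the concatenation without slicing, so its output width can differ from `oup` when `oup` is not divisible by `ratio`. The sketch below just works through that bookkeeping; `ghost_channels` is a hypothetical helper, not part of either module.

```python
import math

def ghost_channels(oup, ratio=2):
    """Return (init_channels, new_channels, concatenated_channels)."""
    init_channels = math.ceil(oup / ratio)
    new_channels = init_channels * (ratio - 1)
    return init_channels, new_channels, init_channels + new_channels

for oup, ratio in [(64, 2), (65, 2), (100, 3)]:
    init_c, new_c, total = ghost_channels(oup, ratio)
    surplus = total - oup  # channels computed but then discarded by the slice
    print(f"oup={oup} ratio={ratio}: init={init_c} new={new_c} "
          f"concat={total} surplus={surplus}")
```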
I then tested it with the following code:
import torch
#from hardnet_ghost import *
from hardnet import *
import time

model = HarDNet(True, 39, pretrained=False)
model.eval()
model = model.cpu()
input = torch.randn(1, 3, 224, 224)

# One warm-up forward pass before timing.
y = model(input.cpu())

t2 = time.time()
for i in range(100):
    y = model(input.cpu())
t3 = time.time()
print('FPS: ' + str(100 / (t3 - t2)))
I tested it on the CPU and GPU.
CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (8 cores, 16 threads)
Result: 60 FPS (original), 46 FPS (Ghost version)
GPU: RTX 2080 Ti
Result: 160 FPS (original), 100 FPS (Ghost version)
I didn't test it in single-threaded mode. Does the Ghost module only perform better in single-threaded mode?
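For anyone repeating this measurement, a slightly more careful timing loop may make the numbers easier to compare. The sketch below is a generic helper, not the script used above: `benchmark` and its `sync` hook are hypothetical names, and the dummy workload stands in for a model's forward pass.

```python
import time

def benchmark(fn, warmup=10, iters=100, sync=None):
    """Return the average number of fn() calls per second.

    sync is an optional callable that blocks until asynchronous work has
    finished (assumption: something like a GPU synchronize function).
    """
    for _ in range(warmup):      # warm-up runs are excluded from timing
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()  # monotonic, high-resolution clock
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()                   # make sure queued work is counted
    elapsed = time.perf_counter() - start
    return iters / elapsed if elapsed > 0 else float("inf")

# Dummy workload standing in for model(input).
fps = benchmark(lambda: sum(range(1000)), warmup=5, iters=50)
print(f"calls per second: {fps:.1f}")
```

With PyTorch, one would typically run the forward pass under `torch.no_grad()`, call `torch.set_num_threads(1)` beforehand for a single-threaded CPU measurement, and pass `torch.cuda.synchronize` as `sync` when timing on a GPU, since CUDA kernels launch asynchronously.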
The dynamic graph mechanism of PyTorch is unfriendly for deployment. TFLite/NCNN/Caffe/Darknet may be better.
I used the Keras code above to convert the model to TFLite format and tested it on a Raspberry Pi 4, but it was still slower than the original version.
@iamhankai I found the problem: it comes down to the version of the TFLite interpreter. Depthwise convolution in the TFLite interpreter v1 is much slower than in v2.
Good observation! This may help other researchers.
The neural network slowed down when I only replaced the Conv2d layers in my net with GhostModule. Did you get the same results?