Hi all,
I'm experiencing severe performance problems with an nngraph-built network on the GPU.
The network has many inputs, each processed by a series of TemporalConvolution layers; all the resulting vectors are joined together and processed by a few nn.Linear layers.
What I see is that the forward pass on the GPU is approximately 3 times slower than the same operation on the CPU.
The backward pass doesn't suffer as much, but it is still slower.
Can anyone give me a hint about why this happens / how to avoid it?
Thanks
The guilty code
-- nngraph performance drop??
require 'nngraph'

-- conf
n_input = 10
input_size = 400

-- build the net
in_table = {}
proc = {}
nl = nn.ReLU
for ii = 1, n_input do
   -- input
   in_table[ii] = nn.Identity()()
   -- proc: three stacked temporal convolutions per input
   -- (sequence length shrinks 400 -> 300 -> 200 -> 100)
   proc[ii] =
      in_table[ii]
      - nn.TemporalConvolution(1, 10, 101, 1)
      - nl()
      - nn.TemporalConvolution(10, 10, 101, 1)
      - nl()
      - nn.TemporalConvolution(10, 1, 101, 1)
      - nn.View(100)
end
-- stitch everything together
join = nn.JoinTable(1, 1)(proc)
output =
   join
   - nn.Linear(n_input * 100, 1000)
   - nl()
   - nn.Linear(1000, 1000)
   - nl()
   - nn.Linear(1000, 10)

-- final net
net = nn.gModule(in_table, {output})
function random_input(cuda)
   local input = {}
   for ii = 1, n_input do
      input[ii] = torch.Tensor(input_size, 1):normal()
      if cuda then
         input[ii] = input[ii]:cuda()
      end
   end
   return input
end
-- double
net:double()
N = 100
cum_fwd = 0
cum_bwd = 0
for ii = 1, N do
   local input = random_input()
   local grad = torch.Tensor(10):normal()
   local tmr = torch.Timer()
   net:forward(input)
   cum_fwd = cum_fwd + tmr:time().real
   --print("forward time: " .. tmr:time().real*1000 .. "ms")
   tmr:reset()
   net:backward(input, grad)
   cum_bwd = cum_bwd + tmr:time().real
   --print("backward time: " .. tmr:time().real*1000 .. "ms")
end
print("-- CPU --")
print("average forward time: " .. cum_fwd/N*1000 .. "ms")
print("average backward time: " .. cum_bwd/N*1000 .. "ms")
-- cuda
require 'cunn'
net:cuda()
N = 100
cum_fwd = 0
cum_bwd = 0
for ii = 1, N do
   local input = random_input(1)
   local grad = torch.Tensor(10):normal():cuda()
   local tmr = torch.Timer()
   net:forward(input)
   cutorch.synchronize() -- kernels run asynchronously; sync before reading the timer
   cum_fwd = cum_fwd + tmr:time().real
   --print("forward time: " .. tmr:time().real*1000 .. "ms")
   tmr:reset()
   net:backward(input, grad)
   cutorch.synchronize()
   cum_bwd = cum_bwd + tmr:time().real
   --print("backward time: " .. tmr:time().real*1000 .. "ms")
end
print("-- GPU --")
print("average forward time: " .. cum_fwd/N*1000 .. "ms")
print("average backward time: " .. cum_bwd/N*1000 .. "ms")
The hardware
OS: Ubuntu 14.04.5 LTS
GPU: NVIDIA Tesla K80, Driver 361.77
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
My results:
-- CPU --
average forward time: 38.324513435364ms
average backward time: 63.394954204559ms
-- GPU --
average forward time: 212.77039051056ms
average backward time: 110.11774778366ms
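
To check where the time goes, one could also time a single TemporalConvolution in isolation (a minimal sketch, not from the run above; the input sizes are illustrative and match the second convolution in the net):

-- time one cunn TemporalConvolution alone, synchronizing around the loop
require 'cunn'
local conv = nn.TemporalConvolution(10, 10, 101, 1):cuda()
local x = torch.CudaTensor(300, 10):normal() -- 300 frames, 10 channels
conv:forward(x) -- warm-up
cutorch.synchronize()
local tmr = torch.Timer()
for i = 1, 100 do
   conv:forward(x)
end
cutorch.synchronize()
print("single conv forward: " .. tmr:time().real / 100 * 1000 .. "ms")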
I have also noticed that cunn's convolution operations perform poorly. You could try cudnn, which improves things a lot; I have seen cudnn run convolutions 10 times faster than cunn. See here for details.
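
For what it's worth, a minimal sketch of the conversion (assuming the cudnn.torch bindings are installed; whether TemporalConvolution is actually converted depends on the bindings' version):

-- convert supported nn modules to their cudnn counterparts in place
require 'cudnn'
net:cuda()
cudnn.convert(net, cudnn)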
Edit 1
Using mini-batches makes the problem worse.
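
For reference, a sketch of what I mean by mini-batches (nn.TemporalConvolution also accepts 3D input of size nBatch x nInputFrame x inputFrameSize; random_batch and batch_size are illustrative):

-- batched variant of random_input (hypothetical helper)
function random_batch(batch_size, cuda)
   local input = {}
   for ii = 1, n_input do
      input[ii] = torch.Tensor(batch_size, input_size, 1):normal()
      if cuda then
         input[ii] = input[ii]:cuda()
      end
   end
   return input
end

The gradient passed to net:backward then has to be batch_size x 10 accordingly.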