torch / cutorch

A CUDA backend for Torch7

Severe performance drop with cunn and nngraph #772

dno89 commented 7 years ago

Hi all, I'm experiencing severe performance problems with an nngraph-built network on the GPU. The network has many inputs, each processed by a series of TemporalConvolution layers; all the resulting vectors are joined together and processed by a few nn.Linear layers. What I see is that the forward pass on the GPU is approximately three times slower than the same operation on the CPU. The backward pass doesn't suffer as much, but it is still slower.

Can anyone give me a hint about why this happens and how to avoid it? Thanks

The guilty code

-- nngraph performance drop??
-- conf
n_input = 10
input_size = 400

-- build the net
require 'nngraph'

in_table = {}
proc = {}
nl = nn.ReLU

for ii=1,n_input do
    -- input
    in_table[ii] = nn.Identity()()

    -- proc
    proc[ii] = 
        in_table[ii]
        - nn.TemporalConvolution(1,10, 101, 1)
        - nl()
        - nn.TemporalConvolution(10,10, 101, 1)
        - nl()
        - nn.TemporalConvolution(10,1, 101, 1)
        - nn.View(100)
end

-- stitch everything together
join = nn.JoinTable(1,1)(proc)
output =
    join
    - nn.Linear(n_input*100, 1000)
    - nl()
    - nn.Linear(1000,1000)
    - nl()
    - nn.Linear(1000,10)

-- final net
net = nn.gModule(in_table, {output})

function random_input(cuda)
    local input = {}
    for ii=1,n_input do
        input[ii] = torch.Tensor(input_size,1):normal()
        if cuda then
            input[ii] = input[ii]:cuda()
        end
    end

    return input
end

-- double
net:double()
N=100
cum_fwd = 0
cum_bwd = 0
for ii=1,N do
    local input = random_input()
    local grad = torch.Tensor(10):normal()
    local tmr = torch.Timer()
    net:forward(input)
    cum_fwd=cum_fwd+tmr:time().real
    --print("forward time: " .. tmr:time().real*1000 .. "ms")
    tmr:reset()
    net:backward(input, grad)
    cum_bwd=cum_bwd+tmr:time().real
    --print("backward time: " .. tmr:time().real*1000 .. "ms")
end
print("-- CPU --")
print("average forward time: " .. cum_fwd/N*1000 .. "ms")
print("average backward time: " .. cum_bwd/N*1000 .. "ms")

-- cuda
require 'cunn'

net:cuda()
N=100
cum_fwd = 0
cum_bwd = 0
for ii=1,N do
    local input = random_input(1)
    local grad = torch.Tensor(10):normal():cuda()
    local tmr = torch.Timer()
    net:forward(input)
    cum_fwd=cum_fwd+tmr:time().real
    --print("forward time: " .. tmr:time().real*1000 .. "ms")
    tmr:reset()
    net:backward(input, grad)
    cum_bwd=cum_bwd+tmr:time().real
    --print("backward time: " .. tmr:time().real*1000 .. "ms")
end

print("-- GPU --")
print("average forward time: " .. cum_fwd/N*1000 .. "ms")
print("average backward time: " .. cum_bwd/N*1000 .. "ms")

The hardware

My results:

-- CPU --
average forward time: 38.324513435364ms
average backward time: 63.394954204559ms
-- GPU --
average forward time: 212.77039051056ms
average backward time: 110.11774778366ms
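
(Note: CUDA kernel launches are asynchronous, so per-call timings like the ones above can be skewed unless the device is synchronized before the timer is read. A minimal sketch of a synchronized measurement, using cutorch.synchronize(); `net`, `input` and `grad` refer to the CUDA objects from the GPU loop above:)

require 'cunn'   -- cunn pulls in cutorch, which provides cutorch.synchronize()

local tmr = torch.Timer()
net:forward(input)
cutorch.synchronize()            -- wait for the queued GPU kernels to finish
local fwd_time = tmr:time().real

tmr:reset()
net:backward(input, grad)
cutorch.synchronize()
local bwd_time = tmr:time().real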

Edit 1

Using mini-batches makes the problem worse.
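
(A minimal sketch of one way to batch the inputs, as an assumption rather than the exact batching used in this report: a hypothetical batch_size of 32, with each input given as a 3D batch x frames x features tensor, which TemporalConvolution accepts. The nn.View(100) layers would also need :setNumInputDims(2) so they keep the batch dimension, and the gradient passed to :backward() needs a batch dimension as well.)

batch_size = 32  -- hypothetical batch size, not from the original report

function random_batch_input(cuda)
    local input = {}
    for ii = 1, n_input do
        -- TemporalConvolution accepts 3D input: batch x frames x features
        input[ii] = torch.Tensor(batch_size, input_size, 1):normal()
        if cuda then
            input[ii] = input[ii]:cuda()
        end
    end
    return input
end

-- gradient with a batch dimension:
-- local grad = torch.Tensor(batch_size, 10):normal()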

cheerss commented 6 years ago

I have also noticed that cunn's convolution operations have poor performance. You could try cudnn, which helps a lot; I have seen cudnn run convolutions about 10 times faster than cunn. See here for details
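
(A minimal sketch of switching the network to cudnn, assuming the cudnn.torch bindings are installed; whether cudnn.convert handles TemporalConvolution may depend on the version, so those layers might have to be swapped by hand:)

require 'cudnn'

net:cuda()
cudnn.convert(net, cudnn)   -- replaces supported nn.* modules with their cudnn.* equivalents in place

-- if the TemporalConvolution layers are not converted automatically, they can
-- be created with the cudnn module directly when the network is defined, e.g.
--   cudnn.TemporalConvolution(1, 10, 101, 1)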