hughperkins / cltorch

An OpenCL backend for torch.

OpenCL 7 times slower than CPU? #67

Closed coodoo closed 8 years ago

coodoo commented 8 years ago

While benchmarking the following script, I was quite surprised to find that with OpenCL enabled it runs about seven times slower than on the CPU. Any idea why?

require 'rnn'
require 'optim'
require 'xlua'   -- for the xlua.progress bar used in the training loop below
require 'cltorch'
require 'clnn'
cltorch.setDevice(1)

-- Hyperparameters
local inputSize = 5
local hiddenSize = 5
local nClasses = 4
local nIndex = 10
local maxSeqLen = 20
local nSamples = 50
local nEpochs = 10

-- Creating dummy dataset
local sentences = {}
local targets = {}
for i=1,nSamples do
  local seqLen = torch.random(maxSeqLen)
  local seq = torch.Tensor(seqLen):random(nIndex)
  local target = torch.random(nClasses)
  sentences[i] = seq
  targets[i] = target
end

-- Defining model
local rnn = nn.Sequential()
  :add(nn.LookupTable(nIndex, hiddenSize))
  :add(nn.SplitTable(1,2))
  :add(nn.Sequencer(nn.FastLSTM(hiddenSize, hiddenSize)))
  :add(nn.SelectTable(-1))
  :add(nn.Linear(hiddenSize, nClasses))
  :add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()
local params, gradParams = rnn:getParameters()

local useCL = true -- toggle OpenCL
if useCL then
    rnn:cl()
    criterion:cl()
end

-- Training and gradient checking
for e=1,nEpochs do

  for i=1,nSamples do

    xlua.progress(i, nSamples)

    local feval = function(x)
      rnn:zeroGradParameters()

      local s = sentences[i]
      local t = targets[i]
      if useCL then s = s:cl() end

      local output = rnn:forward( s )
      local loss = criterion:forward( output, t )
      local gradOutput = criterion:backward( output, t )
      local gradInput = rnn:backward( s, gradOutput)

      return loss, gradParams
    end

    local err = optim.sgd(feval, params)

  end
end

Result - OpenCL

Using Apple , OpenCL platform: Apple
Using OpenCL device: Iris
 [=============================================== 50/50 Tot: 2s613ms | Step: 53ms      
 [=============================================== 50/50 Tot: 1s504ms | Step: 30ms      
 [=============================================== 50/50 Tot: 1s500ms | Step: 30ms      
 [=============================================== 50/50 Tot: 1s525ms | Step: 31ms      
 [=============================================== 50/50 Tot: 1s502ms | Step: 30ms      
 [=============================================== 50/50 Tot: 1s549ms | Step: 31ms      
 [=============================================== 50/50 Tot: 1s549ms | Step: 31ms      
 [=============================================== 50/50 Tot: 1s507ms | Step: 30ms      
 [=============================================== 50/50 Tot: 1s505ms | Step: 30ms      
 [=============================================== 50/50 Tot: 1s496ms | Step: 30ms      

Result - CPU

 [============================================ 50/50 =>]Tot: 344ms | Step: 7ms         
 [============================================ 50/50 =>]Tot: 245ms | Step: 5ms         
 [============================================ 50/50 =>]Tot: 234ms | Step: 4ms         
 [============================================ 50/50 =>]Tot: 244ms | Step: 4ms         
 [============================================ 50/50 =>]Tot: 255ms | Step: 5ms         
 [============================================ 50/50 =>]Tot: 255ms | Step: 5ms         
 [============================================ 50/50 =>]Tot: 237ms | Step: 4ms         
 [============================================ 50/50 =>]Tot: 247ms | Step: 5ms         
 [============================================ 50/50 =>]Tot: 246ms | Step: 5ms         
 [============================================ 50/50 =>]Tot: 249ms | Step: 5ms         
hughperkins commented 8 years ago

So, I added at line 58:

print('s:size()', s:size())

And the result is:

s:size() 10
[torch.LongStorage of size 1]

It looks like you're only training on a single sentence, of ~10 characters or so, at a time. This will be insanely slow, because the program will spend all its time launching kernels and sending small amounts of data to the GPU, and the GPU will spend all its time waiting for data.

LSTMs in general are pretty hard to keep fed with data, but you should get a speed-up if you train in minibatches of, e.g., 128 sentences at a time. You'll probably need to sort your sentences by length, and group sentences of similar length into each batch, padding as necessary.
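
Here's a minimal sketch, purely illustrative and not from the original thread, of what that batching could look like in this script's terms. It assumes a padding index padIdx = nIndex + 1, which means the LookupTable would need nIndex + 1 entries; the rnn package's masking modules (e.g. nn.MaskZero) would be the cleaner route, but are omitted here.

local batchSize = 128
local padIdx = nIndex + 1   -- reserved padding index (LookupTable then needs nIndex + 1 entries)

-- sort sentence indices by length, so each batch contains similar lengths
local order = {}
for i = 1, #sentences do order[i] = i end
table.sort(order, function(a, b) return sentences[a]:size(1) < sentences[b]:size(1) end)

-- build padded (batchSize x maxLen) index tensors, one per minibatch
local batches = {}
for first = 1, #order, batchSize do
  local last = math.min(first + batchSize - 1, #order)
  local maxLen = sentences[order[last]]:size(1)   -- longest sentence in this batch
  local input = torch.Tensor(last - first + 1, maxLen):fill(padIdx)
  local target = torch.Tensor(last - first + 1)
  for k = first, last do
    local seq = sentences[order[k]]
    input[{k - first + 1, {maxLen - seq:size(1) + 1, maxLen}}] = seq   -- left-pad
    target[k - first + 1] = targets[order[k]]
  end
  table.insert(batches, {input = input, target = target})
end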

coodoo commented 8 years ago

I was doing that on purpose to experiment with some ideas. Previously I was training with mini-batches of 50 sentences, each 140 characters long and padded, but for this experiment I'm trying to verify how variable-length sentences perform (as a side note, preliminary results trained on CPU show at least a 100x lower error rate compared to the fixed-length version, which is pretty exciting! :D).

It was also my guess that the time was mostly wasted moving small amounts of data to and from the GPU; I opened this issue just to see if there's any possible workaround :P

Out of curiosity, does this happen on CUDA too? And any suggestions if one really wants to train massive amounts of variable-length data on a GPU? Thanks!

hughperkins commented 8 years ago

any suggestions if one really wants to train massive amounts of variable-length data on a GPU

If you have tons of data, you can sort it into buckets for different lengths, e.g. a bucket for length 5, a bucket for length 10, and so on. This is a pretty standard solution.
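
A minimal, illustrative sketch of that bucketing, reusing the padIdx convention from the earlier snippet:

local bucketStep = 5
local buckets = {}
for i = 1, #sentences do
  local len = sentences[i]:size(1)
  local bucketLen = math.ceil(len / bucketStep) * bucketStep   -- round up to 5, 10, 15, ...
  local padded = torch.Tensor(bucketLen):fill(padIdx)          -- padIdx as in the sketch above
  padded[{{bucketLen - len + 1, bucketLen}}] = sentences[i]    -- left-pad to the bucket length
  buckets[bucketLen] = buckets[bucketLen] or {}
  table.insert(buckets[bucketLen], {input = padded, target = targets[i]})
end
-- minibatches are then drawn from within a single bucket, so every
-- sequence in a batch has the same (padded) length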

does this happen on CUDA too?

The time to enqueue a kernel varies with hardware, driver and so on. NVIDIA cards are pretty good at keeping the kernel enqueue times low.

Otherwise, Intel have a neat trick, if you're willing to play with OpenCL 2.0 and write a bunch of code to handle it, where they launch a GPU daemon process. It's a kernel that sits on the GPU, running all the time, with a single hardware thread, and it waits for the host application to communicate with it via shared virtual memory ("SVM"). They claim to massively reduce kernel launch times this way. You can search for "GPU daemon – Road to Zero Cost Submission", though I think this will mostly just point you at the IWOCL sessions page for now: http://www.iwocl.org/attend/sessions/

coodoo commented 8 years ago

Ah, great idea. I'll experiment with the bucket thing first and see how it pans out, and will also look into that Intel trick later (C++ is not really my strongest suit, to say the least...).

As a far-far-stretched side note, have you played with the idea of OpenCL remoting to distribute computation among multiple GPUs and/or computers? As described here.

hughperkins commented 8 years ago

As a far-far-stretched side note, have you played with the idea of OpenCL remoting to distribute computation among multiple GPUs and/or computers?

If you're looking to do prediction, you don't need anything fancy, just spin up a bunch of g2 boxes.

If you're looking to do training, splitting training across multiple GPUs is hard, and typically pretty slow, because you have to synchronize the weights and so on, which is non-trivial.

coodoo commented 8 years ago

Thanks for the explanation, very helpful as always!

Saw this discussion and kinda got a feel for how tricky the whole thing could be. Quick quote:

At heart, the most common distributed training mechanism creates multiple "replicas" of the model -- each replica has a full copy. It splits the training data among the replicas, and then at the end of every batch, synchronizes the updates to the model weights between the replicas.
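
As a rough illustration of that synchronization step (not a working multi-GPU implementation; replicaGrads and replicaParams are assumed tables holding each replica's flattened gradParams and params):

-- average the gradients gathered from all replicas after a batch
local function averageGradients(replicaGrads)
  local avg = replicaGrads[1]:clone()
  for r = 2, #replicaGrads do
    avg:add(replicaGrads[r])
  end
  return avg:div(#replicaGrads)
end

-- apply one shared update, then copy the new weights back to every replica:
-- params:add(-learningRate, averageGradients(replicaGrads))
-- for r = 1, #replicaParams do replicaParams[r]:copy(params) end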

Btw, do you think it's a good idea to build an HPC machine in-house with Nvidia Titan or K20 GPU cards, so that most experiments could be done much faster, while also eliminating the pricey cost of g2.8xlarge spot instances (those have been strangely high these days)?

hughperkins commented 8 years ago

EC2 instances are cheaper if you're just going to train for a few hours, or even a few weeks. But if you're going to be training full-time for more than a month or two, having your own hardware will be much cheaper. The EC2 K520s are also pretty ancient. The Titan X is the fastest pre-Pascal card, as far as I know. You'll probably need to figure out cooling for them, otherwise thermal limits will kick in and throttle back the speed. For the datacentre, there are Titan X boxes around; I haven't tried them (yet), though they look promising. There's a whole discussion of this on reddit, see https://www.reddit.com/r/MachineLearning/comments/3wk6wl/tips_on_buyingbuilding_a_computer_with_a_gpu_for/

coodoo commented 8 years ago

I've decided to go all in on deep learning, and that was exactly my thought too (having in-house machines will beat g2 costs in the long run). I read that reddit discussion a couple of times and came up with the following configuration a while ago; I was that close to placing the order, but then I got a bunch of dirt-cheap spot instances, so...

Now the only thing holding me back is that rumor has it Nvidia will announce a new Pascal-based card during Computex 2016, so I'll wait and see if there's a faster card than the Titan before making the final decision :D

Out of curiosity, what hardware do you train on on a daily basis?

hughperkins commented 8 years ago

GTX 1080s will be out soonish, and promise to be around twice as fast, per my understanding, but they might be four times the price. So the Titan X might continue to be an optimal balance of power and price in the short term?

Since I see you have your own business: note that no one is making AMD GPUs available in the cloud currently, while there are a zillion NVIDIA GPU cloud providers (EC2, Penguin Computing, etc.). Performance of the AMD R9 390X and AMD R9 Fury is arguably competitive with the Titan X for certain workloads, and I'd be very interested in having either or both available on a per-hour basis, as with EC2. Maybe you might consider getting a box of 4, and dabble with seeing if anyone uses them?

coodoo commented 8 years ago

The GTX 1080 and friends are exactly what I'm waiting for, but yes, if the price shoots through the roof I'll most certainly just fall back to the Titan X; besides, it has 12GB of memory, which is friendlier to larger models (last time I tried to train a mini-batch of 50 sentences of 300 words on my MBP's Intel Iris graphics chip, it simply crashed due to insufficient memory).

As to the AMD side of things, I'm more of a user than a PaaS provider and just want to acquire reasonably priced hardware so that I can run experiments faster and verify as many business ideas as possible. Also, I hear OpenCL/GPGPU performance on AMD GPUs is currently a bit lackluster, so I guess Nvidia will be the go-to option for at least a year or two from now.

coodoo commented 8 years ago

@hughperkins FYI, GTX 1080 for $599, Christmas is a bit earlier this year :D

http://wccftech.com/nvidia-geforce-gtx-1080-launch/

hughperkins commented 8 years ago

Yes, nice :-)

gstoner commented 8 years ago

For inference, you can run the Fiji Nano at 75 watts of power, 3.2 TFLOPS and 450 GB/s of memory bandwidth; the Tesla M4 at 75 watts is 2.2 TFLOPS and 88 GB/s of memory bandwidth. What I can tell you is that with the ROCm stack we are starting to push the performance of Fiji beyond what we saw in the past: we measured 470 GB/s of effective memory bandwidth with a GCN-tuned assembly blit kernel. The ROCm HCC compiler now has an assembler and disassembler.

Here is a roofline performance comparison for the two chips, using Mixbench: one run with HCC/HIP on the Fiji R9 Nano, and one with CUDA and NVCC on the Titan X. The spikes on the Titan X are cache effects.
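
For context, the roofline bound that Mixbench sweeps out is the standard one, with I the arithmetic intensity in FLOPs per byte moved:

P(I) = \min\left(P_{\text{peak}},\; B_{\text{mem}} \cdot I\right)

i.e. the curves rise along the memory-bandwidth slope at low intensity and flatten out at the compute roof.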

(screenshot: Mixbench roofline comparison, Fiji R9 Nano vs Titan X)
gujunli commented 8 years ago

Hi Greg, the figure is interesting. Which benchmark did you run? Could you give more information? Thanks! Junli


gstoner commented 8 years ago

It is Mixbench, developed by Elias Konstantinidis; you can find it at https://github.com/ekondis/mixbench

Here are a few of his papers: https://www.researchgate.net/profile/Elias_Konstantinidis He has done some nice work.

greg
