jcjohnson / torch-rnn

Efficient, reusable RNNs and LSTMs for torch
MIT License

OpenCL backend slower than CPU #11

Open mewo2 opened 8 years ago

mewo2 commented 8 years ago

Running the tinyshakespeare dataset with the default settings, I get timings of around 0.3 s/iteration on the CPU, but with the OpenCL backend I get more like 2.6 s/iteration. These timings seem to be similar whether or not benchmarking is enabled. Running char-rnn, the timings are approximately reversed (around 3 s/iteration on the CPU, 0.3 s/iteration on the GPU).

Running on OS X 10.10, with a Radeon 5770.

clausd commented 8 years ago

I get similar slowness on other tests.

jcjohnson commented 8 years ago

OpenCL is slow for me too on a Titan X. I think the slowness is due to the OpenCL implementation of nn.LookupTable, but I'm not positive.
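
If anyone wants to sanity-check that hypothesis, here is a rough micro-benchmark sketch of nn.LookupTable on CPU vs OpenCL. Hedging heavily: it assumes cltorch/clnn are installed, that clnn actually ships a LookupTable kernel, that cltorch.finish() is the right call to flush the OpenCL queue before reading the timer, and the sizes are just my guess at the default tinyshakespeare model; treat it as a sketch, not a verified reproduction.

require 'torch'
require 'nn'

-- Sizes are assumptions roughly matching the default tinyshakespeare setup.
local vocab, dim, batch, seq = 65, 128, 50, 50
local input = torch.Tensor(batch, seq):random(1, vocab)

-- Average one forward+backward pass; sync() (if given) should block until
-- queued GPU work has finished so the timer is meaningful.
local function bench(mod, inp, iters, sync)
  mod:forward(inp); mod:backward(inp, mod.output)   -- warm-up / kernel build
  if sync then sync() end
  local timer = torch.Timer()
  for i = 1, iters do
    local out = mod:forward(inp)
    mod:backward(inp, out)
  end
  if sync then sync() end
  return timer:time().real / iters
end

print(('LookupTable CPU:    %.4f s/iter'):format(
  bench(nn.LookupTable(vocab, dim), input, 50)))

local okCl, cltorch = pcall(require, 'cltorch')
local okNn = okCl and pcall(require, 'clnn')
if okCl and okNn then
  local sync = function() cltorch.finish() end   -- assumption: per the cltorch README
  print(('LookupTable OpenCL: %.4f s/iter'):format(
    bench(nn.LookupTable(vocab, dim):cl(), input:cl(), 50, sync)))
end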

simopal6 commented 8 years ago

I have a similar issue with CUDA (I don't know if I should open another issue). Using the default configuration and tiny-shakespeare, I get ~0.05 s on CPU and ~0.08 s on GPU per iteration.

jcjohnson commented 8 years ago

What type of GPU are you using?

simopal6 commented 8 years ago

GeForce GTX TITAN X

simopal6 commented 8 years ago

Sorry, my bad. I just had a better look at the command-line parameters; I had thought that omitting "-gpu" would run on the CPU. With "-gpu -1" it is about 40-50 ms per iteration.
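
For anyone else tripping over the same thing, this is my reading of the torch-rnn flags (treat the exact defaults as my interpretation of the docs, not gospel): -gpu defaults to 0 and uses CUDA, -gpu -1 forces CPU-only mode, and -gpu_backend opencl switches the GPU path to OpenCL. So roughly:

th train.lua -input_h5 data/tiny_shakespeare.h5 -input_json data/tiny_shakespeare.json -gpu -1    (CPU only)
th train.lua -input_h5 data/tiny_shakespeare.h5 -input_json data/tiny_shakespeare.json    (CUDA on GPU 0, the default)
th train.lua -input_h5 data/tiny_shakespeare.h5 -input_json data/tiny_shakespeare.json -gpu_backend opencl    (OpenCL)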

jtippett commented 8 years ago

Seeing this also on an AMD Radeon R9 M370X 2048 MB. Much slower than CPU, perhaps 10x, as the OP suggests. Makes me wish I'd bought a machine with an NVIDIA card!

vinhqdang commented 8 years ago

Hello

Do I need to install OpenCL before installing cltorch and clnn?

jcjohnson commented 8 years ago

I don't think you need to explicitly install OpenCL based on the cltorch installation instructions here:

https://github.com/hughperkins/cltorch#installation
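
(For reference, and going from memory of that README rather than re-checking it: installation is typically just luarocks install cltorch followed by luarocks install clnn, on top of whatever OpenCL driver the OS or GPU vendor already provides. On OS X the Apple OpenCL framework ships with the system.)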


beZXphUB commented 7 years ago

I am still getting this problem using Intel Iris Graphics 550 1536 MB on an MBP.

CPU training is about 0.15 - 0.20 s per forward/backward pass, and GPU training with OpenCL is about 1.4 s.

Output:

(py2) Charless-MBP:torch-rnn charles$ th train.lua -input_h5 data/tiny_shakespeare.h5 -input_json data/tiny_shakespeare.json -gpu_backend opencl -speed_benchmark 1
Using Apple , OpenCL platform: Apple
Using OpenCL device: Intel(R) Iris(TM) Graphics 550
Running with OpenCL on GPU 0
Forward / Backward pass took 4.0332989692688
Epoch 1.00 / 50, i = 1 / 17800, loss = 4.178679
Forward / Backward pass took 2.0255770683289
Epoch 1.01 / 50, i = 2 / 17800, loss = 4.086461
Forward / Backward pass took 1.5219600200653
Epoch 1.01 / 50, i = 3 / 17800, loss = 3.945212
Forward / Backward pass took 1.3577451705933
Epoch 1.01 / 50, i = 4 / 17800, loss = 3.758727
Forward / Backward pass took 1.2509491443634
Epoch 1.01 / 50, i = 5 / 17800, loss = 3.587259
Forward / Backward pass took 1.254331111908
Epoch 1.02 / 50, i = 6 / 17800, loss = 3.492134
Forward / Backward pass took 1.258672952652
Epoch 1.02 / 50, i = 7 / 17800, loss = 3.403253
Forward / Backward pass took 1.1694939136505
Epoch 1.02 / 50, i = 8 / 17800, loss = 3.414152

Any advice on fixing it?

timbitz commented 6 years ago

I am also finding that the CPU (-gpu -1, forward/backward pass takes ~0.15 s) is ~7x faster than -gpu_backend opencl (forward/backward pass takes ~1.05 s) on my AMD Radeon R9 M395X.

Any ideas? This issue seems to have become quiet over the last year...

maiamcc commented 6 years ago

++ (running on Intel Iris Graphics 8100 1536 MB on High Sierra (10.13.4) on an early 2015 MBP).

elspru commented 6 years ago

I think you guys are forgetting that the GPU clock speed is roughly 1/3 of the CPU clock speed, so that is why it takes about 3 times longer to do a pass.

GPUs are good when you have a large amount of work to do in parallel, because they generally have more cores than a CPU, particularly if you take advantage of the vector data types.
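
To see that second point in practice, here is a rough sketch that times a matrix multiply at a small and a large size on CPU vs OpenCL, to find where the GPU starts to win on a given machine. Hedged the same way as above: it assumes cltorch is installed, that torch.mm dispatches to ClTensors, and that cltorch.finish() is the right synchronization call; the sizes are arbitrary.

require 'torch'

local ok, cltorch = pcall(require, 'cltorch')
local clSync = ok and function() cltorch.finish() end or nil

-- Average time for one matrix multiply; sync() (if given) blocks until the
-- queued OpenCL work has finished so the timer measures real work.
local function avgMM(a, b, iters, sync)
  local c = torch.mm(a, b)                -- warm-up (kernel build on OpenCL)
  if sync then sync() end
  local timer = torch.Timer()
  for i = 1, iters do
    c = torch.mm(a, b)
  end
  if sync then sync() end
  return timer:time().real / iters
end

for _, n in ipairs({64, 2048}) do
  local a, b = torch.randn(n, n):float(), torch.randn(n, n):float()
  local line = string.format('n = %4d   CPU: %.5f s/mm', n, avgMM(a, b, 20))
  if ok then
    line = line .. string.format('   OpenCL: %.5f s/mm', avgMM(a:cl(), b:cl(), 20, clSync))
  end
  print(line)
end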