Closed — aletote closed this issue 8 years ago
Yes, it's not very fast on this particular model unfortunately: lots of kernel launches, each touching only a few thousand floats. I think a way forward on this would be kernel fusion, which I started on but have put to one side for now. You can look at the code I have so far at https://github.com/hughperkins/clnn/tree/fused-modules , https://github.com/hughperkins/clnn/tree/fusibles , and/or https://github.com/hughperkins/clnn/tree/connectors , and a description of approximately how this works at https://github.com/torch/nngraph/issues/60#issuecomment-126917549 .
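To illustrate why fusion helps here: every kernel launch pays a fixed overhead, and an LSTM step issues several small elementwise kernels (sigmoid, tanh, pointwise multiply, add), so with only a few thousand floats per launch the overhead dominates the arithmetic. A rough back-of-the-envelope sketch in Python (the overhead and per-element costs below are made-up illustrative numbers, not measurements from clnn):

```python
# Toy cost model for kernel fusion (illustrative numbers only, not clnn's
# actual timings). Unfused: each elementwise op is its own kernel launch.
# Fused: one launch does all the ops, so the fixed overhead is paid once.

LAUNCH_OVERHEAD_NS = 10_000  # assumed fixed cost per kernel launch
NS_PER_ELEMENT = 1           # assumed per-element arithmetic cost

def time_unfused_ns(n_elements, n_ops):
    """Total time when every elementwise op launches its own kernel."""
    return n_ops * (LAUNCH_OVERHEAD_NS + n_elements * NS_PER_ELEMENT)

def time_fused_ns(n_elements, n_ops):
    """Total time when all ops are fused into a single kernel launch."""
    return LAUNCH_OVERHEAD_NS + n_ops * n_elements * NS_PER_ELEMENT

# A few thousand floats, ~8 elementwise ops per LSTM gate computation:
n, ops = 4000, 8
print(time_unfused_ns(n, ops))  # 112000 ns
print(time_fused_ns(n, ops))    # 42000 ns
```

With these assumed numbers the unfused path spends most of its time in launch overhead, which matches the symptom above: the GPU is mostly idle between many tiny launches, so fusing them recovers most of that time.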
Closing this, since it's a question really, right?
Hi, I get the same speed. Everything is working fine, no errors, but no speed improvement whatsoever when using the GPU. I'm on a brand-new MacBook Pro. Can you give me some advice?
```
ales-MacBook-Pro:char-rnn ale$ th train.lua -opencl 1 -gpuid 0
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an lstm with 2 layers
Using Apple , OpenCL platform: Apple
Using OpenCL device: Iris Pro
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
number of parameters in the model: 240321
cloning rnn
cloning criterion
THClReduceAll.cl build log:
```