Segmentation Fault on OS X using GPU and clnn

Environment: OS X 64-bit. The GPU is a Radeon R9 M370X with 2 GB of memory, and clnn was installed from https://github.com/hughperkins/distro-cl

I can successfully run th neural_style.lua -print_iter 1 -gpu -1 on the CPU, and th neural_style.lua -print_iter 1 -backend clnn -gpu 0 on the Iris Pro GPU.

However, running th neural_style.lua -print_iter 1 -backend clnn -gpu 1 on the M370X GPU ends in a segmentation fault:

libthclnn_searchpath    /Users/mymac/torch-cl/install/lib/lua/5.1/libTHCLNN.so
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded models/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
Using Apple , OpenCL platform: Apple
Using OpenCL device: AMD Radeon R9 M370X Compute Engine
Setting up style layer      2   :   relu1_1 
Setting up style layer      7   :   relu2_1 
THClReduceAll.cl build log: 
<program source>:11:10: warning: unused variable 'in1'
  float *in1 = &_in1;
         ^
<program source>:12:10: warning: unused variable 'out'
  float *out = &_out;
         ^

Segmentation fault: 11
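
For anyone trying to reproduce this, the three runs described above could be scripted so that the failing device is identified by exit status rather than by reading the log. The following is only a rough sketch, not part of neural-style or clnn: the helper names (CONFIGS, build_command, describe_exit, run_all) are made up for illustration, the flags are copied from the commands in this report, and it assumes th is on the PATH and the script is started from the neural-style checkout.

```python
import signal
import subprocess

# The three configurations from this report. The dictionary keys are
# illustrative labels only; the flags are the ones quoted above.
CONFIGS = {
    "cpu": ["-gpu", "-1"],
    "iris_pro": ["-backend", "clnn", "-gpu", "0"],
    "r9_m370x": ["-backend", "clnn", "-gpu", "1"],
}


def build_command(extra_args):
    """Return the argv list for one neural_style.lua run."""
    return ["th", "neural_style.lua", "-print_iter", "1"] + list(extra_args)


def describe_exit(returncode):
    """Turn a subprocess return code into a short description.

    On POSIX, subprocess reports death by signal N as -N. If the process
    was launched through a shell wrapper, the shell convention 128 + N may
    appear instead, so both forms are checked (this second case is an
    assumption about how th is started on a given machine).
    """
    if returncode == 0:
        return "ok"
    signum = -returncode if returncode < 0 else returncode - 128
    if signum > 0:
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            pass  # not a valid signal number, fall through
    return "exit code %d" % returncode


def run_all(cwd="."):
    """Run every configuration in turn and collect a description of each exit."""
    results = {}
    for name, extra in CONFIGS.items():
        proc = subprocess.run(build_command(extra), cwd=cwd)
        results[name] = describe_exit(proc.returncode)
    return results
```

Calling run_all() and printing the returned dictionary should give one line of outcome per device. With the behaviour reported here, the r9_m370x entry would be expected to read "killed by SIGSEGV", but that depends on how th propagates the signal.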

jcjohnson / neural-style, issue #290