hughperkins / cltorch

An OpenCL backend for torch.

char-rnn: model_utils.lua:76: bad argument #1 to 'set' #7

Closed Ambext closed 8 years ago

Ambext commented 9 years ago

Coming from issue #5

hughperkins commented 9 years ago

(Basically, for anyone else looking at this issue, some more detail from the earlier thread: when running Karpathy's char-rnn, using cltorch/OpenCL, on an AMD M295X device, the following error occurs:

The output is now:
Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1
registering spatialconvolutionmm
using OpenCL on GPU 1...

loading data files...

cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0

vocab size: 65

creating an LSTM with 2 layers

Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
THClApply.cl build log:
:29:7: warning: no previous prototype for function 'reduceOp'
float reduceOp(float _in1, float _in2) {
^
:49:6: warning: no previous prototype for function 'TensorInfo_isContiguous'
bool TensorInfo_isContiguous( TensorInfoCl tensorInfo ) {
^
:57:14: warning: no previous prototype for function 'IndexToOffset_998_get'
unsigned int IndexToOffset_998_get(unsigned int linearId, const TensorInfoCl info) {
^
:61:14: warning: no previous prototype for function 'IndexToOffset_999_get'
unsigned int IndexToOffset_999_get(unsigned int linearId, const TensorInfoCl info) {
^
:76:14: warning: no previous prototype for function 'getLinearBlockId'
unsigned int getLinearBlockId() {
^
:84:7: warning: no previous prototype for function 'reduceBlock'
float reduceBlock( local float* smem,
^
:92:23: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned int') and 'int'
if (get_local_id(0) < numVals) {
~~~~~~~~~~~~~~~ ^ ~~~~~~~
:99:31: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned int') and 'int'
float r = get_local_id(0) < numVals ? smem[get_local_id(0)] : init;
~~~~~~~~~~~~~~~ ^ ~~~~~~~
:198:6: warning: no previous prototype for function 'op'
void op( global float *out
^

/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:76: bad argument #1 to 'set' (expecting number or Tensor or Storage)
stack traceback:
[C]: in function 'set'
./util/model_utils.lua:76: in function 'flatten'
./util/model_utils.lua:103: in function 'combine_all_parameters'
train.lua:160: in main chunk
[C]: in function 'dofile'
...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x010fd87320

)
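For readers hitting the same build log: the "comparison of integers of different signs" warnings come from comparing get_local_id(0), which returns an unsigned size_t, against the signed int numVals. A minimal sketch of the fix that later builds in this thread adopt, an explicit cast on the unsigned side:

  // get_local_id(0) is size_t (unsigned); numVals is int (signed).
  // Casting the id to int makes the comparison same-signed and
  // silences the warning (safe here, since valid local ids are small).
  if ((int)get_local_id(0) < numVals) {
      smem[get_local_id(0)] = threadVal;
  }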

Ambext commented 9 years ago

Having cloned the no-prototype-warnings branch:

git clone https://github.com/hughperkins/cltorch.git -b no-prototype-warnings cltorch-warnings
cd cltorch-warnings
luarocks make rocks/cltorch-scm-1.rockspec

Executing train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1

Yields:

Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1
registering spatialconvolutionmm
using OpenCL on GPU 1...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
createForIndexedgpu gpu=1
createForindexedgpu gpu=1 verbose=1
gpu=1 currentGpuIndex=0 num_devices=1
/Users/mnemonis/torch/install/bin/luajit: C++ exception
Exowide:char-rnn mnemonis$

hughperkins commented 9 years ago

oh, can you use -gpuid 0 now please :-P (the latest version ignores cpu-type devices now)

Ambext commented 9 years ago

Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
createForIndexedgpu gpu=0
createForindexedgpu gpu=0 verbose=1
gpu=0 currentGpuIndex=0 num_devices=1
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
THClApply.cl build log:

:29:7: warning: no previous prototype for function 'reduceOp'
float reduceOp(float _in1, float _in2) {
^
:49:6: warning: no previous prototype for function 'TensorInfo_isContiguous'
bool TensorInfo_isContiguous( TensorInfoCl tensorInfo ) {
^
:57:14: warning: no previous prototype for function 'IndexToOffset_998_get'
unsigned int IndexToOffset_998_get(unsigned int linearId, const TensorInfoCl info) {
^
:61:14: warning: no previous prototype for function 'IndexToOffset_999_get'
unsigned int IndexToOffset_999_get(unsigned int linearId, const TensorInfoCl info) {
^
:76:14: warning: no previous prototype for function 'getLinearBlockId'
unsigned int getLinearBlockId() {
^
:84:7: warning: no previous prototype for function 'reduceBlock'
float reduceBlock( local float* smem,
^
:92:23: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned int') and 'int'
if (get_local_id(0) < numVals) {
~~~~~~~~~~~~~~~ ^ ~~~~~~~
:99:31: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned int') and 'int'
float r = get_local_id(0) < numVals ? smem[get_local_id(0)] : init;
~~~~~~~~~~~~~~~ ^ ~~~~~~~
:198:6: warning: no previous prototype for function 'op'
void op( global float *out
^

kernel source:
1: // OpenCL kernels....
2:
3: // expected templated values:
4: // dims (vector of unique dimension values)
5: // operation
6: // dim1
7: // dim2
8: // dim3
9: // ... dimD
10: // num_input_tensors
11: // include_scalar_input
12: //
13: // maybe should add:
14: // IndexType (hardcoded to int for now)
15: // MAX_CUTORCH_DIMS (hardcoded to 25 for now)
16:
17: // (Ported from cutorch's THCApply.cuh)
18:
19: // Maximum number of dimensions allowed for cutorch
20: // #define MAX_CUTORCH_DIMS 25
21:
22: // Enum that indicates whether tensor arguments are read/write or
23: // read-only
24: //enum TensorArgType { ReadWrite, ReadOnly };
25:
26: // not used by this kernel, but used by THClReduceApplyUtils...
27: float reduceOp(float _in1, float _in2) {
28: return 0;
29: }
30:
31: // kernel argument that defines tensor layout
32: typedef struct TensorInfoCl {
33: // Extracts size/stride information for the kernel.
34: // Successive dimensions can be collapsed if the size/strides match
35: // up and thus there are no holes between the dimensions. This is used
36: // to reduce the complexity of the problem.
37: // The optional `reduceDim` indicates a reduction dimension for the
38: // given tensor, so that the output size for this dimension will be 1.
39:
40: unsigned int sizes[25];
41: unsigned int strides[25];
42: unsigned int offset;
43: int dims;
44: } TensorInfoCl;
45: // Contiguous tensors of more than one dimension are collapsed down
46: // to one tensor
47: bool TensorInfo_isContiguous( TensorInfoCl tensorInfo ) {
48: return (tensorInfo.dims == 1 && tensorInfo.strides[0] == 1);
49: }
50:
51: // Translate a linear index for the apply to a float* offset;
52: // specialized on `Dims` to reduce nvcc compilation time
53:
54:
55: unsigned int IndexToOffset_998_get(unsigned int linearId, const TensorInfoCl info) {
56: return linearId + info.offset;
57: }
58:
59: unsigned int IndexToOffset_999_get(unsigned int linearId, const TensorInfoCl info) {
60: unsigned int offset = info.offset;
61:
62: // Use dynamic dims
63: for (int i = info.dims - 1; i >= 0; --i) {
64: unsigned int curDimIndex = linearId % info.sizes[i];
65: unsigned int curDimOffset = curDimIndex * info.strides[i];
66: offset += curDimOffset;
67:
68: linearId /= info.sizes[i];
69: }
70:
71: return offset;
72: }
73:
74: unsigned int getLinearBlockId() {
75: return get_group_id(2) * get_num_groups(1) * get_num_groups(0) +
76: get_group_id(1) * get_num_groups(0) +
77: get_group_id(0);
78: }
79:
80: // Block-wide reduction in shared memory helper; only /*threadIdx.x*/ get_local_id(0) == 0 will
81: // return the reduced value
82: float reduceBlock( local float* smem,
83: int numVals,
84: float threadVal,
85: float init) {
86: if (numVals == 0) {
87: return init;
88: }
89:
90: if (get_local_id(0) < numVals) {
91: smem[ get_local_id(0)] = threadVal;
92: }
93:
94: // First warp will perform reductions across warps
95: barrier(CLK_LOCAL_MEM_FENCE);
96: if ((get_local_id(0) / 32) == 0) {
97: float r = get_local_id(0) < numVals ? smem[get_local_id(0)] : init;
98:
99: for (int i = 32 + get_local_id(0); i < numVals; i += 32) {
100: r = reduceOp(r, smem[i]);
101: }
102:
103: smem[get_local_id(0)] = r;
104: }
105:
106: // First thread will perform reductions across the block
107: barrier(CLK_LOCAL_MEM_FENCE);
108:
109: float r = init;
110: if (get_local_id(0) == 0) {
111: r = smem[0];
112:
113: int numLanesParticipating = min(numVals, 32);
114:
115: if (numLanesParticipating == 32) {
116: // Unroll for 32 == 32 and numVals >= 32
117: // #pragma unroll
118: // unrolling by hand, so compiler-independent
119:
120: r = reduceOp(r, smem[1]);
121:
122: r = reduceOp(r, smem[2]);
123:
124: r = reduceOp(r, smem[3]);
125:
126: r = reduceOp(r, smem[4]);
127:
128: r = reduceOp(r, smem[5]);
129:
130: r = reduceOp(r, smem[6]);
131:
132: r = reduceOp(r, smem[7]);
133:
134: r = reduceOp(r, smem[8]);
135:
136: r = reduceOp(r, smem[9]);
137:
138: r = reduceOp(r, smem[10]);
139:
140: r = reduceOp(r, smem[11]);
141:
142: r = reduceOp(r, smem[12]);
143:
144: r = reduceOp(r, smem[13]);
145:
146: r = reduceOp(r, smem[14]);
147:
148: r = reduceOp(r, smem[15]);
149:
150: r = reduceOp(r, smem[16]);
151:
152: r = reduceOp(r, smem[17]);
153:
154: r = reduceOp(r, smem[18]);
155:
156: r = reduceOp(r, smem[19]);
157:
158: r = reduceOp(r, smem[20]);
159:
160: r = reduceOp(r, smem[21]);
161:
162: r = reduceOp(r, smem[22]);
163:
164: r = reduceOp(r, smem[23]);
165:
166: r = reduceOp(r, smem[24]);
167:
168: r = reduceOp(r, smem[25]);
169:
170: r = reduceOp(r, smem[26]);
171:
172: r = reduceOp(r, smem[27]);
173:
174: r = reduceOp(r, smem[28]);
175:
176: r = reduceOp(r, smem[29]);
177:
178: r = reduceOp(r, smem[30]);
179:
180: r = reduceOp(r, smem[31]);
181:
182: } else {
183: for (int i = 1; i < numLanesParticipating; ++i) {
184: r = reduceOp(r, smem[i]);
185: }
186: }
187: }
188:
189: return r;
190: }
191:
192:
193:
194:
195:
196: void op( global float *out
197:
198:
199: , float val1
200:
201: ) {
202: *out = val1;
203: }
204:
205: kernel void
206: THClTensor_pointwiseApplyD(
207:
208: global TensorInfoCl *info_1,
209: global float *data_1,
210:
211:
212: float val1,
213:
214: int totalElements) {
215: for (int linearIndex = get_global_id(0);
216: linearIndex < totalElements;
217: linearIndex += get_global_size(0)) {
218:
219: // Convert `linearIndex` into an offset of `a`
220: const int offset1 =
221: IndexToOffset_998_get(linearIndex, info_1[0]);
222:
223:
224: op(
225:
226:
227: &(data_1[offset1])
228:
229:
230: , val1
231:
232: );
233: }
234: }
235:
236:

Invalid work group size, code -54
/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:56: kernel source:
[... same listing repeated, truncated mid-line 60 ...]
stack traceback:
[C]: in function 'fill'
./util/model_utils.lua:56: in function 'flatten'
./util/model_utils.lua:103: in function 'combine_all_parameters'
train.lua:160: in main chunk
[C]: in function 'dofile'
...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x010195b320
hughperkins commented 9 years ago

(though admittedly it should produce a less vicious error message than that...)

hughperkins commented 9 years ago

My apologies, I forgot to re-run the generator step, so the cl code you received was unchanged. Can you go into the directory from the earlier clone, do a git pull, and rerun luarocks make please?

cd cltorch-warnings
git pull
luarocks make rocks/cltorch-scm-1.rockspec

(edit: and then rerun the th train.lua step, as earlier)

Ambext commented 9 years ago

No worries at all.

Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
createForIndexedgpu gpu=0
createForindexedgpu gpu=0 verbose=1
gpu=0 currentGpuIndex=0 num_devices=1
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
THClApply.cl build log:

:49:6: warning: no previous prototype for function 'TensorInfo_isContiguous'
bool TensorInfo_isContiguous( TensorInfoCl tensorInfo ) {
^
:57:14: warning: no previous prototype for function 'IndexToOffset_998_get'
unsigned int IndexToOffset_998_get(unsigned int linearId, const TensorInfoCl info) {
^
:61:14: warning: no previous prototype for function 'IndexToOffset_999_get'
unsigned int IndexToOffset_999_get(unsigned int linearId, const TensorInfoCl info) {
^
:76:14: warning: no previous prototype for function 'getLinearBlockId'
unsigned int getLinearBlockId() {
^
:84:7: warning: no previous prototype for function 'reduceBlock'
float reduceBlock( local float* smem,
^
:198:6: warning: no previous prototype for function 'op'
void op( global float *out
^

kernel source:
1: // OpenCL kernels....
[... listing unchanged from the previous dump, except that reduceOp is now marked static and the two get_local_id(0) comparisons are cast to int ...]
27: static float reduceOp(float _in1, float _in2) {
90: if ((int)get_local_id(0) < numVals) {
97: float r = (int)get_local_id(0) < numVals ? smem[get_local_id(0)] : init;
[...]
236:

Invalid work group size, code -54
/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:56: kernel source:
[... same listing repeated, truncated mid-line 60 ...]
stack traceback:
[C]: in function 'fill'
./util/model_utils.lua:56: in function 'flatten'
./util/model_utils.lua:103: in function 'combine_all_parameters'
train.lua:160: in main chunk
[C]: in function 'dofile'
...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x0107499320
Exowide:char-rnn mnemonis$
hughperkins commented 9 years ago

Hmmm, that invalid work group size error is pretty interesting. That's the problem that is causing the crash. Do you mind doing the following please?

th
require 'cltorch'
cltorch.getDeviceProperties(1)
hughperkins commented 9 years ago

(OK, just got your update. Seems we got rid of the 'no previous prototype' warning for reduceOp. I'll make the same change on the other functions.)
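The warning itself is easy to reproduce: Apple's OpenCL C compiler flags any function that is defined without a prior declaration. A minimal sketch of the pattern and the fix being applied here:

  // As originally generated -- a definition with no prior
  // declaration -- this draws "no previous prototype":
  //   float reduceOp(float _in1, float _in2) { return 0; }
  // Marking it static gives it internal linkage, so the compiler
  // no longer expects a separate prototype:
  static float reduceOp(float _in1, float _in2) {
      return 0;
  }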

Ambext commented 9 years ago

th> require 'cltorch'
{
  finish : function: 0x05ba70b8
  getState : function: 0x05ba75d0
  getDeviceCount : function: 0x05ba7108
  setTrace : function: 0x05ba7620
  setAddFinish : function: 0x05ba7670
  setDevice : function: 0x05ba7000
  synchronize : function: 0x05ba7028
  _state : userdata: 0x7fac78600160
  dumpTimings : function: 0x05ba76c0
  getDevice : function: 0x05bb80b8
  getDeviceProperties : function: 0x05ba7580
}
[0.0093s]
th> cltorch.getDeviceProperties(1)
{
  deviceType : "CPU"
  localMemSizeKB : 32
  globalMemSizeMB : 16384
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz"
  maxComputeUnits : 8
  globalMemCachelineSizeKB : 8192
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 4000
  maxMemAllocSizeMB : 4096
  maxWorkGroupSize : 1024
}
[0.0001s]
th>
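For reference, each field in the table above maps onto a standard clGetDeviceInfo query; a minimal host-side sketch (assuming an OpenCL 1.2 SDK; not cltorch's actual code):

  #include <stdio.h>
  #ifdef __APPLE__
  #include <OpenCL/opencl.h>
  #else
  #include <CL/cl.h>
  #endif

  int main(void) {
      cl_platform_id platform;
      cl_device_id devices[8];
      cl_uint ndev = 0;
      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &ndev);
      for (cl_uint i = 0; i < ndev; i++) {
          char name[256];
          size_t maxWorkGroupSize;
          cl_ulong localMemSize;
          clGetDeviceInfo(devices[i], CL_DEVICE_NAME,
                          sizeof(name), name, NULL);
          clGetDeviceInfo(devices[i], CL_DEVICE_MAX_WORK_GROUP_SIZE,
                          sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
          clGetDeviceInfo(devices[i], CL_DEVICE_LOCAL_MEM_SIZE,
                          sizeof(localMemSize), &localMemSize, NULL);
          printf("device %u: %s maxWorkGroupSize=%zu localMemKB=%llu\n",
                 i, name, maxWorkGroupSize,
                 (unsigned long long)(localMemSize / 1024));
      }
      return 0;
  }

Note how the CPU device reports maxWorkGroupSize 1024 while the GPU reports 256; that difference matters later in the thread.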

hughperkins commented 9 years ago

Hmmm. I think you have an earlier version, with a bug in getDeviceProperties. Do you mind also doing cltorch.getDeviceProperties(2) please?

hughperkins commented 9 years ago

Oh, there's a bug, so you can't. Let me fix that :-P

Ambext commented 9 years ago

I can; it yields:

th> cltorch.getDeviceProperties(2)
{
  deviceType : "GPU"
  localMemSizeKB : 32
  globalMemSizeMB : 4096
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "AMD Radeon R9 M295X Compute Engine"
  maxComputeUnits : 32
  globalMemCachelineSizeKB : 0
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 850
  maxMemAllocSizeMB : 1024
  maxWorkGroupSize : 256
}

Ambext commented 9 years ago

(and I should have figured out myself that you wanted the GPU properties...)

hughperkins commented 9 years ago

Thanks! Hmmm, workgroup size 256, that should be OK. Hmmm... pondering...

hughperkins commented 9 years ago

Oh, but I bet it's linked to the bug in getDeviceProperties.

Looking into this.

hughperkins commented 9 years ago

Ok, seems it's not easy for me to test the getProperties stuff, because all my machines have either only OpenCL GPUs, or only OpenCL CPUs, but not both :-P Do you mind doing a git pull, and trying again?

cd cltorch-warnings
git pull
luarocks make rocks/cltorch-scm-1.rockspec

When you do th train.lua, I think it currently needs -gpuid 0.

hughperkins commented 9 years ago

You know what, I do have a machine I can check the deviceProperties bit on. Let me test that first...

Ambext commented 9 years ago

OK. FYI I will be AFK for ~3 hours starting in 5 minutes.

hughperkins commented 9 years ago

Hmmm, ok. Well, the deviceProperties thing is fixed now. And I fixed the earlier warnings. Do you want to run now, and paste whatever comes out? (I think there are a ton of new warnings :-/ )

Ambext commented 9 years ago

Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
THClApply.cl build log:

:84:7: warning: no previous prototype for function 'reduceBlock'
float reduceBlock( local float* smem,
^

kernel source:
1: // OpenCL kernels....
[... listing unchanged from the previous dump, except that the remaining helper functions are now static too; reduceBlock is the only one left unqualified ...]
27: static float reduceOp(float _in1, float _in2) {
47: static bool TensorInfo_isContiguous( TensorInfoCl tensorInfo ) {
55: static unsigned int IndexToOffset_998_get(unsigned int linearId, const TensorInfoCl info) {
59: static unsigned int IndexToOffset_999_get(unsigned int linearId, const TensorInfoCl info) {
74: static unsigned int getLinearBlockId() {
82: float reduceBlock( local float* smem,
196: static void op( global float *out
[...]
236:

Invalid work group size, code -54
/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:56: kernel source:
[... same listing repeated, truncated mid-line 59 ...]
stack traceback:
[C]: in function 'fill'
./util/model_utils.lua:56: in function 'flatten'
./util/model_utils.lua:103: in function 'combine_all_parameters'
train.lua:160: in main chunk
[C]: in function 'dofile'
...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x01018f5320
hughperkins commented 9 years ago

Hmm, ok :-P And can you provide the output of cltorch.getDeviceProperties(1) please?

Ambext commented 9 years ago

th> cltorch.getDeviceProperties(1)
{
  deviceType : "CPU"
  localMemSizeKB : 32
  globalMemSizeMB : 16384
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz"
  maxComputeUnits : 8
  globalMemCachelineSizeKB : 8192
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 4000
  maxMemAllocSizeMB : 4096
  maxWorkGroupSize : 1024
}
[0.0001s]
th> cltorch.getDeviceProperties(2)
{
  deviceType : "GPU"
  localMemSizeKB : 32
  globalMemSizeMB : 4096
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "AMD Radeon R9 M295X Compute Engine"
  maxComputeUnits : 32
  globalMemCachelineSizeKB : 0
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 850
  maxMemAllocSizeMB : 1024
  maxWorkGroupSize : 256
}

hughperkins commented 9 years ago

Ummm... that's odd. You shouldn't see the CPU device.

hughperkins commented 9 years ago

It seems like you have an old version of EasyCL perhaps. When you get back, can you do:

cd cltorch-warnings
rm -Rf build
git pull
git log -n 3 --oneline
luarocks make rocks/cltorch-scm-1.rockspec
(cd EasyCL; git log -n 3 --oneline)

... and then retry the cltorch.getDeviceProperties(1), and cltorch.getDeviceCount() please?

hughperkins commented 9 years ago

Note: since 'static' gives a bunch of warnings on Hawaii, I've changed to inline. Not sure how well that works on the 295X? Thank you for all your help in fixing this problem, by the way :-)

Ambext commented 9 years ago

You are welcome - thanks for doing this.

th> cltorch.getDeviceProperties(1)
{
  deviceType : "GPU"
  localMemSizeKB : 32
  globalMemSizeMB : 4096
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "AMD Radeon R9 M295X Compute Engine"
  maxComputeUnits : 32
  globalMemCachelineSizeKB : 0
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 850
  maxMemAllocSizeMB : 1024
  maxWorkGroupSize : 256
}
[0.0001s]
th> cltorch.getDeviceCount()
1

hughperkins commented 9 years ago

Ah, that output is what I would expect to see. Can you try running char-rnn again please?

Ambext commented 9 years ago

Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6

kernel source:
1: // OpenCL kernels....
[... listing unchanged from the previous dump, except that every helper function (reduceOp, TensorInfo_isContiguous, IndexToOffset_998_get, IndexToOffset_999_get, getLinearBlockId, reduceBlock, op) is now marked inline, so the build log shows no warnings at all ...]
27: inline float reduceOp(float _in1, float _in2) {
47: inline bool TensorInfo_isContiguous( TensorInfoCl tensorInfo ) {
55: inline unsigned int IndexToOffset_998_get(unsigned int linearId, const TensorInfoCl info) {
59: inline unsigned int IndexToOffset_999_get(unsigned int linearId, const TensorInfoCl info) {
74: inline unsigned int getLinearBlockId() {
82: inline float reduceBlock( local float* smem,
196: inline void op( global float *out
[...]
236:

Invalid work group size, code -54
/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:56: kernel source:
[... same listing repeated, truncated mid-line 59 ...]
stack traceback:
[C]: in function 'fill'
./util/model_utils.lua:56: in function 'flatten'
./util/model_utils.lua:103: in function 'combine_all_parameters'
train.lua:160: in main chunk
[C]: in function 'dofile'
...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x0101933320

hughperkins commented 9 years ago

Ok. So, no more compile warnings, that's good :-) The message about invalid work group size is odd...

szagoruyko commented 9 years ago

On AMD the max block dimension is 512 and not 1024; I guess that's the problem.

hughperkins commented 9 years ago

It's failing on ':fill()', which should just call an OpenCL kernel called apply, and that runs fine on other AMD cards which have a workgroup size of 256.

hughperkins commented 9 years ago

@szagoruyko Yes, but it's not my first attempt on AMD, and I fixed these issues, in theory, so that it reads maxWorkGroupSize from the device info.
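Concretely, CL_INVALID_WORK_GROUP_SIZE (-54) is what clEnqueueNDRangeKernel returns when the requested local size exceeds what the device or kernel supports. A minimal sketch of the clamping a host launcher needs (a hypothetical helper, not cltorch's actual code; assumes the OpenCL headers are included):

  // Pick a local (workgroup) size no larger than either the device
  // limit or the per-kernel limit; enqueueing with a larger local
  // size fails with CL_INVALID_WORK_GROUP_SIZE (-54).
  size_t pickLocalSize(cl_device_id dev, cl_kernel kern, size_t preferred) {
      size_t devMax = 0, kernMax = 0;
      clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                      sizeof(devMax), &devMax, NULL);
      clGetKernelWorkGroupInfo(kern, dev, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(kernMax), &kernMax, NULL);
      size_t local = preferred;
      if (local > devMax) local = devMax;
      if (local > kernMax) local = kernMax;
      return local;
  }

(On OpenCL 1.x the global size must also be a multiple of the local size, so the launcher has to round the global size up accordingly.)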

hughperkins commented 9 years ago

Oh, I've found the problem, I think: this line should use getGpuInfo, not getDeviceInfo. And finally we will arrive at Ambext's original problem, from before I hacked around with changing from showing all devices to showing only GPUs and APUs :-P
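A minimal sketch of the difference between the two enumeration behaviours (a hypothetical helper, not EasyCL's actual API): asking clGetDeviceIDs for GPU-like device types only, instead of CL_DEVICE_TYPE_ALL, is what keeps the Intel CPU device from ever getting an index.

  // Count only GPU-like devices (APUs typically report themselves
  // as CL_DEVICE_TYPE_GPU); with CL_DEVICE_TYPE_ALL the i7 CPU
  // device would be counted too and become device 1.
  cl_uint countGpuLikeDevices(cl_platform_id platform) {
      cl_uint n = 0;
      if (clGetDeviceIDs(platform,
                         CL_DEVICE_TYPE_GPU | CL_DEVICE_TYPE_ACCELERATOR,
                         0, NULL, &n) != CL_SUCCESS) {
          return 0;  // e.g. CL_DEVICE_NOT_FOUND
      }
      return n;
  }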

hughperkins commented 9 years ago

Hi @Ambext, can you do git pull on cltorch, and try again please?

cd cltorch-warnings
git pull
luarocks make rocks/cltorch-scm-1.rockspec

(and then run the char-rnn train.lua script again, as before)

Ambext commented 9 years ago

Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:76: Tensor: invalid storage offset at /Users/mnemonis/Documents/Code_Ressources/cltorch-warnings/lib/THCl/THClTensor.cpp:645
stack traceback:
[C]: in function 'set'
./util/model_utils.lua:76: in function 'flatten'
./util/model_utils.lua:103: in function 'combine_all_parameters'
train.lua:160: in main chunk
[C]: in function 'dofile'
...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x010061d320
Exowide:char-rnn mnemonis$

hughperkins commented 9 years ago

Ah-hah, now we arrive at the original error :-)

hughperkins commented 9 years ago

Hmmm, that's a pretty bizarre error...

  /* storageOffset */
  if(storageOffset < 0)
    THError("Tensor: invalid storage offset");
  self->storageOffset = storageOffset;

There's no logical reason why storageOffset would be negative, under normal circumstances. Must be a bug somewhere else in the program :-(
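One purely illustrative way a storage offset can end up negative (not necessarily cltorch's actual bug) is a 32-bit signed overflow while accumulating offsets:

  #include <stdio.h>

  int main(void) {
      // 3e9 fits in an unsigned 32-bit int but not in a signed one;
      // converting it back to int yields a negative value on typical
      // two's-complement platforms -- tripping the guard above.
      unsigned int accumulated = 3000000000u;
      int storageOffset = (int)accumulated;
      printf("storageOffset = %d\n", storageOffset);
      return 0;
  }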

Ambext commented 9 years ago

Do you still want me to run the following? The first one yields a massive amount of info; pasting it here would be nonsensical.

th -l cltorch -e "cltorch.test()"
th -l cltorch -e "a = torch.ClTensor(50,100):fill(0.2); cltorch.dumpTimings()"

hughperkins commented 9 years ago

Hmmm, yeah, let's get the output from that, though I think it all passes OK, right?

Ambext commented 9 years ago

there are some warnings. I will post a cubby link soon.

hughperkins commented 9 years ago

ok.

Ambext commented 9 years ago

Woah. The cltorch.test output txt file is 2.5 MB!

Ambext commented 9 years ago

Exowide:~ mnemonis$ th -l cltorch -e "a = torch.ClTensor(50,100):fill(0.2); cltorch.dumpTimings()"
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
StatefulTimer readings:
Apply1 2: 0.163818ms
Apply1 3: 0.0161133ms
Apply1 4: 0.000976562ms
Apply1 5: 2.05005ms
Apply1 6: 0.00195312ms
Apply1 6a: 0.0358887ms
Apply1 7: 0.0251465ms
Apply1 8: 0.0100098ms
Apply1 start: 0.00512695ms
before dump: 0.0078125ms

Ambext commented 9 years ago

Link for the cltorch.test output: https://www.cubbyusercontent.com/pli/Cltorch.testt+()+Terminal+Saved+Output.txt/_6cada406541d4eb7a4b9aa3c5308ab0a

hughperkins commented 9 years ago

Thanks.

hughperkins commented 9 years ago

Hmmm, it's that long because loads of the tests are failing, and so it's printing the result matrices.

hughperkins commented 9 years ago

Loads of 'nan' stuff. We should probably fix that before we look at char-rnn. I suspect that once we've fixed the nan stuff, and the tests in general, char-rnn will work a bit better :-)

hughperkins commented 9 years ago

When you get a moment, can you try running the following please?

th -l cltorch -e "require 'cltorch.unit_tensor'; tester = torch.Tester(); tester.countasserts = 0; cltorch.tests.tensor.test_fills()"
th -l cltorch -e "require 'cltorch.unit_tensor'; tester = torch.Tester(); tester.countasserts = 0; cltorch.tests.tensor.test_apply()"
Ambext commented 9 years ago

Exowide:~ mnemonis$ th -l cltorch -e "require 'cltorch.unit_tensor'; tester = torch.Tester(); tester.countasserts = 0; cltorch.tests.tensor.test_fills()"
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
left
 1.3450  1.3450
 1.3450  1.3450
 1.3450  1.3450
[torch.FloatTensor of size 3x2]

right
-2.0359e+38 -2.0359e+38
-2.0359e+38 -2.0359e+38
-2.0359e+38 -2.0359e+38
[torch.FloatTensor of size 3x2]

diff
 2.0359e+38  2.0359e+38
 2.0359e+38  2.0359e+38
 2.0359e+38  2.0359e+38
[torch.FloatTensor of size 3x2]

left
 0  0
 0  0
 0  0
[torch.FloatTensor of size 3x2]

right
-1.8749e+38 -1.8749e+38
-1.8749e+38 -1.8749e+38
-1.8749e+38 -1.8749e+38
[torch.FloatTensor of size 3x2]

diff
 1.8749e+38  1.8749e+38
 1.8749e+38  1.8749e+38
 1.8749e+38  1.8749e+38
[torch.FloatTensor of size 3x2]

left
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 3x5]

right
-1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38
-1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38
-1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38
[torch.FloatTensor of size 3x5]

diff
 1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38
 1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38
 1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38
[torch.FloatTensor of size 3x5]