hughperkins / clnn

OpenCL backend for Torch nn neural networks library
BSD 2-Clause "Simplified" License

clnn for neuralconvo support. LogSoftMax.lua:23 #38

Closed: tigerneil closed this 8 years ago

tigerneil commented 8 years ago

Hey, I am trying to add cltorch support to neuralconvo, and my code is here: cltorch support neural conversation model.

bash-3.2$ th train.lua --dataset 1000 --hiddenSize 100
libthclnn_searchpath    /Users/zhuxiaohu/torch/install/lib/lua/5.1/libTHCLNN.so
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
 [=============================================================== 387810/387810 =======>] Tot: 2s514ms | Step: 0ms
-- Pre-processing data
 [============================================================= 1000/1000 =============>] Tot: 242ms | Step: 0ms
-- Removing low frequency words
 [============================================================ 1569/1569 ==============>] Tot: 99ms | Step: 0ms
Writing data/examples.t7 ...
 [============================================================= 1569/1569 =============>] Tot: 323ms | Step: 0ms
Writing data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 2536
         Examples: 1569
Using Apple , OpenCL platform: Apple
Using OpenCL device: HD Graphics 4000

It seems all right until the following error occurs:

kernel source:
1: // Threads per thread block
2: #define THCL_NONCONTIG_REDUCE_BLOCK_SIZE 32 * 16
3:
4: inline float modifyOp(float _in1) {
5:   float _out;
6:   float *in1 = &_in1;
7:   float *out = &_out;
8:   *out = *in1;
9:   return _out;
10: }
11:
12: inline float reduceOp(float _in1, float _in2) {
13:   // I guess the compiler can sort this stuff out :-P
14:   float _out;
15:   float *in1 = &_in1;
16:   float *in2 = &_in2;
17:   float *out = &_out;
18:   *out = *in1 + *in2;
19:   return _out;
20: }
21:
22: // kernel argument that defines tensor layout
23: typedef struct TensorInfoCl {
24:   // Extracts size/stride information for the kernel.
25:   // Successive dimensions can be collapsed if the size/strides match
26:   // up and thus there are no holes between the dimensions. This is used
27:   // to reduce the complexity of the problem.
28:   // The optional `reduceDim` indicates a reduction dimension for the
29:   // given tensor, so that the output size for this dimension will be 1.
30:
31:   int sizes[25];
32:   int strides[25];
33:   int offset;
34:   int dims;
35: } TensorInfoCl;
36: // Contiguous tensors of more than one dimension are collapsed down
37: // to one tensor
38:
39:
40: // Translate a linear index for the apply to a float* offset;
41: // specialized on `Dims` to reduce nvcc compilation time
42:
43:
44: inline int IndexToOffset_998_get(int linearId, global const TensorInfoCl *info) {
45:     return linearId + info->offset;
46: }
47:
48: inline int IndexToOffset_999_get(int linearId, global const TensorInfoCl *info) {
49:   int offset = info->offset;
50:
51:   // Use dynamic dims
52:   for (int i = info->dims - 1; i >= 0; --i) {
53:     int curDimIndex = linearId % info->sizes[i];
54:     int curDimOffset = curDimIndex * info->strides[i];
55:     offset += curDimOffset;
56:
57:     linearId /= info->sizes[i];
58:   }
59:
60:   return offset;
61: }
62:
63: inline int getLinearBlockId() {
64:   return get_group_id(2) * get_num_groups(1) * get_num_groups(0) +
65:     get_group_id(1) * get_num_groups(0) +
66:     get_group_id(0);
67: }
68:
69: // Block-wide reduction in shared memory helper; only /*threadIdx.x*/ get_local_id(0) == 0 will
70: // return the reduced value
71:
72: inline float reduceBlock( local float* smem,
73:                    int numVals,
74:                    float threadVal,
75:                    float init) {
76:   if (numVals == 0) {
77:     return init;
78:   }
79:
80:   if ((int)get_local_id(0) < numVals) {
81:     smem[ get_local_id(0)] = threadVal;
82:   }
83:
84:   // First warp will perform reductions across warps
85:   barrier(CLK_LOCAL_MEM_FENCE);
86:   if ((get_local_id(0) / 32) == 0) {
87:     float r = (int)get_local_id(0) < numVals ? smem[get_local_id(0)] : init;
88:
89:     for (int i = 32 + get_local_id(0); i < numVals; i += 32) {
90:       r = reduceOp(r, smem[i]);
91:     }
92:
93:     smem[get_local_id(0)] = r;
94:   }
95:
96:   // First thread will perform reductions across the block
97:   barrier(CLK_LOCAL_MEM_FENCE);
98:
99:   float r = init;
100:   if (get_local_id(0) == 0) {
101:     r = smem[0];
102:
103:     int numLanesParticipating = min(numVals, 32);
104:
105:     if (numLanesParticipating == 32) {
106:       // Unroll for 32 == 32 and numVals >= 32
107:       // #pragma unroll
108:       // unrolling by hand, so compiler-independent
109:
110:         r = reduceOp(r, smem[1]);
111:
112:         r = reduceOp(r, smem[2]);
113:
114:         r = reduceOp(r, smem[3]);
115:
116:         r = reduceOp(r, smem[4]);
117:
118:         r = reduceOp(r, smem[5]);
119:
120:         r = reduceOp(r, smem[6]);
121:
122:         r = reduceOp(r, smem[7]);
123:
124:         r = reduceOp(r, smem[8]);
125:
126:         r = reduceOp(r, smem[9]);
127:
128:         r = reduceOp(r, smem[10]);
129:
130:         r = reduceOp(r, smem[11]);
131:
132:         r = reduceOp(r, smem[12]);
133:
134:         r = reduceOp(r, smem[13]);
135:
136:         r = reduceOp(r, smem[14]);
137:
138:         r = reduceOp(r, smem[15]);
139:
140:         r = reduceOp(r, smem[16]);
141:
142:         r = reduceOp(r, smem[17]);
143:
144:         r = reduceOp(r, smem[18]);
145:
146:         r = reduceOp(r, smem[19]);
147:
148:         r = reduceOp(r, smem[20]);
149:
150:         r = reduceOp(r, smem[21]);
151:
152:         r = reduceOp(r, smem[22]);
153:
154:         r = reduceOp(r, smem[23]);
155:
156:         r = reduceOp(r, smem[24]);
157:
158:         r = reduceOp(r, smem[25]);
159:
160:         r = reduceOp(r, smem[26]);
161:
162:         r = reduceOp(r, smem[27]);
163:
164:         r = reduceOp(r, smem[28]);
165:
166:         r = reduceOp(r, smem[29]);
167:
168:         r = reduceOp(r, smem[30]);
169:
170:         r = reduceOp(r, smem[31]);
171:
172:     } else {
173:       for (int i = 1; i < numLanesParticipating; ++i) {
174:         r = reduceOp(r, smem[i]);
175:       }
176:     }
177:   }
178:
179:   return r;
180: }
181:
182:
183:
184:
185: inline int getReduceNoncontigDimSliceIndex() {
186:   // Each thread handles one slice
187:   return getLinearBlockId() * THCL_NONCONTIG_REDUCE_BLOCK_SIZE + /*threadIdx.x*/ get_local_id(0);
188: }
189:
190: // Kernel that handles an entire reduction of a slice of a tensor per each thread
191: kernel void
192: THClTensor_reduceNoncontigDim(global TensorInfoCl *out_info,
193:                               global float *out_data,
194:                               global TensorInfoCl *in_info,
195:                               global float *in_data,
196:                               int reductionStride,
197:                               int reductionSize,
198:                               int totalSlices,
199:                               float init) {
200:   const int sliceIndex = getReduceNoncontigDimSliceIndex();
201:
202:   if ((int)sliceIndex >= totalSlices) {
203:     return;
204:   }
205:
206:   // Each thread picks a point in `out` and `in` for which it is
207:   // producing the reduction
208:   const int outOffset =
209:     IndexToOffset_998_get(sliceIndex, &out_info[0]);
210:   const int inBaseOffset =
211:     IndexToOffset_998_get(sliceIndex, &in_info[0]);
212:
213:   // For each point in reductionSize, reduce into `r`
214:   int inOffset = inBaseOffset;
215:   float r = init;
216:
217:   for (int i = 0; (int)i < reductionSize; ++i) {
218:     r = reduceOp(r, modifyOp(in_data[inOffset]));
219:     inOffset += reductionStride;
220:   }
221:
222:   // Write out reduced value
223:   out_data[outOffset] = r;
224: }
225:
226: inline int getReduceContigDimSliceIndex() {
227:   // Each block handles one slice
228:   return getLinearBlockId();
229: }
230:
231: // Kernel that handles an entire reduction of a slice of a tensor per
232: // each block
233: kernel void
234: THClTensor_reduceContigDim(global TensorInfoCl *out_info,
235:                            global float *out_data,
236:                            global TensorInfoCl *in_info,
237:                            global float *in_data,
238:                            int reductionSize,
239:                            int totalSlices,
240:                            float init,
241:                            local float *smem) {
242:   const int sliceIndex = getReduceContigDimSliceIndex();
243:
244:   if ((int)sliceIndex >= totalSlices) {
245:     return;
246:   }
247:
248:   // Get the offset in `out` for the reduction
249:   const int outOffset =
250:     IndexToOffset_998_get(sliceIndex, &out_info[0]);
251:
252:   // Get the base offset in `in` for this block's reduction
253:   const int inBaseOffset =
254:     IndexToOffset_998_get(sliceIndex, &in_info[0]);
255:
256:   // Each thread in the block will reduce some subset of elements in
257:   // the slice. The elements are guaranteed contiguous starting at
258:   // `inBaseOffset`.
259:   float r = init;
260:   for (int i = /*threadIdx.x*/ get_local_id(0); (int)i < reductionSize; i += /*blockDim.x*/ get_local_size(0)) {
261:     r = reduceOp(r, modifyOp(in_data[inBaseOffset + i]));
262:   }
263:
264:   // Reduce within the block
265: //  extern __shared__ float smem[];
266:   r = reduceBlock(smem, /*blockDim.x*/ get_local_size(0), r, init);
267:
268:   if (/*threadIdx.x*/ get_local_id(0) == 0) {
269:     // Write out reduced value
270:     out_data[outOffset] = r;
271:   }
272: }
273:
274:

Invalid work group size, code -54
/Users/zhuxiaohu/torch/install/bin/luajit: ...huxiaohu/torch/install/share/lua/5.1/clnn/LogSoftMax.lua:23:
kernel source:
1: // Threads per thread block
2: #define THCL_NONCONTIG_REDUCE_BLOCK_SIZE 32 * 16
3:
4: inline float modifyOp(float _in1) {
5:   float _out;
6:   float *in1 = &_in1;
7:   float *out = &_out;
8:   *out = *in1;
9:   return _out;
10: }
11:
12: inline float reduceOp(float _in1, float _in2) {
13:   // I guess the compiler can sort this stuff out :-P
14:   float _out;
15:   float *in1 = &_in1;
16:   float *in2 = &_in2;
17:   float *out = &_out;
18:   *out = *in1 + *in2;
19:   return _out;
20: }
21:
22: // kernel argument that defines tensor layout
23: typedef struct TensorInfoCl {
24:   // Extracts size/stride information for the kernel.
25:   // Successive dimensions can be collapsed if the size/strides match
26:   // up and thus there are no holes between the dimensions. This is used
27:   // to reduce the complexity of the problem.
28:   // The optional `reduceDim` indicates a reduction dimension for the
29:   // given tensor, so that the output size for this dimension will be 1.
30:
31:   int sizes[25];
32:   int strides[25];
33:   int offset;
34:   int dims;
35: } TensorInfoCl;
36: // Contiguous tensors of more than one dimension are collapsed down
37: // to one tensor
38:
39:
40: // Translate a linear index for the apply to a float* offset;
41: // specialized on `Dims` to reduce nvcc compilation time
42:
43:
44: inline int IndexToOffset_998_get(int linearId, global const TensorInfoCl *info) {
45:     return linearId + info->offset;
46: }
47:
48: inline int IndexToOffset_999_get(int linearId, global const TensorInfoCl *info) {
49:   int offset = info->offset;
50:
51:   // Use dynamic dims
52:   for (int i = info->dims - 1; i >= 0; --i) {
53:     int curDimIndex = linearId % info->sizes[i];
54:     int curDimOffset = curDimIndex * info->strides[i];
55:     offset += curDimOffset;
56:
57:     linearId /= info->sizes[i];
58:   }
59:
60:   return offset;
61: }
62:
63: inline int getLinearBlockId() {
64:   return get_group_id(2) * get_num_groups(1) * get_num_groups(0) +
65:     ge
stack traceback:
    [C]: in function 'sum'
    ...huxiaohu/torch/install/share/lua/5.1/clnn/LogSoftMax.lua:23: in function 'updateOutput'
    ...s/zhuxiaohu/torch/install/share/lua/5.1/rnn/Recursor.lua:24: in function 'updateOutput'
    .../zhuxiaohu/torch/install/share/lua/5.1/rnn/Sequencer.lua:47: in function 'updateOutput'
    .../zhuxiaohu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:79: in function 'train'
    train.lua:82: in main chunk
    [C]: in function 'dofile'
    ...aohu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:143: in main chunk
    [C]: at 0x0109d1cbc0

Seems like I should implement some methods in clnn LogSoftMax.lua.

hughperkins commented 8 years ago

In theory, this was fixed yesterday, in https://github.com/hughperkins/cltorch/commit/2df6d9928d16613ef1cbc51e1324ae929db95979 . Do you mind installing the latest cltorch (luarocks install cltorch) and confirming whether the problem does/doesn't persist?

tigerneil commented 8 years ago
bash-3.2$ th train.lua --dataset 1000 --hiddenSize 100
libthclnn_searchpath    /Users/zhuxiaohu/torch/install/lib/lua/5.1/libTHCLNN.so
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 2536
         Examples: 1569
Using Apple , OpenCL platform: Apple
Using OpenCL device: HD Graphics 4000

-- Epoch 1 / 50

/Users/zhuxiaohu/torch/install/bin/luajit: ...u/torch/install/share/lua/5.1/clnn/ClassNLLCriterion.lua:23: Input to clnn.ClassNLLCriterion should be 2-d tensor
stack traceback:
    [C]: in function 'error'
    ...u/torch/install/share/lua/5.1/clnn/ClassNLLCriterion.lua:23: in function 'forward'
    ...u/torch/install/share/lua/5.1/rnn/SequencerCriterion.lua:27: in function 'forward'
    ./seq2seq.lua:80: in function 'train'
    train.lua:81: in main chunk
    [C]: in function 'dofile'
    ...aohu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:143: in main chunk
    [C]: at 0x0107e5fbc0

The above problem is fixed; however, a new error occurs. Any comments?

hughperkins commented 8 years ago
ubuntu:~/git/neuralconvo-cl$ th train.lua --dataset 1000 --hiddenSize 100
libthclnn_searchpath    /home/ubuntu/torch/install/lib/lua/5.1/libTHCLNN.so 
-- Loading dataset  
data/vocab.t7 not found 
-- Parsing Cornell movie dialogs data set ...   
/home/ubuntu/torch/install/bin/luajit: ./cornell_movie_dialogs.lua:6: data/cornell_movie_dialogs/movie_lines.txt: No such file or directory
hughperkins commented 8 years ago

I get this output: an error, but different from yours:

ubuntu:~/git/neuralconvo-cl$ th train.lua --dataset 1000 --hiddenSize 100
libthclnn_searchpath    /home/ubuntu/torch/install/lib/lua/5.1/libTHCLNN.so 
-- Loading dataset  
Loading vocabulary from data/vocab.t7 ...   

Dataset stats:  
  Vocabulary size: 2536 
         Examples: 1569 
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M

-- Epoch 1 / 50 

/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/nn/LookupTable.lua:85: attempt to call field 'LookupTable_accGradParameters' (a nil value)
stack traceback:
hughperkins commented 8 years ago

To diagnose the error I see, I opened ~/torch/install/share/lua/5.1/clnn/LookupTable.lua and added the following at the start of accGradParameters:

print('torch.type(input)', torch.type(input))

Result when running:

torch.type(input)   torch.IntTensor 

I think input should in fact be a ClTensor?
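
(For anyone reproducing this: an equivalent way to get the same trace without editing the installed file is to wrap the method from your own script. A minimal sketch, assuming the stock nn accGradParameters(input, gradOutput, scale) signature:

require 'nn'
require 'clnn'

-- debug-only wrapper: log the input type on every accGradParameters call
local origAccGrad = nn.LookupTable.accGradParameters
function nn.LookupTable:accGradParameters(input, gradOutput, scale)
  print('torch.type(input)', torch.type(input))
  return origAccGrad(self, input, gradOutput, scale)
end

)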

hughperkins commented 8 years ago

Seems like you're missing the conversion of input and target to cl at lines 77-80 of train.lua:

      if options.cuda then
        input = input:cuda()
        target = target:cuda()
      end
hughperkins commented 8 years ago

Training seems to work ok for me now:

clnn lookuptable.accGradParameters()...... 90/1569 A: 2m8s | Step: 86ms         
torch.type(input)   torch.ClTensor  
torch.type(gradOutput)  torch.ClTensor  
clnn lookuptable.accGradParameters()    
torch.type(input)   torch.ClTensor  
torch.type(gradOutput)  torch.ClTensor  
clnn lookuptable.accGradParameters()...... 91/1569 A: 2m7s | Step: 86ms         
torch.type(input)   torch.ClTensor  
torch.type(gradOutput)  torch.ClTensor  
clnn lookuptable.accGradParameters()    
torch.type(input)   torch.ClTensor  
torch.type(gradOutput)  torch.ClTensor  
clnn lookuptable.accGradParameters()...... 92/1569 A: 2m7s | Step: 86ms         
torch.type(input)   torch.ClTensor  
torch.type(gradOutput)  torch.ClTensor  
clnn lookuptable.accGradParameters()    
torch.type(input)   torch.ClTensor  

(Edit: basically I added the following at line 81 of train.lua:

      input = input:cl()
      target = target:cl()

)
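
(A tidier variant would mirror the cuda branch above; note the options.opencl flag here is hypothetical, since neuralconvo may name its OpenCL option differently:

      if options.cuda then
        input = input:cuda()
        target = target:cuda()
      elseif options.opencl then
        -- convert CPU tensors to torch.ClTensor, as :cuda() does for cutorch
        input = input:cl()
        target = target:cl()
      end

)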

tigerneil commented 8 years ago

After I updated my torch, it works well:

luarocks install torch
luarocks install nn
luarocks install nngraph
luarocks install cltorch
luarocks install clnn

@hughperkins thanks for your patient help.

hughperkins commented 8 years ago

Cool :-)

tigerneil commented 8 years ago

https://github.com/macournoyer/neuralconvo/issues/14 @hughperkins a new problem occurs.

Evaluation fails with the following error:

./seq2seq.lua:124: attempt to call method 'sort' (a nil value)
stack traceback:
    ./seq2seq.lua:124: in function 'eval'
    eval.lua:70: in function 'say'
    [string "_RESULT={say "hello"}"]:1: in main chunk
    [C]: in function 'xpcall'
    /Users/zhuxiaohu/torch/install/share/lua/5.1/trepl/init.lua:650: in function 'repl'
    ...aohu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:197: in main chunk
    [C]: at 0x0104929bc0
tigerneil commented 8 years ago
 -9.7471
 -9.4275
 -9.8941
 -9.7114
 -9.6610
 -9.8075
 -9.9348
 -9.7718
 -9.7425
 -9.7685
 -9.7888
 -9.7118
 -9.8315
 -9.7752
 -9.5964
 -9.7598
 -9.8950
 -9.5297
 -9.7994
 -9.8127
 -9.6132
[torch.ClTensor of size 2536]

I added a line before the error occurs to print out the value of prediction. It seems normal compared with the cutorch version.

hughperkins commented 8 years ago

Hmmm... I'm not sure I implemented/ported sort. I'm also not sure how much of a performance bottleneck the sort step is in your algo? One thing you could try initially, perhaps, is simply converting that tensor into main memory and sorting on the CPU, something like:

local predictionfloat = prediction:float()
local prob, wordIds = predictionfloat:sort(1, true)

If this doesn't affect performance too much, maybe this is acceptable? If this sort step is relatively slow compared to the rest of the program, then we probably need to rethink. Note that implementing sort on the GPU is non-trivial.
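
(For scale: the prediction tensor here is 2536 floats, roughly 10KB, so the :float() copy to main memory per evaluation step should be cheap; the real question is just whether the CPU sort over the vocabulary is fast enough.)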

hughperkins commented 8 years ago

(Hmmm, seems like you don't actually need to sort. You just need the max? :

    -- First one is the most likely.
    output = wordIds[1]

?

If this is the case, you could simply use max instead, something like (might not be quite correct):

local max, index = prediction:max(1)
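
and then output = index[1] in place of wordIds[1]; max along dimension 1 returns one-element value/index tensors here, so index[1] is the id of the most likely word (again, untested).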

)

tigerneil commented 8 years ago

Hi, in this evaluation stage, the memory we need corresponds to the vocabulary size. Your first suggestion is OK: evaluation runs correctly, and I can talk with the model now.

I will try a larger hiddenSize and run a complete training epoch to see what happens then. It should be OK.

Maybe it would be interesting to see how to implement sort on the GPU. :)

Thanks for your help and suggestions.

hughperkins commented 8 years ago

Cool. Do let me know if the sort starts becoming a bottleneck. I'm not saying I'll be able to fix it in such a case, but I will at least know it's a pain point that would be good to deal with.