Closed: Ambext closed this issue 8 years ago
(Basically, for anyone else looking at this issue, some more detail from the earlier thread: when running Karpathy's char-rnn, using cltorch/opencl, on an AMD M295X device, the following error occurs.
Output is now:
Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1
registering spatialconvolutionmm
using OpenCL on GPU 1...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
THClApply.cl build log:
:29:7: warning: no previous prototype for function 'reduceOp'
float reduceOp(float _in1, float _in2) {
^
:49:6: warning: no previous prototype for function 'TensorInfo_isContiguous'
bool TensorInfo_isContiguous( TensorInfoCl tensorInfo ) {
^
:57:14: warning: no previous prototype for function 'IndexToOffset_998_get'
unsigned int IndexToOffset_998_get(unsigned int linearId, const TensorInfoCl info) {
^
:61:14: warning: no previous prototype for function 'IndexToOffset_999_get'
unsigned int IndexToOffset_999_get(unsigned int linearId, const TensorInfoCl info) {
^
:76:14: warning: no previous prototype for function 'getLinearBlockId'
unsigned int getLinearBlockId() {
^
:84:7: warning: no previous prototype for function 'reduceBlock'
float reduceBlock( local float* smem,
^
:92:23: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned int') and 'int'
if (get_local_id(0) < numVals) {
~~~~~~~~~~~~~~~ ^ ~~~~~~~
:99:31: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned int') and 'int'
float r = get_local_id(0) < numVals ? smem[get_local_id(0)] : init;
~~~~~~~~~~~~~~~ ^ ~~~~~~~
:198:6: warning: no previous prototype for function 'op'
void op( global float *out
^
/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:76: bad argument #1 to 'set' (expecting number or Tensor or Storage)
stack traceback:
[C]: in function 'set'
./util/model_utils.lua:76: in function 'flatten'
./util/model_utils.lua:103: in function 'combine_all_parameters'
train.lua:160: in main chunk
[C]: in function 'dofile'
...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x010fd87320
)
Having cloned the no-prototype-warnings branch:
git clone https://github.com/hughperkins/cltorch.git -b no-prototype-warnings cltorch-warnings
cd cltorch-warnings
luarocks make rocks/cltorch-scm-1.rockspec
Executing train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1 yields:
Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1
registering spatialconvolutionmm
using OpenCL on GPU 1...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
createForIndexedgpu gpu=1
createForindexedgpu gpu=1 verbose=1
gpu=1 currentGpuIndex=0 num_devices=1
/Users/mnemonis/torch/install/bin/luajit: C++ exception
Exowide:char-rnn mnemonis$
oh, can you use -gpuid 0 now please :-P (latest version ignores cpu-type devices now)
Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
createForIndexedgpu gpu=0
createForindexedgpu gpu=0 verbose=1
gpu=0 currentGpuIndex=0 num_devices=1
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
THClApply.cl build log:
(though admittedly it should produce a less vicious error message than that...)
My apologies, I forgot to re-run the generator step, so the cl code you received was unchanged. Can you go into the directory from the earlier clone, and do git pull
and rerun luarocks make
please?
cd cltorch-warnings
git pull
luarocks make rocks/cltorch-scm-1.rockspec
(edit: and then rerun the th train.lua step, as earlier)
no worries at all
Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
createForIndexedgpu gpu=0
createForindexedgpu gpu=0 verbose=1
gpu=0 currentGpuIndex=0 num_devices=1
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
THClApply.cl build log:
Hmmm, that invalid workgroup size error is pretty interesting. That's the problem that is causing the crash. Do you mind doing the following please?
th
require 'cltorch'
cltorch.getDeviceProperties(1)
(Ok (just got your update), seems we got rid of the reduceOp no-prototype warning. I'll make that change on the other methods.)
th> require 'cltorch'
{
  finish : function: 0x05ba70b8
  getState : function: 0x05ba75d0
  getDeviceCount : function: 0x05ba7108
  setTrace : function: 0x05ba7620
  setAddFinish : function: 0x05ba7670
  setDevice : function: 0x05ba7000
  synchronize : function: 0x05ba7028
  _state : userdata: 0x7fac78600160
  dumpTimings : function: 0x05ba76c0
  getDevice : function: 0x05bb80b8
  getDeviceProperties : function: 0x05ba7580
}
[0.0093s]
th> cltorch.getDeviceProperties(1)
{
  deviceType : "CPU"
  localMemSizeKB : 32
  globalMemSizeMB : 16384
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz"
  maxComputeUnits : 8
  globalMemCachelineSizeKB : 8192
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 4000
  maxMemAllocSizeMB : 4096
  maxWorkGroupSize : 1024
}
[0.0001s]
th>
Hmmm. I think you have an earlier version, with a bug in getDeviceProperties. Do you mind also doing cltorch.getDeviceProperties(2) please?
Oh, there's a bug, so you can't. Let me fix that :-P
I can; it yields:
th> cltorch.getDeviceProperties(2)
{
  deviceType : "GPU"
  localMemSizeKB : 32
  globalMemSizeMB : 4096
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "AMD Radeon R9 M295X Compute Engine"
  maxComputeUnits : 32
  globalMemCachelineSizeKB : 0
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 850
  maxMemAllocSizeMB : 1024
  maxWorkGroupSize : 256
}
(and I should have figured out by myself that you wanted the GPU properties...)
Thanks! Hmmm, workgroupsize 256, that should be ok. Hmmm... pondering...
Oh, but I bet it's linked with the bug in getDeviceProperties
Looking into this.
Ok, seems it's not easy for me to test the getProperties etc, because all my machines have either only OpenCL GPUs, or only OpenCL CPUs, but not both :-P Do you mind doing a git pull, and trying again?
cd cltorch-warnings
git pull
luarocks make rocks/cltorch-scm-1.rockspec
When you do th train.lua, as far as I know, I think it currently needs -gpuid 0.
You know what, I do have a machine I can check the deviceproperties bit on. Let me test that first...
ok. FYI I will be afk for ~ 3 hours starting in 5 minutes
Hmmm, ok. Well, the deviceproperties thing is fixed now. And I fixed the earlier warnings. Do you want to run now, and paste whatever comes out? (I think there are a ton of new warnings :-/ )
Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
THClApply.cl build log:
Hmm, ok :-P And can you provide the output of cltorch.getDeviceProperties(1) please?
th> cltorch.getDeviceProperties(1)
{
  deviceType : "CPU"
  localMemSizeKB : 32
  globalMemSizeMB : 16384
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz"
  maxComputeUnits : 8
  globalMemCachelineSizeKB : 8192
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 4000
  maxMemAllocSizeMB : 4096
  maxWorkGroupSize : 1024
}
[0.0001s]
th> cltorch.getDeviceProperties(2)
{
  deviceType : "GPU"
  localMemSizeKB : 32
  globalMemSizeMB : 4096
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "AMD Radeon R9 M295X Compute Engine"
  maxComputeUnits : 32
  globalMemCachelineSizeKB : 0
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 850
  maxMemAllocSizeMB : 1024
  maxWorkGroupSize : 256
}
Ummm... that's odd. You shouldn't see the CPU device.
It seems like you have an old version of EasyCL perhaps. When you get back, can you do:
cd cltorch-warnings
rm -Rf build
git pull
git log -n 3 --oneline
luarocks make rocks/cltorch-scm-1.rockspec
(cd EasyCL; git log -n 3 --oneline)
... and then retry the cltorch.getDeviceProperties(1) and cltorch.getDeviceCount() please?
Note: since 'static' gives a bunch of warnings on Hawaii, I've changed to inline. Not sure how well that works on the 295X? Thank you for all your help in fixing this problem by the way :-)
you are welcome - thanks for doing this
th> cltorch.getDeviceProperties(1)
{
  deviceType : "GPU"
  localMemSizeKB : 32
  globalMemSizeMB : 4096
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "AMD Radeon R9 M295X Compute Engine"
  maxComputeUnits : 32
  globalMemCachelineSizeKB : 0
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 850
  maxMemAllocSizeMB : 1024
  maxWorkGroupSize : 256
}
[0.0001s]
th> cltorch.getDeviceCount()
1
Ah, that output is what I would expect to see. Can you try running char-rnn again please?
Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
kernel source:
1: // OpenCL kernels....
2:
3: // expected templated values:
4: // dims (vector of unique dimension values)
5: // operation
6: // dim1
7: // dim2
8: // dim3
9: // ... dimD
10: // num_input_tensors
11: // include_scalar_input
12: //
13: // maybe should add:
14: // IndexType (hardcoded to int for now)
15: // MAX_CUTORCH_DIMS (hardcoded to 25 for now)
16:
17: // (Ported from cutorch's THCApply.cuh)
18:
19: // Maximum number of dimensions allowed for cutorch
20: // #define MAX_CUTORCH_DIMS 25
21:
22: // Enum that indicates whether tensor arguments are read/write or
23: // read-only
24: //enum TensorArgType { ReadWrite, ReadOnly };
25:
26: // not used by this kernel, but used by THClReduceApplyUtils...
27: inline float reduceOp(float _in1, float _in2) {
28: return 0;
29: }
30:
31: // kernel argument that defines tensor layout
32: typedef struct TensorInfoCl {
33: // Extracts size/stride information for the kernel.
34: // Successive dimensions can be collapsed if the size/strides match
35: // up and thus there are no holes between the dimensions. This is used
36: // to reduce the complexity of the problem.
37: // The optional reduceDim indicates a reduction dimension for the
38: // given tensor, so that the output size for this dimension will be 1.
39:
40: unsigned int sizes[25];
41: unsigned int strides[25];
42: unsigned int offset;
43: int dims;
44: } TensorInfoCl;
45: // Contiguous tensors of more than one dimension are collapsed down
46: // to one tensor
47: inline bool TensorInfo_isContiguous( TensorInfoCl tensorInfo ) {
48: return (tensorInfo.dims == 1 && tensorInfo.strides[0] == 1);
49: }
50:
51: // Translate a linear index for the apply to a float* offset;
52: // specialized on Dims to reduce nvcc compilation time
53:
54:
55: inline unsigned int IndexToOffset_998_get(unsigned int linearId, const TensorInfoCl info) {
56: return linearId + info.offset;
57: }
58:
59: inline unsigned int IndexToOffset_999_get(unsigned int linearId, const TensorInfoCl info) {
60: unsigned int offset = info.offset;
61:
62: // Use dynamic dims
63: for (int i = info.dims - 1; i >= 0; --i) {
64: unsigned int curDimIndex = linearId % info.sizes[i];
65: unsigned int curDimOffset = curDimIndex * info.strides[i];
66: offset += curDimOffset;
67:
68: linearId /= info.sizes[i];
69: }
70:
71: return offset;
72: }
73:
74: inline unsigned int getLinearBlockId() {
75: return get_group_id(2) * get_num_groups(1) * get_num_groups(0) +
76: get_group_id(1) * get_num_groups(0) +
77: get_group_id(0);
78: }
79:
80: // Block-wide reduction in shared memory helper; only /threadIdx.x/ get_local_id(0) == 0 will
81: // return the reduced value
82: inline float reduceBlock( local float* smem,
83: int numVals,
84: float threadVal,
85: float init) {
86: if (numVals == 0) {
87: return init;
88: }
89:
90: if ((int)get_local_id(0) < numVals) {
91: smem[ get_local_id(0)] = threadVal;
92: }
93:
94: // First warp will perform reductions across warps
95: barrier(CLK_LOCAL_MEM_FENCE);
96: if ((get_local_id(0) / 32) == 0) {
97: float r = (int)get_local_id(0) < numVals ? smem[get_local_id(0)] : init;
98:
99: for (int i = 32 + get_local_id(0); i < numVals; i += 32) {
100: r = reduceOp(r, smem[i]);
101: }
102:
103: smem[get_local_id(0)] = r;
104: }
105:
106: // First thread will perform reductions across the block
107: barrier(CLK_LOCAL_MEM_FENCE);
108:
109: float r = init;
110: if (get_local_id(0) == 0) {
111: r = smem[0];
112:
113: int numLanesParticipating = min(numVals, 32);
114:
115: if (numLanesParticipating == 32) {
116: // Unroll for 32 == 32 and numVals >= 32
117: // #pragma unroll
118: // unrolling by hand, so compiler-independent
119:
120: r = reduceOp(r, smem[1]);
121:
122: r = reduceOp(r, smem[2]);
123:
124: r = reduceOp(r, smem[3]);
125:
126: r = reduceOp(r, smem[4]);
127:
128: r = reduceOp(r, smem[5]);
129:
130: r = reduceOp(r, smem[6]);
131:
132: r = reduceOp(r, smem[7]);
133:
134: r = reduceOp(r, smem[8]);
135:
136: r = reduceOp(r, smem[9]);
137:
138: r = reduceOp(r, smem[10]);
139:
140: r = reduceOp(r, smem[11]);
141:
142: r = reduceOp(r, smem[12]);
143:
144: r = reduceOp(r, smem[13]);
145:
146: r = reduceOp(r, smem[14]);
147:
148: r = reduceOp(r, smem[15]);
149:
150: r = reduceOp(r, smem[16]);
151:
152: r = reduceOp(r, smem[17]);
153:
154: r = reduceOp(r, smem[18]);
155:
156: r = reduceOp(r, smem[19]);
157:
158: r = reduceOp(r, smem[20]);
159:
160: r = reduceOp(r, smem[21]);
161:
162: r = reduceOp(r, smem[22]);
163:
164: r = reduceOp(r, smem[23]);
165:
166: r = reduceOp(r, smem[24]);
167:
168: r = reduceOp(r, smem[25]);
169:
170: r = reduceOp(r, smem[26]);
171:
172: r = reduceOp(r, smem[27]);
173:
174: r = reduceOp(r, smem[28]);
175:
176: r = reduceOp(r, smem[29]);
177:
178: r = reduceOp(r, smem[30]);
179:
180: r = reduceOp(r, smem[31]);
181:
182: } else {
183: for (int i = 1; i < numLanesParticipating; ++i) {
184: r = reduceOp(r, smem[i]);
185: }
186: }
187: }
188:
189: return r;
190: }
191:
192:
193:
194:
195:
196: inline void op( global float *out
197:
198:
199: , float val1
200:
201: ) {
202: *out = val1;
203: }
204:
205: kernel void
206: THClTensor_pointwiseApplyD(
207:
208: global TensorInfoCl *info_1,
209: global float *data_1,
210:
211:
212: float val1,
213:
214: int totalElements) {
215: for (int linearIndex = get_global_id(0);
216: linearIndex < totalElements;
217: linearIndex += get_global_size(0)) {
218:
219: // Convert linearIndex into an offset of a
220: const int offset1 =
221: IndexToOffset_998_get(linearIndex, info_1[0]);
222:
223:
224: op(
225:
226:
227: &(data_1[offset1])
228:
229:
230: , val1
231:
232: );
233: }
234: }
235:
236:
Invalid work group size, code -54
/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:56:
kernel source:
1: // OpenCL kernels....
2:
3: // expected templated values:
4: // dims (vector of unique dimension values)
5: // operation
6: // dim1
7: // dim2
8: // dim3
9: // ... dimD
10: // num_input_tensors
11: // include_scalar_input
12: //
13: // maybe should add:
14: // IndexType (hardcoded to int for now)
15: // MAX_CUTORCH_DIMS (hardcoded to 25 for now)
16:
17: // (Ported from cutorch's THCApply.cuh)
18:
19: // Maximum number of dimensions allowed for cutorch
20: // #define MAX_CUTORCH_DIMS 25
21:
22: // Enum that indicates whether tensor arguments are read/write or
23: // read-only
24: //enum TensorArgType { ReadWrite, ReadOnly };
25:
26: // not used by this kernel, but used by THClReduceApplyUtils...
27: inline float reduceOp(float _in1, float _in2) {
28: return 0;
29: }
30:
31: // kernel argument that defines tensor layout
32: typedef struct TensorInfoCl {
33: // Extracts size/stride information for the kernel.
34: // Successive dimensions can be collapsed if the size/strides match
35: // up and thus there are no holes between the dimensions. This is used
36: // to reduce the complexity of the problem.
37: // The optional reduceDim indicates a reduction dimension for the
38: // given tensor, so that the output size for this dimension will be 1.
39:
40: unsigned int sizes[25];
41: unsigned int strides[25];
42: unsigned int offset;
43: int dims;
44: } TensorInfoCl;
45: // Contiguous tensors of more than one dimension are collapsed down
46: // to one tensor
47: inline bool TensorInfo_isContiguous( TensorInfoCl tensorInfo ) {
48: return (tensorInfo.dims == 1 && tensorInfo.strides[0] == 1);
49: }
50:
51: // Translate a linear index for the apply to a float* offset;
52: // specialized on Dims to reduce nvcc compilation time
53:
54:
55: inline unsigned int IndexToOffset_998_get(unsigned int linearId, const TensorInfoCl info) {
56: return linearId + info.offset;
57: }
58:
59: inline unsigned int IndexToOffset_999_get(unsigned int linearId, const TensorInfoCl inf
stack traceback:
[C]: in function 'fill'
./util/model_utils.lua:56: in function 'flatten'
./util/model_utils.lua:103: in function 'combine_all_parameters'
train.lua:160: in main chunk
[C]: in function 'dofile'
...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x0101933320
Ok. So, no more compile warnings, that's good :-) The message about invalid workgroup size is odd...
On AMD, max block dimension is 512 and not 1024; I guess that's the problem.
It's failing on ':fill()', which should just call an OpenCL kernel called apply, which runs on other AMD cards which have a workgroupsize of 256.
@szagoruyko Yes, but it's not my first attempt on AMD, and I fixed these issues, in theory, so that it reads maxWorkGroupSize from deviceinfo.
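For reference, OpenCL error code -54 is CL_INVALID_WORK_GROUP_SIZE: the launch asked for a local (workgroup) size the device cannot honour, either because it exceeds the device's maxWorkGroupSize or because the global size is not a whole multiple of it. A minimal sketch of the clamping a launcher needs, in illustrative Python (the function name and defaults are invented; this is not cltorch's actual code):

```python
CL_INVALID_WORK_GROUP_SIZE = -54  # the error code seen in the log above

def pick_launch_sizes(total_elements, device_max_wg, preferred=256):
    """Choose (global, local) NDRange sizes that OpenCL will accept.

    The local size must not exceed CL_DEVICE_MAX_WORK_GROUP_SIZE
    (256 on this M295X), and the global size must be a whole multiple
    of the local size, otherwise clEnqueueNDRangeKernel fails with
    CL_INVALID_WORK_GROUP_SIZE (-54).
    """
    local_size = min(preferred, device_max_wg)
    # Round the global size up to the next multiple of local_size.
    global_size = -(-total_elements // local_size) * local_size
    return global_size, local_size
```

For example, `pick_launch_sizes(5000, 256)` returns `(5120, 256)`: 256 threads per workgroup, 20 workgroups, covering all 5000 elements with the usual bounds check inside the kernel.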
Oh, I've found the problem. I think this line should use getGpuInfo, not getDeviceInfo. And finally we will arrive at Ambext's original problem, before I hacked around with changing from showing all devices to showing only GPUs and APUs :-P
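To illustrate the indexing trap behind the getGpuInfo/getDeviceInfo mix-up: -gpuid counts GPU-type devices only, while the raw OpenCL device list on this Mac also contains the CPU, so the two index spaces disagree. A hypothetical Python sketch (function and field names invented for illustration, not cltorch's actual code):

```python
def gpu_index_to_device_index(devices, gpu_id):
    """Map a GPU-only index (what -gpuid selects) to an index into the
    full OpenCL device list, which may also contain CPU-type devices
    that should be skipped."""
    gpu_only = [i for i, d in enumerate(devices) if d["type"] == "GPU"]
    return gpu_only[gpu_id]

# On this machine, raw device 0 is the CPU and raw device 1 the GPU,
# so GPU 0 from the user's point of view maps to raw device 1:
devices = [
    {"name": "Intel(R) Core(TM) i7-4790K", "type": "CPU"},
    {"name": "AMD Radeon R9 M295X", "type": "GPU"},
]
```

Querying properties through the wrong one of these two index spaces returns the CPU's limits (maxWorkGroupSize 1024) while launching on the GPU (maximum 256), which is exactly the recipe for CL_INVALID_WORK_GROUP_SIZE.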
Hi @Ambext, can you do git pull on cltorch, and try again please?
cd cltorch-warnings
git pull
luarocks make rocks/cltorch-scm-1.rockspec
(and then run the char-rnn train.lua script again, as before)
Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:76: Tensor: invalid storage offset at /Users/mnemonis/Documents/Code_Ressources/cltorch-warnings/lib/THCl/THClTensor.cpp:645
stack traceback:
[C]: in function 'set'
./util/model_utils.lua:76: in function 'flatten'
./util/model_utils.lua:103: in function 'combine_all_parameters'
train.lua:160: in main chunk
[C]: in function 'dofile'
...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x010061d320
Exowide:char-rnn mnemonis$
Ah-hah, now we arrive at the original error :-)
Hmmm, that's a pretty bizarre error...
/* storageOffset */
if(storageOffset < 0)
THError("Tensor: invalid storage offset");
self->storageOffset = storageOffset;
There's no logical reason why storageOffset would be negative, under normal circumstances. Must be a bug somewhere else in the program :-(
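For anyone unfamiliar with what flatten does: it lays every parameter tensor end-to-end in one big storage and then re-sets each tensor onto that storage at a computed offset, so the offsets grow monotonically from zero. A hedged Python sketch of that bookkeeping (names invented; this is not the actual model_utils code):

```python
def flatten_offsets(param_sizes):
    """Given the element count of each parameter tensor, compute the
    storage offset at which each one starts when the tensors are laid
    end-to-end in a single flat storage."""
    offsets, cursor = [], 0
    for n in param_sizes:
        offsets.append(cursor)
        cursor += n  # the next tensor starts right after this one
    return offsets, cursor  # per-tensor offsets, total storage size
```

Every offset here is non-negative by construction, which is why a negative storageOffset at the :set() call points at an integer or bookkeeping bug elsewhere rather than at this logic.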
Do you still want me to run
th -l cltorch -e "cltorch.test()"
th -l cltorch -e "a = torch.ClTensor(50,100):fill(0.2); cltorch.dumpTimings()"
? The first line yields a massive amount of info; pasting it here would be nonsensical.
Hmmm, yeah, let's get the output from that, though I think it all passes ok, right?
there are some warnings. I will post a cubby link soon.
ok.
Woah. The cltorch.test output txt file is 2.5 MB!
Exowide:~ mnemonis$ th -l cltorch -e "a = torch.ClTensor(50,100):fill(0.2); cltorch.dumpTimings()"
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
StatefulTimer readings:
  Apply1 2: 0.163818ms
  Apply1 3: 0.0161133ms
  Apply1 4: 0.000976562ms
  Apply1 5: 2.05005ms
  Apply1 6: 0.00195312ms
  Apply1 6a: 0.0358887ms
  Apply1 7: 0.0251465ms
  Apply1 8: 0.0100098ms
  Apply1 start: 0.00512695ms
  before dump: 0.0078125ms
link for the cltorch.test output https://www.cubbyusercontent.com/pli/Cltorch.testt+()+Terminal+Saved+Output.txt/_6cada406541d4eb7a4b9aa3c5308ab0a
Thanks.
Hmmm, it's that long because loads of the tests are failing, and so it's printing the result matrices.
Loads of 'nan' stuff. We should probably fix that before we look at char-rnn. I suspect that once we've fixed the nan stuff, and the tests in general, char-rnn will work a bit better :-)
When you get a moment, can you try running the following please?
th -l cltorch -e "require 'cltorch.unit_tensor'; tester = torch.Tester(); tester.countasserts = 0; cltorch.tests.tensor.test_fills()"
th -l cltorch -e "require 'cltorch.unit_tensor'; tester = torch.Tester(); tester.countasserts = 0; cltorch.tests.tensor.test_apply()"
Exowide:~ mnemonis$ th -l cltorch -e "require 'cltorch.unit_tensor'; tester = torch.Tester(); tester.countasserts = 0; cltorch.tests.tensor.test_fills()"
Using Apple platform: Apple
Using device: AMD Radeon R9 M295X Compute Engine
statefultimer v0.6
left
 1.3450  1.3450
 1.3450  1.3450
 1.3450  1.3450
[torch.FloatTensor of size 3x2]
right
-2.0359e+38 -2.0359e+38
-2.0359e+38 -2.0359e+38
-2.0359e+38 -2.0359e+38
[torch.FloatTensor of size 3x2]
diff
 2.0359e+38  2.0359e+38
 2.0359e+38  2.0359e+38
 2.0359e+38  2.0359e+38
[torch.FloatTensor of size 3x2]
left
 0  0
 0  0
 0  0
[torch.FloatTensor of size 3x2]
right
-1.8749e+38 -1.8749e+38
-1.8749e+38 -1.8749e+38
-1.8749e+38 -1.8749e+38
[torch.FloatTensor of size 3x2]
diff
 1.8749e+38  1.8749e+38
 1.8749e+38  1.8749e+38
 1.8749e+38  1.8749e+38
[torch.FloatTensor of size 3x2]
left
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 3x5]
right
-1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38
-1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38
-1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38 -1.8749e+38
[torch.FloatTensor of size 3x5]
diff
 1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38
 1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38
 1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38  1.8749e+38
[torch.FloatTensor of size 3x5]
Coming from issue #5