hughperkins / cltorch

An OpenCL backend for torch.

char-rnn: nans at training #10

Closed: xonobo closed this 9 years ago

xonobo commented 9 years ago

I couldn't run the test routines. I guess they are available in the latest version, but I couldn't update my cltorch in an offline fashion. That's why I reopened issue #9.


cltorch output:

th> require('cutorch')
{
  streamWaitFor : function: 0x4058b558
  deviceReset : function: 0x4058ba98
  test : function: 0x409a58f0
  _state : userdata: 0x0220bca0
  streamSynchronize : function: 0x4058b928
  manualSeed : function: 0x4058bf20
  setStream : function: 0x4058b2b0
  getMemoryUsage : function: 0x4058bcd8
  setDefaultStream : function: 0x4058b488
  getBlasHandle : function: 0x4058b030
  CudaHostAllocator : torch.Allocator
  getNumStreams : function: 0x4058b1e0
  manualSeedAll : function: 0x4058bfe0
  initialSeed : function: 0x4058bec0
  getStream : function: 0x4058b370
  setRNGState : function: 0x405840b0
  setBlasHandle : function: 0x4058af68
  seed : function: 0x4058bda0
  getDeviceProperties : function: 0x4058bc18
  reserveStreams : function: 0x4058b100
  withDevice : function: 0x409a5958
  setDevice : function: 0x4058bd40
  seedAll : function: 0x4058be60
  getNumBlasHandles : function: 0x4058af00
  getDeviceCount : function: 0x4058bb58
  createCudaHostTensor : function: 0x409a5998
  getState : function: 0x40584170
  getDevice : function: 0x4058b9b0
  synchronize : function: 0x4058ad00
  getRNGState : function: 0x40584048
  streamWaitForMultiDevice : function: 0x4058b650
  reserveBlasHandles : function: 0x4058ae30
  streamBarrierMultiDevice : function: 0x4058b838
  streamBarrier : function: 0x4058b720
}
[1.0545s]
th> require('cltorch')
{
  setAddFinish : function: 0x41f1bd90
  getDeviceCount : function: 0x41f1bbd8
  getDeviceProperties : function: 0x41f1bcc8
  getState : function: 0x41f1bcf0
  getDevice : function: 0x412c43e0
  setDevice : function: 0x412c4440
  _state : userdata: 0x02467900
  dumpTimings : function: 0x41f1bb28
  setTrace : function: 0x41f1bd40
  synchronize : function: 0x412c4468
  finish : function: 0x41f1bbb0
}
[0.0146s]
th> cltorch.getDeviceProperties(1)
{
  deviceType : "CPU"
  maxClockFrequency : 2000
  deviceName : " Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz"
  maxMemAllocSizeMB : 16089
  globalMemCachelineSizeKB : 0
  deviceVersion : "OpenCL 1.2 (Build 43)"
  localMemSizeKB : 32
  openClCVersion : "OpenCL C 1.2 "
  maxWorkGroupSize : 8192
  globalMemSizeMB : 64358
  platformVendor : "Intel(R) Corporation"
  maxComputeUnits : 32
}
[0.0011s]
th> cltorch.getDeviceProperties(2)
{
  deviceType : "GPU"
  maxClockFrequency : 705
  deviceName : "Quadro 410"
  maxMemAllocSizeMB : 128
  globalMemCachelineSizeKB : 0
  deviceVersion : "OpenCL 1.1 CUDA"
  localMemSizeKB : 47
  openClCVersion : "OpenCL C 1.1 "
  maxWorkGroupSize : 1024
  globalMemSizeMB : 509
  platformVendor : "NVIDIA Corporation"
  maxComputeUnits : 1
}
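For reference, the -gpuid indices used in the runs below can be checked programmatically. A minimal sketch, using only the cltorch calls visible in the dump above (getDeviceCount, getDeviceProperties):

require('cltorch')
-- list every OpenCL device, so the right index can be passed as -gpuid
for i = 1, cltorch.getDeviceCount() do
  local props = cltorch.getDeviceProperties(i)
  print(i, props.deviceType, props.deviceName)
end
-- on this machine: device 1 is the Xeon CPU, device 2 is the Quadro 410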


training on CPU

th train.lua -data_dir data/tinyshakespeare/ -opencl 0 -gpuid -1
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file data/tinyshakespeare/input.txt...
loading text file...

creating vocabulary mapping...

putting data into tensor...
saving data/tinyshakespeare/vocab.t7

saving data/tinyshakespeare/data.t7
loading data files...

cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0

vocab size: 65

creating an LSTM with 2 layers

number of parameters in the model: 240321

cloning criterion

cloning rnn
1/21150 (epoch 0.002), train_loss = 4.19766416, grad/param norm = 4.5006e-01, time/batch = 1.74s

2/21150 (epoch 0.005), train_loss = 4.10134056, grad/param norm = 6.3375e-01, time/batch = 2.14s

3/21150 (epoch 0.007), train_loss = 3.44502399, grad/param norm = 9.4798e-01, time/batch = 1.48s

4/21150 (epoch 0.009), train_loss = 3.45054399, grad/param norm = 1.1340e+00, time/batch = 2.03s

5/21150 (epoch 0.012), train_loss = 3.33238818, grad/param norm = 7.8976e-01, time/batch = 1.62s

6/21150 (epoch 0.014), train_loss = 3.37363688, grad/param norm = 7.0334e-01, time/batch = 1.61s

7/21150 (epoch 0.017), train_loss = 3.36438210, grad/param norm = 6.5300e-01, time/batch = 2.70s

8/21150 (epoch 0.019), train_loss = 3.33342581, grad/param norm = 7.6950e-01, time/batch = 1.72s

9/21150 (epoch 0.021), train_loss = 3.29173263, grad/param norm = 6.1282e-01, time/batch = 1.79s

10/21150 (epoch 0.024), train_loss = 3.38000728, grad/param norm = 4.1881e-01, time/batch = 2.92s
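The "cutting off end of data" message above just means the loader drops the tail of the corpus so it divides into whole batches. A rough sketch of that arithmetic, assuming the loader simply truncates to a multiple of batch_size * seq_length, with char-rnn's defaults of 50 for both taken as an assumption:

-- sketch of the truncation step; data is assumed to be the 1-D tensor of
-- character ids produced by preprocessing
local batch_size, seq_length = 50, 50      -- assumed defaults
local chunk = batch_size * seq_length
local len = data:size(1)
if len % chunk ~= 0 then
  -- keep only a whole number of (batch_size x seq_length) chunks
  data = data:sub(1, chunk * math.floor(len / chunk))
end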


training on GPU - CUDA

th train.lua -data_dir data/tinyshakespeare/ -opencl 0 -gpuid 0
using CUDA on GPU 0...

loading data files...

cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0

vocab size: 65

creating an LSTM with 2 layers

number of parameters in the model: 240321

cloning criterion

cloning rnn
1/21150 (epoch 0.002), train_loss = 4.16315975, grad/param norm = 4.5507e-01, time/batch = 0.50s

2/21150 (epoch 0.005), train_loss = 4.06560737, grad/param norm = 6.1592e-01, time/batch = 0.45s

3/21150 (epoch 0.007), train_loss = 3.50594769, grad/param norm = 1.2221e+00, time/batch = 0.45s

4/21150 (epoch 0.009), train_loss = 3.45355825, grad/param norm = 1.3675e+00, time/batch = 0.44s

5/21150 (epoch 0.012), train_loss = 3.35222242, grad/param norm = 1.2052e+00, time/batch = 0.44s

6/21150 (epoch 0.014), train_loss = 3.37636928, grad/param norm = 8.7048e-01, time/batch = 0.44s

7/21150 (epoch 0.017), train_loss = 3.36737326, grad/param norm = 6.1815e-01, time/batch = 0.44s

8/21150 (epoch 0.019), train_loss = 3.32496874, grad/param norm = 4.2533e-01, time/batch = 0.44s

9/21150 (epoch 0.021), train_loss = 3.29095509, grad/param norm = 4.5369e-01, time/batch = 0.44s

10/21150 (epoch 0.024), train_loss = 3.38070163, grad/param norm = 4.3267e-01, time/batch = 0.44s
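The grad/param norm column logged in these runs is, presumably, the ratio of the gradient norm to the parameter norm, which is why its jump to around 1e+02 in the failing runs below is a red flag. A minimal sketch of computing such a ratio, assuming params and grad_params are the flattened tensors returned by model:getParameters():

-- params, grad_params: flat views of all weights and all gradients
local params, grad_params = model:getParameters()
-- ratio of gradient magnitude to parameter magnitude, as logged above
local ratio = grad_params:norm() / params:norm()
print(string.format('grad/param norm = %6.4e', ratio))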


training on GPU - OPENCL

th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1
registering spatialconvolutionmm
using OpenCL on GPU 1...

loading data files...

cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0

vocab size: 65

creating an LSTM with 2 layers

Using NVIDIA Corporation platform: NVIDIA CUDA
Using device: Quadro 410
statefultimer v0.6
number of parameters in the model: 240321

cloning criterion

cloning rnn
1/21150 (epoch 0.002), train_loss = 4.19766393, grad/param norm = 4.5006e-01, time/batch = 1.31s

2/21150 (epoch 0.005), train_loss = 4.10134039, grad/param norm = 6.3375e-01, time/batch = 1.10s

3/21150 (epoch 0.007), train_loss = 3.44484827, grad/param norm = 9.4796e-01, time/batch = 1.11s

4/21150 (epoch 0.009), train_loss = 3.45040853, grad/param norm = 1.1346e+00, time/batch = 1.10s

5/21150 (epoch 0.012), train_loss = 3.33218116, grad/param norm = 7.8938e-01, time/batch = 1.09s

6/21150 (epoch 0.014), train_loss = 3.37349831, grad/param norm = 7.0234e-01, time/batch = 1.04s

7/21150 (epoch 0.017), train_loss = 3.36418301, grad/param norm = 6.5344e-01, time/batch = 0.96s

8/21150 (epoch 0.019), train_loss = 3.33336397, grad/param norm = 7.7021e-01, time/batch = 0.95s

9/21150 (epoch 0.021), train_loss = 3.29151368, grad/param norm = 6.1312e-01, time/batch = 0.95s

10/21150 (epoch 0.024), train_loss = 3.37983895, grad/param norm = 4.1876e-01, time/batch = 0.95s


training on CPU - OPENCL

th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...

loading data files...

cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0

vocab size: 65

creating an LSTM with 2 layers

Using Intel(R) Corporation platform: Intel(R) OpenCL
Using device: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
statefultimer v0.6
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
number of parameters in the model: 240321

cloning criterion

cloning rnn
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (8) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClTensorMathTransformReduce.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClReduce.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClGather.cpp build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClReduceAll.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (8) Kernel was successfully vectorized (4) Kernel was successfully vectorized (8) Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClScatter.cpp build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClReduceAll.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (8) Kernel was successfully vectorized (4) Kernel was successfully vectorized (8) Done.
THClReduceAll.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (8) Kernel was successfully vectorized (4) Kernel was successfully vectorized (8) Done.
1/21150 (epoch 0.002), train_loss = nan, grad/param norm = 1.0629e+02, time/batch = 84.92s

loss is exploding, aborting.
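Since NaN is the only Lua value that does not compare equal to itself, a cheap guard after each training step would catch this on the first bad batch; a hypothetical check, with loss standing in for the scalar training loss:

-- hypothetical NaN guard; x ~= x holds only when x is NaN
if loss ~= loss then
  print('loss is NaN, aborting.')
  os.exit(1)
end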


training on CPU - OPENCL - fork of hughperkins

th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...

loading data files...

cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0

vocab size: 65

creating an LSTM with 2 layers

Using Intel(R) Corporation platform: Intel(R) OpenCL
Using device: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
statefultimer v0.6
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
number of parameters in the model: 240321

cloning criterion

cloning rnn
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (8) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClTensorMathTransformReduce.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClReduce.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClGather.cpp build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClReduceAll.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (8) Kernel was successfully vectorized (4) Kernel was successfully vectorized (8) Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClScatter.cpp build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was not vectorized Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClApply.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (4) Done.
THClReduceAll.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (8) Kernel was successfully vectorized (4) Kernel was successfully vectorized (8) Done.
THClReduceAll.cl build log: Compilation started Compilation done Linking started Linking done Device build started Device build done Kernel was successfully vectorized (8) Kernel was successfully vectorized (4) Kernel was successfully vectorized (8) Done.
1/21150 (epoch 0.002), train_loss = nan, grad/param norm = 1.0629e+02, time/batch = 87.60s

2/21150 (epoch 0.005), train_loss = nan, grad/param norm = 1.0279e+02, time/batch = 2.42s

3/21150 (epoch 0.007), train_loss = nan, grad/param norm = 9.8932e+01, time/batch = 2.40s

4/21150 (epoch 0.009), train_loss = nan, grad/param norm = 9.5078e+01, time/batch = 2.41s

5/21150 (epoch 0.012), train_loss = nan, grad/param norm = 9.1377e+01, time/batch = 2.38s

6/21150 (epoch 0.014), train_loss = nan, grad/param norm = 8.7885e+01, time/batch = 2.43s

7/21150 (epoch 0.017), train_loss = nan, grad/param norm = 8.4618e+01, time/batch = 2.38s

8/21150 (epoch 0.019), train_loss = nan, grad/param norm = 8.1572e+01, time/batch = 2.38s

9/21150 (epoch 0.021), train_loss = nan, grad/param norm = 7.8735e+01, time/batch = 2.38s

10/21150 (epoch 0.024), train_loss = nan, grad/param norm = 7.6093e+01, time/batch = 2.38s

11/21150 (epoch 0.026), train_loss = nan, grad/param norm = 7.3629e+01, time/batch = 2.64s

12/21150 (epoch 0.028), train_loss = nan, grad/param norm = 7.1329e+01, time/batch = 2.44s

13/21150 (epoch 0.031), train_loss = nan, grad/param norm = 6.9178e+01, time/batch = 2.40s

14/21150 (epoch 0.033), train_loss = nan, grad/param norm = 6.7162e+01, time/batch = 2.40s

15/21150 (epoch 0.035), train_loss = 4.19833310, grad/param norm = 2.8035e-01, time/batch = 2.42s

16/21150 (epoch 0.038), train_loss = nan, grad/param norm = 6.5226e+01, time/batch = 2.38s

17/21150 (epoch 0.040), train_loss = nan, grad/param norm = 6.3410e+01, time/batch = 2.39s

18/21150 (epoch 0.043), train_loss = nan, grad/param norm = 6.1704e+01, time/batch = 2.48s

19/21150 (epoch 0.045), train_loss = nan, grad/param norm = 6.0097e+01, time/batch = 2.56s

20/21150 (epoch 0.047), train_loss = nan, grad/param norm = 5.8581e+01, time/batch = 2.62s

21/21150 (epoch 0.050), train_loss = 4.20001215, grad/param norm = 2.8320e-01, time/batch = 2.57s

22/21150 (epoch 0.052), train_loss = nan, grad/param norm = 5.7113e+01, time/batch = 2.43s

23/21150 (epoch 0.054), train_loss = nan, grad/param norm = 5.5728e+01, time/batch = 2.40s

24/21150 (epoch 0.057), train_loss = nan, grad/param norm = 5.4416e+01, time/batch = 2.38s

25/21150 (epoch 0.059), train_loss = nan, grad/param norm = 5.3173e+01, time/batch = 2.37s

xonobo commented 9 years ago

According to the comments at karpathy/char-rnn#58, I understand that cltorch will not support using the CPU as an OpenCL device.

I think it would be interesting to compare torch's CPU-parallelization and OpenCL-parallelization performance on bare CPU architectures.
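A rough sketch of how such a comparison could look; the :cl() conversion and ClTensor matrix multiply are assumed here to mirror cutorch's :cuda(), so treat those details as unverified:

require('torch')
require('cltorch')

local a = torch.FloatTensor(1000, 1000):uniform()
local b = torch.FloatTensor(1000, 1000):uniform()

-- native CPU torch
local timer = torch.Timer()
for i = 1, 10 do local c = torch.mm(a, b) end
print('native torch on CPU: ' .. timer:time().real .. 's')

-- OpenCL on the same CPU (device 1 was the Xeon in the dump above)
cltorch.setDevice(1)
local acl, bcl = a:cl(), b:cl()
timer:reset()
for i = 1, 10 do local c = torch.mm(acl, bcl) end
cltorch.finish()  -- flush the queue so the timing is meaningful
print('cltorch on CPU via OpenCL: ' .. timer:time().real .. 's')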

hughperkins commented 9 years ago

So, cltorch totally doesn't support using the CPU as an OpenCL device. A couple of reasons for this:

It's probably technically possible to get cltorch to work on CPUs, but the amount of work required would be phenomenal, and it would still probably be slower than just using standard CPU-based torch.

xonobo commented 9 years ago

test results are perfect :)

th -l cltorch -e 'cltorch.test()'
running tests...
aftter requiring cltorch.unit_storage
Running 1 tests | ==> testbasic
Using NVIDIA Corporation platform: NVIDIA CUDA Using device: Quadro 410
==> Done
Completed 11 asserts in 1 tests with 0 errors
...
==> Done
Completed 156 asserts in 90 tests with 0 errors
all tests finished

hughperkins commented 9 years ago

Cool :-)