torch / demos

Demos and tutorials around Torch7.

Using Cuda with train-a-digit-classifier #43

Open yyaddaden opened 8 years ago

yyaddaden commented 8 years ago

Hi,

I have used train-a-digit-classifier in CPU mode and it worked well, but now I want to test it in GPU mode. I have an NVIDIA Jetson TK1 on which I have installed CUDA 6.5 and all the other prerequisites. I have also installed Torch7 and the two packages cutorch and cunn.

Some tutorials say that to use the GPU with CUDA, you only need to add a couple of lines of code:

require 'cunn', in order to use CUDA

model:cuda(), to convert the nn model to CUDA
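
(For reference, the minimal conversion this thread converges on is sketched below. The toy model and tensors are illustrative, not part of the demo script; the key point is that the inputs and targets must be moved to the GPU along with the model and criterion.)

require 'nn'
require 'cunn'   -- also loads cutorch

-- toy model, just to illustrate the conversion
local model = nn.Sequential()
model:add(nn.Linear(10, 2))
model:add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()

model = model:cuda()          -- move the network to the GPU
criterion = criterion:cuda()  -- and the loss

-- data fed to a CUDA model must be CUDA tensors too
local inputs  = torch.randn(4, 10):cuda()
local targets = torch.Tensor{1, 2, 1, 2}:cuda()
print(criterion:forward(model:forward(inputs), targets))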

But when I run qlua train-on-mnist.lua, I get some errors. Can you help me?

Regards.

yyaddaden commented 8 years ago

The problem is resolved; the answer is here: http://stackoverflow.com/q/36992803/6091401

You can close the issue. Thank you.

amiltonwong commented 7 years ago

For reference, here is the complete code:

----------------------------------------------------------------------
-- This script shows how to train different models on the MNIST 
-- dataset, using multiple optimization techniques (SGD, LBFGS)
--
-- This script demonstrates a classical example of training 
-- well-known models (convnet, MLP, logistic regression)
-- on a 10-class classification problem. 
--
-- It illustrates several points:
-- 1/ description of the model
-- 2/ choice of a loss function (criterion) to minimize
-- 3/ creation of a dataset as a simple Lua table
-- 4/ description of training and test procedures
--
-- Clement Farabet
----------------------------------------------------------------------

require 'torch'
require 'nn'
require 'nnx'
require 'optim'
require 'image'
require 'dataset-mnist'
require 'pl'
require 'paths'
require 'cunn'
require 'cutorch'
----------------------------------------------------------------------
-- parse command-line options
--
local opt = lapp[[
   -s,--save          (default "logs")      subdirectory to save logs
   -n,--network       (default "")          reload pretrained network
   -m,--model         (default "convnet")   type of model to train: convnet | mlp | linear
   -f,--full                                use the full dataset
   -p,--plot                                plot while training
   -o,--optimization  (default "SGD")       optimization: SGD | LBFGS 
   -r,--learningRate  (default 0.05)        learning rate, for SGD only
   -b,--batchSize     (default 10)          batch size
   --momentum         (default 0)           momentum, for SGD only
   -i,--maxIter       (default 3)           maximum nb of iterations per batch, for LBFGS
   --coefL1           (default 0)           L1 penalty on the weights
   --coefL2           (default 0)           L2 penalty on the weights
   -t,--threads       (default 4)           number of threads
]]
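
-- usage sketch (not part of the original script): the flags above can be
-- combined freely on the command line, e.g.
--   th train-on-mnist.lua --model mlp --plot
--   th train-on-mnist.lua -o LBFGS -b 1000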

-- fix seed
torch.manualSeed(1)

-- threads
torch.setnumthreads(opt.threads)
print('<torch> set nb of threads to ' .. torch.getnumthreads())

-- use floats, for SGD
if opt.optimization == 'SGD' then
   torch.setdefaulttensortype('torch.FloatTensor')
end

-- batch size?
if opt.optimization == 'LBFGS' and opt.batchSize < 100 then
   error('LBFGS should not be used with small mini-batches; 1000 is recommended')
end

----------------------------------------------------------------------
-- define model to train
-- on the 10-class classification problem
--
classes = {'1','2','3','4','5','6','7','8','9','10'}

-- geometry: width and height of input images
geometry = {32,32}

if opt.network == '' then
   -- define model to train
   model = nn.Sequential()

   if opt.model == 'convnet' then
      ------------------------------------------------------------
      -- convolutional network 
      ------------------------------------------------------------
      -- stage 1 : mean suppression -> filter bank -> squashing -> max pooling
      model:add(nn.SpatialConvolutionMM(1, 32, 5, 5))
      model:add(nn.Tanh())
      model:add(nn.SpatialMaxPooling(3, 3, 3, 3, 1, 1))
      -- stage 2 : mean suppression -> filter bank -> squashing -> max pooling
      model:add(nn.SpatialConvolutionMM(32, 64, 5, 5))
      model:add(nn.Tanh())
      model:add(nn.SpatialMaxPooling(2, 2, 2, 2))
      -- stage 3 : standard 2-layer MLP:
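      -- (sizes, for reference: a 32x32 input is 28x28 after the first conv,
      -- 10x10 after the first pooling, 6x6 after the second conv and 3x3
      -- after the second pooling, hence the 64*3*3 reshape below)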
      model:add(nn.Reshape(64*3*3))
      model:add(nn.Linear(64*3*3, 200))
      model:add(nn.Tanh())
      model:add(nn.Linear(200, #classes))
      ------------------------------------------------------------

   elseif opt.model == 'mlp' then
      ------------------------------------------------------------
      -- regular 2-layer MLP
      ------------------------------------------------------------
      model:add(nn.Reshape(1024))
      model:add(nn.Linear(1024, 2048))
      model:add(nn.Tanh())
      model:add(nn.Linear(2048,#classes))
      ------------------------------------------------------------

   elseif opt.model == 'linear' then
      ------------------------------------------------------------
      -- simple linear model: logistic regression
      ------------------------------------------------------------
      model:add(nn.Reshape(1024))
      model:add(nn.Linear(1024,#classes))
      ------------------------------------------------------------

   else
      print('Unknown model type')
      error()
   end
else
   print('<trainer> reloading previously trained network')
   model = torch.load(opt.network)
end

-- retrieve parameters and gradients
parameters,gradParameters = model:getParameters()
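
-- NOTE: as ocanevet points out at the end of this thread, once the CUDA
-- conversion below is added, this call should come *after* model:cuda();
-- otherwise optim updates a stale CPU copy of the weights.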

-- verbose
print('<mnist> using model:')
print(model)

----------------------------------------------------------------------
-- loss function: negative log-likelihood
--
model:add(nn.LogSoftMax())
criterion = nn.ClassNLLCriterion()

-- convert into cuda version
model = model:cuda()
-- convert into cuda version
criterion = criterion:cuda()

----------------------------------------------------------------------
-- get/create dataset
--
if opt.full then
   nbTrainingPatches = 60000
   nbTestingPatches = 10000
else
   nbTrainingPatches = 2000
   nbTestingPatches = 1000
   print('<warning> only using 2000 samples to train quickly (use flag --full to use 60000 samples)')
end

-- create training set and normalize
trainData = mnist.loadTrainSet(nbTrainingPatches, geometry)
trainData:normalizeGlobal(mean, std)
trainData.data = trainData.data:cuda()
trainData.labels = trainData.labels:cuda()

-- create test set and normalize
testData = mnist.loadTestSet(nbTestingPatches, geometry)
testData:normalizeGlobal(mean, std)
testData.labels = testData.labels:cuda()
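
-- NOTE: unlike trainData above, testData.data is left as a CPU tensor here;
-- this asymmetry is part of what the follow-up comments below have to fix.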

----------------------------------------------------------------------
-- define training and testing functions
--

-- this matrix records the current confusion across classes
confusion = optim.ConfusionMatrix(classes)

-- log results to files
trainLogger = optim.Logger(paths.concat(opt.save, 'train.log'))
testLogger = optim.Logger(paths.concat(opt.save, 'test.log'))

-- training function
function train(dataset)
   -- epoch tracker
   epoch = epoch or 1

   -- local vars
   local time = sys.clock()

   -- do one epoch
   print('<trainer> on training set:')
   print("<trainer> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')
   for t = 1,dataset:size(),opt.batchSize do
      -- create mini batch
      local inputs = torch.Tensor(opt.batchSize,1,geometry[1],geometry[2])
      local targets = torch.Tensor(opt.batchSize)
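      -- NOTE: since the dataset was moved to the GPU above, these mini-batch
      -- buffers must be torch.CudaTensor as well, both here and in test();
      -- see yasudak's fix further down the thread.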
      local k = 1
      for i = t,math.min(t+opt.batchSize-1,dataset:size()) do
         -- load new sample
         local sample = dataset[i]
         local input = sample[1]:clone()
         local _,target = sample[2]:clone():max(1)
         target = target:squeeze()
         inputs[k] = input
         targets[k] = target
         k = k + 1
      end

      -- create closure to evaluate f(X) and df/dX
      local feval = function(x)
         -- just in case:
         collectgarbage()

         -- get new parameters
         if x ~= parameters then
            parameters:copy(x)
         end

         -- reset gradients
         gradParameters:zero()

         -- evaluate function for complete mini batch
         local outputs = model:forward(inputs)
         local f = criterion:forward(outputs, targets)

         -- estimate df/dW
         local df_do = criterion:backward(outputs, targets)
         model:backward(inputs, df_do)

         -- penalties (L1 and L2):
         if opt.coefL1 ~= 0 or opt.coefL2 ~= 0 then
            -- locals:
            local norm,sign= torch.norm,torch.sign

            -- Loss:
            f = f + opt.coefL1 * norm(parameters,1)
            f = f + opt.coefL2 * norm(parameters,2)^2/2

            -- Gradients:
            gradParameters:add( sign(parameters):mul(opt.coefL1) + parameters:clone():mul(opt.coefL2) )
         end

         -- update confusion
         for i = 1,opt.batchSize do
            confusion:add(outputs[i], targets[i])
         end

         -- return f and df/dX
         return f,gradParameters
      end

      -- optimize on current mini-batch
      if opt.optimization == 'LBFGS' then

         -- Perform LBFGS step:
         lbfgsState = lbfgsState or {
            maxIter = opt.maxIter,
            lineSearch = optim.lswolfe
         }
         optim.lbfgs(feval, parameters, lbfgsState)

         -- disp report:
         print('LBFGS step')
         print(' - progress in batch: ' .. t .. '/' .. dataset:size())
         print(' - nb of iterations: ' .. lbfgsState.nIter)
         print(' - nb of function evaluations: ' .. lbfgsState.funcEval)

      elseif opt.optimization == 'SGD' then

         -- Perform SGD step:
         sgdState = sgdState or {
            learningRate = opt.learningRate,
            momentum = opt.momentum,
            learningRateDecay = 5e-7
         }
         optim.sgd(feval, parameters, sgdState)

         -- disp progress
         xlua.progress(t, dataset:size())

      else
         error('unknown optimization method')
      end
   end

   -- time taken
   time = sys.clock() - time
   time = time / dataset:size()
   print("<trainer> time to learn 1 sample = " .. (time*1000) .. 'ms')

   -- print confusion matrix
   print(confusion)
   trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid * 100}
   confusion:zero()

   -- save/log current net
   local filename = paths.concat(opt.save, 'mnist.net')
   os.execute('mkdir -p ' .. sys.dirname(filename))
   if paths.filep(filename) then
      os.execute('mv ' .. filename .. ' ' .. filename .. '.old')
   end
   print('<trainer> saving network to '..filename)
   -- torch.save(filename, model)

   -- next epoch
   epoch = epoch + 1
end

-- test function
function test(dataset)
   -- local vars
   local time = sys.clock()

   -- test over given dataset
   print('<trainer> on testing Set:')
   for t = 1,dataset:size(),opt.batchSize do
      -- disp progress
      xlua.progress(t, dataset:size())

      -- create mini batch
      local inputs = torch.Tensor(opt.batchSize,1,geometry[1],geometry[2])
      local targets = torch.Tensor(opt.batchSize)
      local k = 1
      for i = t,math.min(t+opt.batchSize-1,dataset:size()) do
         -- load new sample
         local sample = dataset[i]
         local input = sample[1]:clone()
         local _,target = sample[2]:clone():max(1)
         target = target:squeeze()
         inputs[k] = input
         targets[k] = target
         k = k + 1
      end

      -- test samples
      local preds = model:forward(inputs)

      -- confusion:
      for i = 1,opt.batchSize do
         confusion:add(preds[i], targets[i])
      end
   end

   -- timing
   time = sys.clock() - time
   time = time / dataset:size()
   print("<trainer> time to test 1 sample = " .. (time*1000) .. 'ms')

   -- print confusion matrix
   print(confusion)
   testLogger:add{['% mean class accuracy (test set)'] = confusion.totalValid * 100}
   confusion:zero()
end

----------------------------------------------------------------------
-- and train!
--
while true do
   -- train/test
   train(trainData)
   test(testData)

   -- plot errors
   if opt.plot then
      trainLogger:style{['% mean class accuracy (train set)'] = '-'}
      testLogger:style{['% mean class accuracy (test set)'] = '-'}
      trainLogger:plot()
      testLogger:plot()
   end
end
brisker commented 7 years ago

@amiltonwong Error when running the code you provided for training mnist. Why?

~/torch/install/bin/luajit: bad argument #3 to '?' (torch.*Tensor expected, got userdata)
stack traceback:
        [C]: at 0x7fdfe528df90
        [C]: in function '__newindex'
        train-on-mnist.lua:201: in function 'train'
        train-on-mnist.lua:358: in main chunk
        [C]: in function 'dofile'
        .../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

yasudak commented 7 years ago
-      local inputs = torch.Tensor(opt.batchSize,1,geometry[1],geometry[2])
-      local targets = torch.Tensor(opt.batchSize)
+      local inputs = torch.CudaTensor(opt.batchSize,1,geometry[1],geometry[2])
+      local targets = torch.CudaTensor(opt.batchSize)

With this change it worked.

lebinhe commented 7 years ago

When running the code from @amiltonwong and @yasudak, an error occurred:

<torch> set nb of threads to 4
<mnist> using model:
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> output]
  (1): nn.SpatialConvolutionMM(1 -> 32, 5x5)
  (2): nn.Tanh
  (3): nn.SpatialMaxPooling(3x3, 3,3, 1,1)
  (4): nn.SpatialConvolutionMM(32 -> 64, 5x5)
  (5): nn.Tanh
  (6): nn.SpatialMaxPooling(2x2, 2,2)
  (7): nn.Reshape(576)
  (8): nn.Linear(576 -> 200)
  (9): nn.Tanh
  (10): nn.Linear(200 -> 10)
}
<warning> only using 2000 samples to train quickly (use flag -full to use 60000 samples)
loading only 2000 examples
done
loading only 1000 examples
done
<trainer> on training set:
<trainer> online epoch # 1 [batchSize = 10]
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-1543/cutorch/lib/THC/generic/THCTensorMath.cu line=15 error=8 : invalid device function
/data/offline/anaconda2/3rd/torch/install/bin/luajit: ...aconda2/3rd/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
...ne/anaconda2/3rd/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-1543/cutorch/lib/THC/generic/THCTensorMath.cu:15
stack traceback:
        [C]: in function 'v'
        ...ne/anaconda2/3rd/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'SpatialConvolutionMM_updateOutput'
        .../torch/install/share/lua/5.1/nn/SpatialConvolutionMM.lua:63: in function <.../torch/install/share/lua/5.1/nn/SpatialConvolutionMM.lua:53>
        [C]: in function 'xpcall'
        ...aconda2/3rd/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        ...conda2/3rd/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        train-on-mnist-cuda2.lua:219: in function 'opfunc'
        .../anaconda2/3rd/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
        train-on-mnist-cuda2.lua:272: in function 'train'
        train-on-mnist-cuda2.lua:357: in main chunk
        [C]: in function 'dofile'
        .../3rd/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x004064f0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        ...aconda2/3rd/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        ...conda2/3rd/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        train-on-mnist-cuda2.lua:219: in function 'opfunc'
        .../anaconda2/3rd/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
        train-on-mnist-cuda2.lua:272: in function 'train'
        train-on-mnist-cuda2.lua:357: in main chunk
        [C]: in function 'dofile'
        .../3rd/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x004064f0
WEldred commented 7 years ago

This code doesn't work for me. I'm seeing the following output, which tells me that garbage is probably being loaded as the input:

<trainer> time to learn 1 sample = 1.0377799272537ms
ConfusionMatrix:
[[   1  190    0    0    0    0    0    0    0    0]    0.524%  [class: 1]
 [   0  220    0    0    0    0    0    0    0    0]  100.000%  [class: 2]
 [   0  198    0    0    0    0    0    0    0    0]    0.000%  [class: 3]
 [   0  191    0    0    0    0    0    0    0    0]    0.000%  [class: 4]
 [   0  214    0    0    0    0    0    0    0    0]    0.000%  [class: 5]
 [   0  180    0    0    0    0    0    0    0    0]    0.000%  [class: 6]
 [   0  200    0    0    0    0    0    0    0    0]    0.000%  [class: 7]
 [   0  224    0    0    0    0    0    0    0    0]    0.000%  [class: 8]
 [   0  172    0    0    0    0    0    0    0    0]    0.000%  [class: 9]
 [   0  210    0    0    0    0    0    0    0    0]]   0.000%  [class: 10]
 + average row correct: 10.052356021479%
ocanevet commented 7 years ago

You need to put the following lines

-- retrieve parameters and gradients
parameters,gradParameters = model:getParameters()

after the model has been copied to the GPU:

-- convert into cuda version
model = model:cuda()
criterion = criterion:cuda()

Otherwise optim keeps updating the flattened CPU parameters returned by the early getParameters() call, while the GPU copy of the model is never touched: every epoch effectively re-evaluates the initialised model again and again.
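
Concretely, that section of the script posted above becomes (same lines, just reordered):

-- loss function: negative log-likelihood
model:add(nn.LogSoftMax())
criterion = nn.ClassNLLCriterion()

-- convert into cuda version
model = model:cuda()
criterion = criterion:cuda()

-- retrieve parameters and gradients, AFTER the cuda conversion,
-- so that the flattened view points at the GPU storages
parameters,gradParameters = model:getParameters()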