Need help for backward training

russellfei commented 10 years ago

Hi, all~

Currently, I'm plugging nn modules together through a Sequential container. My NN script is adapted from @soumith's galaxyzoo for CUDA usage. Everything else works fine, but this error message is quite confusing. I've checked Sequential.lua and found that model:backward(input, df_do:cuda()) goes through Module.lua and later Power.lua. There are two identical code snippets in my script; one of them works fine and the other doesn't. Can anyone help me figure this out?

BTW, some functions in Module.lua do nothing with their input parameters. Are those parameters cleared when zeroParameters() is called?

/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Power.lua:18: bad argument #1 to 'copy' (sizes do not match)                    
stack traceback:
    [C]: in function 'copy'
    /usr/local/share/lua/5.1/nn/Power.lua:18: in function 'updateGradInput'
    /usr/local/share/lua/5.1/nn/Sequential.lua:48: in function 'updateGradInput'
    /usr/local/share/lua/5.1/nn/Sequential.lua:48: in function 'updateGradInput'
    /usr/local/share/lua/5.1/nn/Module.lua:30: in function 'backward'
    <my_persional_lua_script>.lua:1029: in function 'opfunc'
    /usr/local/share/lua/5.1/optim/sgd.lua:40: in function 'optimMethod'

Thanks~

soumith commented 10 years ago

@russellfei can you provide a small snippet of your model, along with the input tensor sizes?

Your network's math does not seem to work out; maybe you are providing a gradOutput that is too big.

russellfei commented 10 years ago

Thx~ @soumith

----------------------------------------------------------------------
function train()

   -- epoch tracker
   epoch = epoch or 1

   -- local vars
   local time = sys.clock()
   local batchSize = opt.batchSize

   -- create augmented dataset
   if opt.augment == 'false' then
      ----> added by r.f.
      local totalSize = trainData:size()
      -- shuffle at each epoch
      trsize = 1680
      local shuffle = torch.randperm(trsize):type('torch.LongTensor')
      -- BDHW mode
      local inputs = torch.Tensor(totalSize,nBands,height,width)
      local targets = torch.Tensor(totalSize):zero()
      -- shuffle input
      inputs = trainData.data:index(1,shuffle)
      targets = trainData.labels:index(1,shuffle)
      --print('targets of train data')
      --print(targets)
      -- do one epoch
      print('==> doing epoch on training data:')
      print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')

      for t = 1,totalSize,opt.batchSize do
         -- disp progress
         xlua.progress(t, totalSize)

         -- create mini batch
         -------------------------------------------------------
         -- the key for use CUDA lies in the support in torch lib
         -- not in table, as a result, this code will surely fail.
         -- need to add flag 'bmode' for handling with cudaconvnet api
         -- TBD
         ------------------------------------------------------------
         local input = inputs[t]
         local target = targets[t]
         -- evaluate function for complete mini batch
         ---> get all output at first ---------------
         --> error: input is not a floatTensor ???
         -- essential data format
         if opt.type == 'double' then input = input:double() end
         if opt.type == 'cuda' then input = input:cuda() end
         -- optimize on current mini-batch
         ------------------------------------------------------------
         -- optim function
         -- create closure to evaluate f(X) and df/dX
         local feval = function(x)

            --print('--> data preparation')
            local batchSize = opt.batchSize
            -- get new parameters
            if x ~= parameters then
               parameters:copy(x)
            end
            -- reset gradients
            gradParameters:zero()

            -- f is the average of all criterions
            local f = 0

            --print('---> forward propagation')
            local outputs = model:forward(input)
            outputs = outputs:float()
            ---> transfer to floatTensor to calculate
            -- calculate gradient matrix
            local df_do = torch.Tensor(outputs:size())

            --print('---> gradients accumulation')
            for i = 1,batchSize do
               -- estimate f
               local err = criterion:forward(outputs, target)
               f = f + err
               --print('add err 1')
               -- estimate df/dW
               -- split to calculate df_do
               df_do = criterion:backward(outputs,target)
               --print('---> backprop')
               -- do backwards together
               if opt.type == 'cuda' then
                  model:backward( input,df_do:cuda() )
               else
                  model:backward( input,df_do )
               end
               -- update confusion
               confusion:add(outputs, target)
            end

            -- normalize gradients and f(X)
            gradParameters:div( batchSize )
            f = f/batchSize
            -- check for convergence at 1st epoch
            -- if error doesn't decrease to less than half
            -- that model might be diverged.
            --print('err: ' .. (f))
            -- return f and df/dX
            return f,gradParameters
         end
         --print '------>start to optim'
         if optimMethod == optim.asgd then
            _,_,average = optimMethod(feval, parameters, optimState)
         else
            optimMethod(feval, parameters, optimState)
         end
      end
   else
      -- augmented inputs and targets
      -- store entire augment dataset needs 155G RAM
      -- do immediate augment as alternatives
      if opt.augment == 'true' then
         local bangIdx = 2640
         trsize = 1680
         local totalSize = bangIdx * trsize
         local shuffle = torch.randperm(trsize):type('torch.LongTensor')
         -- BDHW mode
         local in_inputs = torch.Tensor(trsize,nBands,height,width)
         local in_targets = torch.Tensor(trsize):zero()
         -- shuffle input
         in_inputs = trainData.data:index(1,shuffle)
         in_targets = trainData.labels:index(1,shuffle)
         -- augmented one image
         local inputs = torch.Tensor(bangIdx,nBands,height,width)
         local targets = torch.Tensor(bangIdx):zero()

         -- do one epoch
         print('==> doing epoch on training data:')
         print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')

         for t = 1,totalSize,opt.batchSize do
            -- disp progress
            xlua.progress(t, totalSize)

            -- augment first image
            if  (t-1) % bangIdx == 0 then
               -- originImageIndex: j
               local j = torch.ceil(t/bangIdx)
               inputs,targets = dataBang(in_inputs[j],in_targets[j])
            end
            -- create mini batch
            --print('==> map index')
            -- related idx for inputs (1-based: t = bangIdx maps to bangIdx, not 0)
            p_idx = (t - 1) % bangIdx + 1
            --print('p idx = '..p_idx..', t = '..t)
            local input = inputs[p_idx]
            local target = targets[p_idx]

            -- essential data format
            if opt.type == 'double' then input = input:double() end
            if opt.type == 'cuda' then input = input:cuda() end
            ------------------------------------------------------------
            -- optim function
            -- create closure to evaluate f(X) and df/dX
            local feval = function(x)

               --print('--> data preparation')
               local batchSize = opt.batchSize
               -- get new parameters
               if x ~= parameters then
                  parameters:copy(x)
               end
               -- reset gradients
               gradParameters:zero()

               -- f is the average of all criterions
               local f = 0

               --print('---> forward propagation')
               local outputs = model:forward(input)
               outputs = outputs:float()
               ---> transfer to floatTensor to calculate
               -- calculate gradient matrix
               local df_do = torch.Tensor(outputs:size())

               --print('---> gradients accumulation')
               for i = 1,batchSize do
                  -- estimate f
                  local err = criterion:forward(outputs, target)
                  f = f + err
                  --print('add err 1')
                  -- estimate df/dW
                  -- split to calculate df_do
                  df_do = criterion:backward(outputs,target)
                  --print('---> backprop')
                  -- do backwards together
                  if opt.type == 'cuda' then
                     model:backward( input,df_do:cuda() )
                  else
                     model:backward( input,df_do )
                  end
                  -- update confusion
                  confusion:add(outputs, target)
               end

               -- normalize gradients and f(X)
               gradParameters:div( batchSize )
               f = f/batchSize
               -- check for convergence at 1st epoch
               -- if error doesn't decrease to less than half
               -- that model might be diverged.
               --print('err: ' .. (f))
               -- return f and df/dX
               return f,gradParameters
            end

            -- optimize on current mini-batch
            --print ('==> start to optim')
            if optimMethod == optim.asgd then
               _,_,average = optimMethod(feval, parameters, optimState)
            else
               optimMethod(feval, parameters, optimState)
            end
         end
      else
         print 'error at data augment flag value'
      end
   end

   --------end of local optim function--------------------------------------
   -- time taken
   time = sys.clock() - time
   time = time / trainData:size()
   print("\n==> time to learn 1 sample = " .. (time*1000) .. 'ms')

   -- print confusion matrix
   print(confusion)
   sys.sleep(1)
   -- update logger/plot
   trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid * 100}
   if opt.plot then
      trainLogger:style{['% mean class accuracy (train set)'] = '-'}
      trainLogger:plot()
   end

   -- save/log current net
   local filename = paths.concat(opt.save, 'model.net')
   os.execute('mkdir -p ' .. sys.dirname(filename))
   print('==> saving model to '..filename)
   torch.save(filename, model)

   -- next epoch
   confusion:zero()
   epoch = epoch + 1
end

In the snippet above there are two identical feval functions, and each call processes only one image at a time. opt.augment is a switch that creates many small images from the original input (3x256x256, cropped to 3x224x224, then resized to 3x112x112).

The failing call is the model:backward( input, df_do:cuda() ) in the section where opt.augment == 'true'.

According to the source of model:backward, it takes the input and adjusts the results with df_do. The same line works fine in the other branch, so why does this one fail? T_T

soumith commented 10 years ago

OK, so if one feval works fine and the other fails, your dataBang function is probably not giving correctly sized inputs. Right before the model:forward line, in both locations, print the input size with print(#input). You will then see whether, in your second (augment == 'true') code path, the inputs are shaped wrong by dataBang (at least that is what I suspect).

russellfei commented 10 years ago

Morning~ @soumith Well, I tried that before: the input size is 3x256x256 when augment == "false" and 3x112x112 when augment == "true". They are actually fed into different network architectures, which are listed below.

model = nn.Sequential()

if opt.model == 'convnet' then
   -- input dimensions
   if opt.augment == 'true' then
      nBands = 3
      width = 112
      height = 112
      --TODO: specify augmented cnn arch
      hidConv = {96,128,256,384,512,768,210}
      filtsize = {5,5,3,3,3,3}
      poolsize = {2,0,3,0,4,0}

      -- stage 1 : filter bank -> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(nBands, hidConv[1], filtsize[1], filtsize[1]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[1],2,poolsize[1],poolsize[1],poolsize[1],poolsize[1]))
      -- stage 2 : filter bank -> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[1], hidConv[2], filtsize[2], filtsize[2]))
      model:add(nn.ReLU())
      --model:add(nn.SpatialLPPooling(hidConv[2],2,poolsize[2],poolsize[2],poolsize[2],poolsize[2]))
      -- stage 3: filter bank --> nonlinear  -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[2], hidConv[3], filtsize[3], filtsize[3]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[3],poolsize[3],poolsize[3],poolsize[3],poolsize[3]))

      -- stage 4: filter bank --> nonlinear --> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[3], hidConv[4], filtsize[4], filtsize[4]))
      model:add(nn.ReLU())
      --model:add(nn.SpatialLPPooling(hidConv[4],poolsize[4],poolsize[4],poolsize[4],poolsize[4]))

      -- stage 5: filter bank --> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[4], hidConv[5], filtsize[5], filtsize[5]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[5],poolsize[5],poolsize[5],poolsize[5],poolsize[5]))

      -- stage 6: filter bank --> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
      model:add(nn.ReLU())
      --model:add(nn.SpatialLPPooling(hidConv[6],poolsize[6],poolsize[6],poolsize[6],poolsize[6]))

      -- stage 6 : standard 2-layer neural network
      model:add(nn.Reshape(hidConv[6]))
      model:add(nn.Linear(hidConv[6], hidConv[7]))
      model:add(nn.Tanh())
      model:add(nn.Linear(hidConv[7], noutputs))
   else
      if opt.augment == 'false' then
         nBands = 3
         width = 256
         height = 256
         -- hidden units, filter sizes (for ConvNet only):
         hidConv = {128,256,384,512,768,768,210}
         filtsize = {5,7,5,5,3,3}
         poolsize = {2,2,2,2,2,3}

         -- stage 1 : filter bank -> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(nBands, hidConv[1], filtsize[1], filtsize[1]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[1],2,poolsize[1],poolsize[1],poolsize[1],poolsize[1]))
         -- stage 2 : filter bank -> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[1], hidConv[2], filtsize[2], filtsize[2]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[2],2,poolsize[2],poolsize[2],poolsize[2],poolsize[2]))
         -- stage 3: filter bank --> nonlinear  -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[2], hidConv[3], filtsize[3], filtsize[3]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[3],poolsize[3],poolsize[3],poolsize[3],poolsize[3]))

         -- stage 4: filter bank --> nonlinear --> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[3], hidConv[4], filtsize[4], filtsize[4]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[4],poolsize[4],poolsize[4],poolsize[4],poolsize[4]))

         -- stage 5: filter bank --> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[4], hidConv[5], filtsize[5], filtsize[5]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[5],poolsize[5],poolsize[5],poolsize[5],poolsize[5]))

         -- stage 6: filter bank --> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[6],poolsize[6],poolsize[6],poolsize[6],poolsize[6]))

         -- stage 6 : standard 2-layer neural network
         model:add(nn.Reshape(hidConv[6]))
         model:add(nn.Linear(hidConv[6], hidConv[7]))
         model:add(nn.Tanh())
         model:add(nn.Linear(hidConv[7], noutputs))
      end
   end
end

The forward process is identical, because only one image is passed forward and back at a time. Besides, I've checked my net architecture parameters and they are consistent with my intended computation, i.e., the output is 1x1 for each feature map.
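The 1x1 claim can be checked by hand. Below is a minimal sketch in plain Lua (no Torch required), assuming unpadded stride-1 convolutions and pooling whose stride equals its kernel size, which is how the listing above intends the layers to be configured. The helper names convOut, poolOut and traceSizes are hypothetical and exist only for this illustration.

```lua
-- Output width of an unpadded, stride-1 convolution with kernel size k.
local function convOut(size, k)
   return size - k + 1
end

-- Output width of a pooling/subsampling layer with kernel k and the given stride.
local function poolOut(size, k, stride)
   return math.floor((size - k) / stride) + 1
end

-- Walk the conv -> (optional pool) stages; a poolsize of 0 means "no pooling".
-- Returns the final spatial size and the size after each stage.
local function traceSizes(inputSize, filtsize, poolsize)
   local size = inputSize
   local trace = {}
   for i = 1, #filtsize do
      size = convOut(size, filtsize[i])
      if poolsize[i] > 0 then
         size = poolOut(size, poolsize[i], poolsize[i])
      end
      trace[i] = size
   end
   return size, trace
end

-- augment == 'true' settings from the listing above
local finalAug, traceAug = traceSizes(112, {5,5,3,3,3,3}, {2,0,3,0,4,0})
print(table.concat(traceAug, ' '))  -- 54 50 16 14 3 1

-- augment == 'false' settings from the listing above
local finalPlain = traceSizes(256, {5,7,5,5,3,3}, {2,2,2,2,2,3})
print(finalAug, finalPlain)
```

Under these assumptions both configurations do shrink to a 1x1 map, so the spatial arithmetic is not what breaks. The sketch says nothing about plane (channel) counts or about how the pooling arguments are actually bound.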

I'll check it again.

soumith commented 10 years ago

The network looks fine. What I am saying is: check that your dataBang function always gives out 112x112 crops. When doing random crops, you might hit a corner case somewhere. Add this line right before the forward call: print(#input). Check that the input size is exactly the same for every sample, and that dataBang is not sometimes returning 112x111, for example.
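Eyeballing thousands of printed sizes is error-prone, so the same check can be automated. The following is a rough sketch in plain Lua, assuming each sample's size has already been copied into an ordinary Lua table (in Torch this could be done by reading input:size(d) for each dimension d). The helper sizeMismatch is hypothetical, not part of nn.

```lua
-- Compare a sample's size (a plain Lua table) against the expected size.
-- Returns a description of the first difference, or nil if they match.
local function sizeMismatch(actual, expected)
   if #actual ~= #expected then
      return string.format('expected %d dims, got %d', #expected, #actual)
   end
   for d = 1, #expected do
      if actual[d] ~= expected[d] then
         return string.format('dim %d: expected %d, got %d',
                              d, expected[d], actual[d])
      end
   end
   return nil
end

local expected = {3, 112, 112}
print(sizeMismatch({3, 112, 112}, expected))  -- nil
print(sizeMismatch({3, 112, 111}, expected))  -- dim 3: expected 112, got 111
```

Calling something like this on every sample and stopping at the first non-nil result would point directly at the crop that hits the corner case, if there is one.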

russellfei commented 10 years ago

@soumith However, this error is raised when the very first image is backpropagated,

and the input size for that first image is

input............................. 1/4435200 ..................................] ETA: 0ms | Step: 0ms                              
   3
 112
 112
[torch.LongStorage of size 3]

df_do   

 21
[torch.LongStorage of size 1]

using this snippet

            print()
            print('input')
            print(#input)
            --print('---> forward propagation')
            local outputs = model:forward(input)
            outputs = outputs:float()
            ---> transfer to floatTensor to calculate
            -- calculate gradient matrix 
            local df_do = torch.Tensor(outputs:size())
            print('df_do')
            print(#df_do)

Maybe df_do is not strictly defined?

But I also checked the input and df_do when augment == "false", and the sizes are consistent in both cases.

soumith commented 10 years ago

The size of df_do should be equal to noutputs, AFAIK.

soumith commented 10 years ago

Also, try replacing the LPPooling with MaxPooling and see if that works, just to be sure nothing funky is going on with LPPooling.

russellfei commented 10 years ago

Thanks @soumith I'll check that, it's a really strange error. Night ;-)

russellfei commented 10 years ago

Genius!!! @soumith it works! Thanks again! I've been struggling with this error for almost 20h in 2 days

Notes: using print(model), I've noticed that the nn.Power module is part of SpatialLPPooling. Maybe I should read more of the lower-level implementation.

soumith commented 10 years ago

what was the solution?

russellfei commented 10 years ago

Maybe there's something wrong with how I call SpatialLPPooling. I'll figure that out. ;-)

russellfei commented 10 years ago

I changed SpatialMaxPooling back to SpatialLPPooling and got an error message like this:

==> defining some tools 
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
  (1): nn.SpatialConvolutionMM
  (2): nn.ReLU
  (3): nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): nn.Square
    (2): nn.SpatialSubSampling
    (3): nn.Sqrt
  }
  (4): nn.SpatialConvolutionMM
  (5): nn.ReLU
  (6): nn.SpatialConvolutionMM
  (7): nn.ReLU
  (8): nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): nn.Square
    (2): nn.SpatialSubSampling
    (3): nn.Sqrt
  }
  (9): nn.SpatialConvolutionMM
  (10): nn.ReLU
  (11): nn.SpatialConvolutionMM
  (12): nn.ReLU
  (13): nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): nn.Square
    (2): nn.SpatialSubSampling
    (3): nn.Sqrt
  }
  (14): nn.SpatialConvolutionMM
  (15): nn.ReLU
  (16): nn.Reshape
  (17): nn.Linear
  (18): nn.Tanh
  (19): nn.Linear
  (20): nn.LogSoftMax
}
==> configuring optimizer   
==> training!   
==> doing epoch on training data:   
==> online epoch # 1 [batchSize = 1]    
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Sequential.lua:37: size mismatch
stack traceback:
    [C]: in function 'updateOutput'
    /usr/local/share/lua/5.1/nn/Sequential.lua:37: in function 'forward'
    ucmcnn_aug_LP.lua:939: in function 'opfunc'
    /usr/local/share/lua/5.1/optim/sgd.lua:40: in function 'optimMethod'
    ucmcnn_aug_LP.lua:979: in function 'train'
    ucmcnn_aug_LP.lua:1166: in main chunk
    [C]: in function 'dofile'
    /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:109: in main chunk
    [C]: at 0x00404480

Well, there is a block of code in SpatialLPPooling like

   if pnorm == 2 then
      self:add(nn.Square())
   else
      self:add(nn.Power(pnorm))
   end
   self:add(nn.SpatialSubSampling(nInputPlane, kW, kH, dW, dH))
   if pnorm == 2 then
      self:add(nn.Sqrt())
   else
      self:add(nn.Power(1/pnorm))
   end

   self:get(2).bias:zero()
   self:get(2).weight:fill(1)

I think there's some rule to follow when SpatialLPPooling is called. BTW, the nn.Power error is due to the missing pnorm argument in the later SpatialLPPooling definitions.
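To make the effect of the missing argument concrete, here is a small sketch in plain Lua. bindLPPoolingArgs is a hypothetical stand-in that only records how positional arguments bind; it assumes the parameter order (nInputPlane, pnorm, kW, kH, dW, dH) suggested by the calls and the constructor excerpt above, and it does not build any real module.

```lua
-- Hypothetical stand-in: records which value lands in which parameter.
-- Assumed order: nInputPlane, pnorm, kW, kH, dW, dH.
local function bindLPPoolingArgs(nInputPlane, pnorm, kW, kH, dW, dH)
   return {nInputPlane = nInputPlane, pnorm = pnorm,
           kW = kW, kH = kH, dW = dW, dH = dH}
end

local hidConv  = {96, 128, 256, 384, 512, 768, 210}
local poolsize = {2, 0, 3, 0, 4, 0}

-- Stage 1 style call: pnorm passed explicitly as 2.
local good = bindLPPoolingArgs(hidConv[3], 2,
                               poolsize[3], poolsize[3], poolsize[3], poolsize[3])

-- Stage 3 style call: pnorm omitted, so every later argument shifts left.
local bad = bindLPPoolingArgs(hidConv[3],
                              poolsize[3], poolsize[3], poolsize[3], poolsize[3])

print(good.pnorm, good.dH)  -- 2   3
print(bad.pnorm, bad.dH)    -- 3   nil
```

With pnorm omitted, the first pool size is read as the norm, so the pnorm == 2 branch is skipped and nn.Power is used instead of nn.Square and nn.Sqrt. That is consistent with Power.lua showing up in the original stack trace.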

In short, there's something here we have to dig into.

Too tired to continue, see you~

russellfei commented 10 years ago

The bug has been caught, and it's a really small one!

         -- stage 6: filter bank --> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))

should be

         -- stage 6: filter bank --> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[5], hidConv[6], filtsize[6], filtsize[6]))
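
This kind of typo can be caught before training with a simple consistency check on the plane counts. The sketch below is plain Lua; findChannelMismatch and buildConvs are hypothetical helpers that model each convolution only as a pair of input and output plane counts taken from the hidConv tables earlier in this thread.

```lua
-- Return the index of the first convolution whose input plane count differs
-- from the previous convolution's output plane count, or nil if all chain.
local function findChannelMismatch(convs)
   for i = 2, #convs do
      if convs[i].nIn ~= convs[i - 1].nOut then
         return i
      end
   end
   return nil
end

-- Build the six convolution stages; stage6In is the hidConv index used as
-- the input plane count of stage 6 (6 in the original code, 5 in the fix).
local function buildConvs(nBands, hidConv, stage6In)
   local convs = {{nIn = nBands, nOut = hidConv[1]}}
   for i = 2, 5 do
      convs[i] = {nIn = hidConv[i - 1], nOut = hidConv[i]}
   end
   convs[6] = {nIn = hidConv[stage6In], nOut = hidConv[6]}
   return convs
end

local augTrue  = {96, 128, 256, 384, 512, 768, 210}
local augFalse = {128, 256, 384, 512, 768, 768, 210}

print(findChannelMismatch(buildConvs(3, augTrue, 6)))   -- 6
print(findChannelMismatch(buildConvs(3, augTrue, 5)))   -- nil
print(findChannelMismatch(buildConvs(3, augFalse, 6)))  -- nil
```

The last line shows why the augment == 'false' branch never complained: there hidConv[5] and hidConv[6] are both 768, so the same typo is invisible.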

However, over these past hours I've noticed another weird thing, and I'll open a new issue for it.

Thanks~ @soumith

russellfei commented 10 years ago

According to the contribution guidelines of torch, please delete this issue, because it is a personal help request that should have been posted on the mailing list (the Google group, which I often have no access to). Thanks @soumith

torch / nn

Need help for backward training #91