Closed russellfei closed 10 years ago
@russellfei can you provide a small snippet of your model, along with the input tensor sizes?
Your network's math does not seem to work out; maybe you are providing a gradOutput that is too big.
Thx~ @soumith
----------------------------------------------------------------------
function train()
  -- epoch tracker
  epoch = epoch or 1
  -- local vars
  local time = sys.clock()
  local batchSize = opt.batchSize
  -- create augmented dataset
  if opt.augment == 'false' then
    ----> added by r.f.
    local totalSize = trainData:size()
    -- shuffle at each epoch
    trsize = 1680
    local shuffle = torch.randperm(trsize):type('torch.LongTensor')
    -- BDHW mode
    local inputs = torch.Tensor(totalSize,nBands,height,width)
    local targets = torch.Tensor(totalSize):zero()
    -- shuffle input
    inputs = trainData.data:index(1,shuffle)
    targets = trainData.labels:index(1,shuffle)
    --print('targets of train data')
    --print(targets)
    -- do one epoch
    print('==> doing epoch on training data:')
    print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')
    for t = 1,totalSize,opt.batchSize do
      -- disp progress
      xlua.progress(t, totalSize)
      -- create mini batch
      -------------------------------------------------------
      -- the key for use CUDA lies in the support in torch lib
      -- not in table, as a result, this code will surely fail.
      -- need to add flag 'bmode' for handling with cudaconvnet api
      -- TBD
      ------------------------------------------------------------
      local input = inputs[t]
      local target = targets[t]
      -- evaluate function for complete mini batch
      ---> get all output at first ---------------
      --> error: input is not a floatTensor ???
      -- essential data format
      if opt.type == 'double' then input = input:double() end
      if opt.type == 'cuda' then input = input:cuda() end
      -- optimize on current mini-batch
      ------------------------------------------------------------
      -- optim function
      -- create closure to evaluate f(X) and df/dX
      local feval = function(x)
        --print('--> data preparation')
        local batchSize = opt.batchSize
        -- get new parameters
        if x ~= parameters then
          parameters:copy(x)
        end
        -- reset gradients
        gradParameters:zero()
        -- f is the average of all criterions
        local f = 0
        --print('---> forward propagation')
        local outputs = model:forward(input)
        outputs = outputs:float()
        ---> transfer to floatTensor to calculate
        -- calculate gradient matrix
        local df_do = torch.Tensor(outputs:size())
        --print('---> gradients accumulation')
        for i = 1,batchSize do
          -- estimate f
          local err = criterion:forward(outputs, target)
          f = f + err
          --print('add err 1')
          -- estimate df/dW
          -- split to calculate df_do
          df_do = criterion:backward(outputs,target)
          --print('---> backprop')
          -- do backwards together
          if opt.type == 'cuda' then
            model:backward( input,df_do:cuda() )
          else
            model:backward( input,df_do )
          end
          -- update confusion
          confusion:add(outputs, target)
        end
        -- normalize gradients and f(X)
        gradParameters:div( batchSize )
        f = f/batchSize
        -- check for convergence at 1st epoch
        -- if error doesn't decrease to less than half
        -- that model might be diverged.
        --print('err: ' .. (f))
        -- return f and df/dX
        return f,gradParameters
      end
      --print '------>start to optim'
      if optimMethod == optim.asgd then
        _,_,average = optimMethod(feval, parameters, optimState)
      else
        optimMethod(feval, parameters, optimState)
      end
    end
  else
    -- augmented inputs and targets
    -- store entire augment dataset needs 155G RAM
    -- do immediate augment as alternatives
    if opt.augment == 'true' then
      local bangIdx = 2640
      trsize = 1680
      local totalSize = bangIdx * trsize
      local shuffle = torch.randperm(trsize):type('torch.LongTensor')
      -- BDHW mode
      local in_inputs = torch.Tensor(trsize,nBands,height,width)
      local in_targets = torch.Tensor(trsize):zero()
      -- shuffle input
      in_inputs = trainData.data:index(1,shuffle)
      in_targets = trainData.labels:index(1,shuffle)
      -- augmented one image
      local inputs = torch.Tensor(bangIdx,nBands,height,width)
      local targets = torch.Tensor(bangIdx):zero()
      -- do one epoch
      print('==> doing epoch on training data:')
      print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')
      for t = 1,totalSize,opt.batchSize do
        -- disp progress
        xlua.progress(t, totalSize)
        -- augment first image
        if (t-1) % bangIdx == 0 then
          -- originImageIndex: j
          local j = torch.ceil(t/bangIdx)
          inputs,targets = dataBang(in_inputs[j],in_targets[j])
        end
        -- create mini batch
        --print('==> map index')
        -- related idx for inputs
        p_idx = t % bangIdx
        --print('p idx = '..p_idx..', t = '..t)
        local input = inputs[p_idx]
        local target = targets[p_idx]
        -- essential data format
        if opt.type == 'double' then input = input:double() end
        if opt.type == 'cuda' then input = input:cuda() end
        ------------------------------------------------------------
        -- optim function
        -- create closure to evaluate f(X) and df/dX
        local feval = function(x)
          --print('--> data preparation')
          local batchSize = opt.batchSize
          -- get new parameters
          if x ~= parameters then
            parameters:copy(x)
          end
          -- reset gradients
          gradParameters:zero()
          -- f is the average of all criterions
          local f = 0
          --print('---> forward propagation')
          local outputs = model:forward(input)
          outputs = outputs:float()
          ---> transfer to floatTensor to calculate
          -- calculate gradient matrix
          local df_do = torch.Tensor(outputs:size())
          --print('---> gradients accumulation')
          for i = 1,batchSize do
            -- estimate f
            local err = criterion:forward(outputs, target)
            f = f + err
            --print('add err 1')
            -- estimate df/dW
            -- split to calculate df_do
            df_do = criterion:backward(outputs,target)
            --print('---> backprop')
            -- do backwards together
            if opt.type == 'cuda' then
              model:backward( input,df_do:cuda() )
            else
              model:backward( input,df_do )
            end
            -- update confusion
            confusion:add(outputs, target)
          end
          -- normalize gradients and f(X)
          gradParameters:div( batchSize )
          f = f/batchSize
          -- check for convergence at 1st epoch
          -- if error doesn't decrease to less than half
          -- that model might be diverged.
          --print('err: ' .. (f))
          -- return f and df/dX
          return f,gradParameters
        end
        -- optimize on current mini-batch
        --print ('==> start to optim')
        if optimMethod == optim.asgd then
          _,_,average = optimMethod(feval, parameters, optimState)
        else
          optimMethod(feval, parameters, optimState)
        end
      end
    else
      print 'error at data augment flag value'
    end
  end
  --------end of local optim function--------------------------------------
  -- time taken
  time = sys.clock() - time
  time = time / trainData:size()
  print("\n==> time to learn 1 sample = " .. (time*1000) .. 'ms')
  -- print confusion matrix
  print(confusion)
  sys.sleep(1)
  -- update logger/plot
  trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid * 100}
  if opt.plot then
    trainLogger:style{['% mean class accuracy (train set)'] = '-'}
    trainLogger:plot()
  end
  -- save/log current net
  local filename = paths.concat(opt.save, 'model.net')
  os.execute('mkdir -p ' .. sys.dirname(filename))
  print('==> saving model to '..filename)
  torch.save(filename, model)
  -- next epoch
  confusion:zero()
  epoch = epoch + 1
end
In the snippet above there are two identical feval functions, and on each step train() processes just one image. opt.augment is a switch for creating many small images from the original input (3x256x256, cropped to 3x224x224, then resized to 3x112x112).
The failing line is model:backward( input, df_do:cuda() ) in the section where opt.augment == 'true'.
According to the source code of model:backward, it needs input and adjusts the results with df_do.
The same line works fine in the other branch, so why does this one fail? T_T
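A side note on the augmented branch, separate from the reported error: the slot index there is computed as p_idx = t % bangIdx, which evaluates to 0 whenever t is a multiple of bangIdx, while Torch tensors are indexed from 1. The plain-Lua sketch below shows the usual 1-based wrap-around. The helper name wrapIndex is hypothetical and not part of the posted script, and the sketch assumes opt.batchSize is 1.

```lua
-- Hypothetical helper (not in the posted script): map a 1-based global step t
-- onto a 1-based slot inside a block of blockSize samples.
local function wrapIndex(t, blockSize)
  return (t - 1) % blockSize + 1
end

local bangIdx = 2640  -- crops generated per source image, as in the post

-- Plain modulo hits 0 on the last crop of every block.
print(2640 % bangIdx)            -- 0

-- The shifted form stays inside 1..bangIdx.
print(wrapIndex(1, bangIdx))     -- 1
print(wrapIndex(2640, bangIdx))  -- 2640
print(wrapIndex(2641, bangIdx))  -- 1
```

Whether this matters in practice depends on how dataBang fills inputs, so treat it as something to check rather than a confirmed defect.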
OK, so if one feval works fine and the other fails, your dataBang function is probably not giving correctly sized inputs. Right before the model:forward line, in both locations, print the input size with print(#input). You should then see whether, in your second (augment=true) code path, the inputs are shaped wrong by dataBang (at least, that is what I suspect).
Morning~ @soumith
Well, I tried that before: the input size is 3x256x256 when augment == "false" and 3x112x112 when augment == "true". They are actually fed into different network architectures, which are listed below.
model = nn.Sequential()
if opt.model == 'convnet' then
  -- input dimensions
  if opt.augment == 'true' then
    nBands = 3
    width = 112
    height = 112
    --TODO: specify augmented cnn arch
    hidConv = {96,128,256,384,512,768,210}
    filtsize = {5,5,3,3,3,3}
    poolsize = {2,0,3,0,4,0}
    -- stage 1 : filter bank -> nonlinear -> L2 pooling
    model:add(nn.SpatialConvolutionMM(nBands, hidConv[1], filtsize[1], filtsize[1]))
    model:add(nn.ReLU())
    model:add(nn.SpatialLPPooling(hidConv[1],2,poolsize[1],poolsize[1],poolsize[1],poolsize[1]))
    -- stage 2 : filter bank -> nonlinear -> L2 pooling
    model:add(nn.SpatialConvolutionMM(hidConv[1], hidConv[2], filtsize[2], filtsize[2]))
    model:add(nn.ReLU())
    --model:add(nn.SpatialLPPooling(hidConv[2],2,poolsize[2],poolsize[2],poolsize[2],poolsize[2]))
    -- stage 3: filter bank --> nonlinear -> L2 pooling
    model:add(nn.SpatialConvolutionMM(hidConv[2], hidConv[3], filtsize[3], filtsize[3]))
    model:add(nn.ReLU())
    model:add(nn.SpatialLPPooling(hidConv[3],poolsize[3],poolsize[3],poolsize[3],poolsize[3]))
    -- stage 4: filter bank --> nonlinear --> L2 pooling
    model:add(nn.SpatialConvolutionMM(hidConv[3], hidConv[4], filtsize[4], filtsize[4]))
    model:add(nn.ReLU())
    --model:add(nn.SpatialLPPooling(hidConv[4],poolsize[4],poolsize[4],poolsize[4],poolsize[4]))
    -- stage 5: filter bank --> nonlinear -> L2 pooling
    model:add(nn.SpatialConvolutionMM(hidConv[4], hidConv[5], filtsize[5], filtsize[5]))
    model:add(nn.ReLU())
    model:add(nn.SpatialLPPooling(hidConv[5],poolsize[5],poolsize[5],poolsize[5],poolsize[5]))
    -- stage 6: filter bank --> nonlinear -> L2 pooling
    model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
    model:add(nn.ReLU())
    --model:add(nn.SpatialLPPooling(hidConv[6],poolsize[6],poolsize[6],poolsize[6],poolsize[6]))
    -- stage 6 : standard 2-layer neural network
    model:add(nn.Reshape(hidConv[6]))
    model:add(nn.Linear(hidConv[6], hidConv[7]))
    model:add(nn.Tanh())
    model:add(nn.Linear(hidConv[7], noutputs))
  else
    if opt.augment == 'false' then
      nBands = 3
      width = 256
      height = 256
      -- hidden units, filter sizes (for ConvNet only):
      hidConv = {128,256,384,512,768,768,210}
      filtsize = {5,7,5,5,3,3}
      poolsize = {2,2,2,2,2,3}
      -- stage 1 : filter bank -> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(nBands, hidConv[1], filtsize[1], filtsize[1]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[1],2,poolsize[1],poolsize[1],poolsize[1],poolsize[1]))
      -- stage 2 : filter bank -> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[1], hidConv[2], filtsize[2], filtsize[2]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[2],2,poolsize[2],poolsize[2],poolsize[2],poolsize[2]))
      -- stage 3: filter bank --> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[2], hidConv[3], filtsize[3], filtsize[3]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[3],poolsize[3],poolsize[3],poolsize[3],poolsize[3]))
      -- stage 4: filter bank --> nonlinear --> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[3], hidConv[4], filtsize[4], filtsize[4]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[4],poolsize[4],poolsize[4],poolsize[4],poolsize[4]))
      -- stage 5: filter bank --> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[4], hidConv[5], filtsize[5], filtsize[5]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[5],poolsize[5],poolsize[5],poolsize[5],poolsize[5]))
      -- stage 6: filter bank --> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[6],poolsize[6],poolsize[6],poolsize[6],poolsize[6]))
      -- stage 6 : standard 2-layer neural network
      model:add(nn.Reshape(hidConv[6]))
      model:add(nn.Linear(hidConv[6], hidConv[7]))
      model:add(nn.Tanh())
      model:add(nn.Linear(hidConv[7], noutputs))
    end
  end
end
The forward process is identical, because only one image is passed forward and back at a time. Besides, I've checked my net architecture parameters and they are consistent with my intended computation, i.e., the output is 1x1 for each feature map.
I'll check it again.
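The 1x1 claim can be checked by hand with a few lines of plain Lua. This is only a sketch under stated assumptions: every convolution is unpadded with stride 1, every pooling layer uses a square window whose stride equals its size, and a poolsize entry of 0 means the pooling layer is commented out. The helper names convOut, poolOut and traceSize are made up for illustration.

```lua
-- Unpadded convolution with stride 1: the map shrinks by k - 1.
local function convOut(size, k)
  return size - k + 1
end

-- Pooling with a p x p window and stride p.
local function poolOut(size, p)
  return math.floor((size - p) / p) + 1
end

-- Walk the spatial size through all conv/pool stages.
-- A pool size of 0 means "no pooling at this stage".
local function traceSize(size, filtsize, poolsize)
  for i = 1, #filtsize do
    size = convOut(size, filtsize[i])
    if poolsize[i] > 0 then
      size = poolOut(size, poolsize[i])
    end
  end
  return size
end

-- augment == 'true' network: 112 -> 108 -> 54 -> 50 -> 48 -> 16 -> 14 -> 12 -> 3 -> 1
print(traceSize(112, {5,5,3,3,3,3}, {2,0,3,0,4,0}))
-- augment == 'false' network: 256 -> 252 -> 126 -> ... -> 3 -> 1
print(traceSize(256, {5,7,5,5,3,3}, {2,2,2,2,2,3}))
```

Both networks end at a spatial size of 1 under these assumptions, so the width and height arithmetic is consistent. Note that this check says nothing about the number of feature maps passed between layers.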
The network looks fine. However, I am saying you should check that your dataBang function always gives out 112x112 crops. When doing random crops, you might hit a corner case somewhere. Add this line right before the forward call: print(#input). Check that every sample has exactly the same input size, and that dataBang is not sometimes returning 112x111, for example.
@soumith However, this error is raised when the very first image is backpropagated,
and the input size for that first image is
input............................. 1/4435200 ..................................] ETA: 0ms | Step: 0ms
3
112
112
[torch.LongStorage of size 3]
df_do
21
[torch.LongStorage of size 1]
using this snippet
print()
print('input')
print(#input)
--print('---> forward propagation')
local outputs = model:forward(input)
outputs = outputs:float()
---> transfer to floatTensor to calculate
-- calculate gradient matrix
local df_do = torch.Tensor(outputs:size())
print('df_do')
print(#df_do)
Maybe df_do is not strictly defined?
But I also checked input and df_do when augment == "false", and the sizes are consistent in both cases.
The size of df_do should be equal to noutputs, afaik.
Also, try replacing the LPPooling with MaxPooling and see if that works, just to be sure something funky is not going on with LPPooling.
Thanks @soumith I'll check that, it's a really strange error. Night ;-)
Genius!!! @soumith it works! Thanks again! I've been struggling with this error for almost 20 hours over 2 days.
Notes:
Using print(model), I've noticed that the nn.Power module is a part of SpatialLPPooling. Maybe I should read more of the lower-level implementation.
what was the solution?
Maybe there's something wrong with how I call SpatialLPPooling. I'll figure that out. ;-)
I changed SpatialMaxPooling back to SpatialLPPooling and got an error message like this:
==> defining some tools
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
(1): nn.SpatialConvolutionMM
(2): nn.ReLU
(3): nn.Sequential {
[input -> (1) -> (2) -> (3) -> output]
(1): nn.Square
(2): nn.SpatialSubSampling
(3): nn.Sqrt
}
(4): nn.SpatialConvolutionMM
(5): nn.ReLU
(6): nn.SpatialConvolutionMM
(7): nn.ReLU
(8): nn.Sequential {
[input -> (1) -> (2) -> (3) -> output]
(1): nn.Square
(2): nn.SpatialSubSampling
(3): nn.Sqrt
}
(9): nn.SpatialConvolutionMM
(10): nn.ReLU
(11): nn.SpatialConvolutionMM
(12): nn.ReLU
(13): nn.Sequential {
[input -> (1) -> (2) -> (3) -> output]
(1): nn.Square
(2): nn.SpatialSubSampling
(3): nn.Sqrt
}
(14): nn.SpatialConvolutionMM
(15): nn.ReLU
(16): nn.Reshape
(17): nn.Linear
(18): nn.Tanh
(19): nn.Linear
(20): nn.LogSoftMax
}
==> configuring optimizer
==> training!
==> doing epoch on training data:
==> online epoch # 1 [batchSize = 1]
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Sequential.lua:37: size mismatch
stack traceback:
[C]: in function 'updateOutput'
/usr/local/share/lua/5.1/nn/Sequential.lua:37: in function 'forward'
ucmcnn_aug_LP.lua:939: in function 'opfunc'
/usr/local/share/lua/5.1/optim/sgd.lua:40: in function 'optimMethod'
ucmcnn_aug_LP.lua:979: in function 'train'
ucmcnn_aug_LP.lua:1166: in main chunk
[C]: in function 'dofile'
/usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:109: in main chunk
[C]: at 0x00404480
Well, there is a block of code in SpatialLPPooling like
if pnorm == 2 then
  self:add(nn.Square())
else
  self:add(nn.Power(pnorm))
end
self:add(nn.SpatialSubSampling(nInputPlane, kW, kH, dW, dH))
if pnorm == 2 then
  self:add(nn.Sqrt())
else
  self:add(nn.Power(1/pnorm))
end
self:get(2).bias:zero()
self:get(2).weight:fill(1)
I think there's some rule to follow when SpatialLPPooling is called.
BTW, the nn.Power error is due to the missing pnorm argument in the later SpatialLPPooling definitions.
In short, this is something we have to dig into.
Too tired to continue, see you~
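To make the missing-pnorm remark concrete, here is a plain-Lua sketch of how positional arguments shift when one is left out. The function lpPoolingArgs is a hypothetical stand-in that only mirrors the argument order (nInputPlane, pnorm, kW, kH, dW, dH) used by the stage 1 call and by the SpatialLPPooling code quoted above. It does not build any nn module.

```lua
-- Hypothetical stand-in: records how the positional arguments are bound.
-- It mirrors the argument order only and builds no network layer.
local function lpPoolingArgs(nInputPlane, pnorm, kW, kH, dW, dH)
  return {nInputPlane = nInputPlane, pnorm = pnorm,
          kW = kW, kH = kH, dW = dW, dH = dH}
end

-- Stage 1 style call: pnorm = 2 is given explicitly.
local good = lpPoolingArgs(96, 2, 2, 2, 2, 2)
-- Stage 3 style call with poolsize = 3: pnorm is omitted, so every value
-- moves one slot to the left and the pool size is read as the norm.
local bad = lpPoolingArgs(256, 3, 3, 3, 3)

print(good.pnorm, good.dH)  -- 2    2
print(bad.pnorm, bad.dH)    -- 3    nil
```

With pnorm bound to 3 instead of 2, the quoted constructor code takes the nn.Power(pnorm) branch rather than nn.Square, which would explain why nn.Power showed up in print(model).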
The bug has been caught, a really little bug!
-- stage 6: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
should be
-- stage 6: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[5], hidConv[6], filtsize[6], filtsize[6]))
However, during these past hours, I've noticed another weird thing, and I'll open a new issue for it.
Thanks~ @soumith
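The fix above is a feature-map count mismatch between stage 5 and stage 6, and the posted hidConv tables also suggest why only one branch failed. The plain-Lua sketch below chains the declared input and output plane counts of the six convolutions exactly as written in the post, with stage 6 using hidConv[6] for both. The helpers convLayers and firstChannelMismatch are hypothetical names for illustration.

```lua
-- Conv layers as {nInputPlane, nOutputPlane}, following the posted code,
-- where stage 6 is declared with h[6] on both sides.
local function convLayers(nBands, h)
  return {
    {nBands, h[1]}, {h[1], h[2]}, {h[2], h[3]},
    {h[3], h[4]},   {h[4], h[5]}, {h[6], h[6]},
  }
end

-- Return the index of the first layer whose declared input count differs
-- from what the previous layer produces, plus both counts; nil if none.
local function firstChannelMismatch(nBands, layers)
  local prev = nBands
  for i, l in ipairs(layers) do
    if l[1] ~= prev then
      return i, prev, l[1]
    end
    prev = l[2]
  end
  return nil
end

-- augment == 'true': stage 5 outputs 512 maps, stage 6 expects 768.
print(firstChannelMismatch(3, convLayers(3, {96,128,256,384,512,768,210})))   -- 6  512  768
-- augment == 'false': hidConv[5] == hidConv[6] == 768, so the typo is masked.
print(firstChannelMismatch(3, convLayers(3, {128,256,384,512,768,768,210})))  -- nil
```

In the augment == 'false' table the fifth and sixth entries are both 768, so the same typo produces a consistent network there, which matches the observation that the identical training code worked in one branch and failed in the other.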
According to the contribution guidelines of torch, please delete this issue, because it is a personal help request that should have been posted on the mailing list (the Google group, which I often have no access to). Thanks @soumith
Hi, all~
Currently, I'm plugging nn modules together through a Sequential container. My NN script is adapted from @soumith /galaxyzoo for CUDA usage. Everything works fine; however, this error message is quite confusing. I've checked Sequential.lua (and found that model:backward(input, df_do:cuda()) is related to Module.lua and Power.lua later). There are two identical code snippets in my script; one of them works fine and the other doesn't. Can anyone help me figure this out?
BTW, some functions in Module.lua do nothing with their input parameters. Are those parameters cleared when zeroParameters() is called?
Thanks~