torch / torch7

http://torch.ch

Optim does not update weights on big MLP network #1092

Closed viktorheli closed 7 years ago

viktorheli commented 7 years ago

Hi! Sorry, I already filed this issue in the optim repo but got no answer. :( I have the following problem:

I am trying to train a network for a regression task with optim.sgd, but I see something strange. If my MLP has more than 12-16 layers, optim does not change the weights and the network does not learn. The network starts learning if I decrease the number of layers. Oddly, the 16-layer network starts learning with a learning rate of 2 or above, while the 24-layer network does not learn even with a learning rate of 100 or above.

This behavior of optim is very strange to me, but maybe I am missing something simple.

This MLP does not learn (the epoch loss never changes during training), maybe because of a bug in optim:

    mlp = nn.Sequential()
    mlp:add(nn.Linear(28, 56))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(56, 58))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(58, 112))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(112, 114))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(114, 224))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(224, 226))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(226, 448))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(448, 450))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(450, 224))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(224, 112))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(112, 56))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(56, 7))
    mlp:add(nn.Tanh())

This MLP learns with a learning rate of 2:

    mlp = nn.Sequential()
    mlp:add(nn.Linear(28, 56))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(56, 58))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(58, 112))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(112, 224))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(224, 224))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(224, 112))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(112, 56))
    mlp:add(nn.Sigmoid())
    mlp:add(nn.Linear(56, 7))
    mlp:add(nn.Tanh())

My program code:

    require ('torchx')
    require ('paths')
    cjson = require 'cjson'
    require 'io'
    require 'nn'
    require 'optim'
    require 'cunn'
    require 'cutorch'

    torch.setdefaulttensortype('torch.FloatTensor')

    -- command line arguments
    cmd = torch.CmdLine()
    cmd:text()
    cmd:text('Training neural networks. By default trains a neural network for 24-hour prediction')
    cmd:text('Example:')
    cmd:text('$> th pattern-make-train-optim.lua -dataset "path to dataset" -storenet "path to store your net" -saveevery 1000')
    cmd:text('All options:')
    cmd:option('-dataset', 'simple-bug-dataset.t7', 'Path to load dataset')
    cmd:option('-storenet', 'simple-bug.dat', 'Path for saving or loading the neural net')
    cmd:option('-train', '2000', 'Number of training iterations')
    cmd:option('-learningrate', '0.01', 'Learning rate for the SGD algorithm')
    cmd:option('-saveevery', '100000', 'Save a temporary net every N epochs')
    cmd:option('-valid', '200', 'Validate on the dataset every N epochs and display min, max and average error')
    cmd:option('-progress', 'yes', 'Display xlua progress bar: "yes" or "no"')
    cmd:option('-momentum', '0', 'Momentum for SGD')
    opt = cmd:parse(arg or {})

    -- Calculate the error on the dataset (not a real validation)
    function validation()
        dsize = dataset.inputs:size(1)
        errormatrix = {}

        for i = 1, dsize / 10 do
            permutation = torch.random(dsize)

            fwd = mlp:forward(dataset.inputs[permutation])
            predict = fwd
            real = dataset.outputs[permutation]
            errorpercent = math.abs(((predict[1] / real[1]) - 1) * 100)

            table.insert(errormatrix, errorpercent)
        end

        min = torch.min(torch.Tensor(errormatrix))
        max = torch.max(torch.Tensor(errormatrix))
        mean = torch.mean(torch.Tensor(errormatrix))

        print("\n".."Min error, %: "..min.."\n".."Max error, %: "..max.."\n".."Average error, %: "..mean.."\n")
    end

    if paths.filep(opt.storenet) then
        print("Loading net file: "..opt.storenet)
        mlp = torch.load(opt.storenet)
    else
        print("Creating net for training")

        -- This MLP does not learn, maybe because of a bug in optim
        mlp = nn.Sequential()
        mlp:add(nn.Linear(28, 56))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(56, 58))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(58, 112))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(112, 114))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(114, 224))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(224, 226))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(226, 448))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(448, 450))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(450, 224))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(224, 112))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(112, 56))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(56, 7))
        mlp:add(nn.Tanh())

        --[[
        -- This MLP learns with a learning rate of 2
        mlp = nn.Sequential()
        mlp:add(nn.Linear(28, 56))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(56, 58))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(58, 112))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(112, 224))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(224, 224))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(224, 112))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(112, 56))
        mlp:add(nn.Sigmoid())
        mlp:add(nn.Linear(56, 7))
        mlp:add(nn.Tanh())
        --]]

        print(mlp)
    end -- end of the if/else that builds or loads mlp

    dataset = torch.load(opt.dataset)

    criterion = nn.MSECriterion()
    params, gradParams = mlp:getParameters()
    optimState = {learningRate = opt.learningrate, momentum = opt.momentum}

    for epoch = 1, opt.train do
        if opt.progress == "yes" then
            xlua.progress(epoch, opt.train)
        end

        function feval(params)
            gradParams:zero()
            outputs = mlp:forward(dataset.inputs)
            loss = criterion:forward(outputs, dataset.outputs)
            dloss_doutputs = criterion:backward(outputs, dataset.outputs)
            mlp:backward(dataset.inputs, dloss_doutputs)
            return loss, gradParams
        end

        fs = optim.sgd(feval, params, optimState)

        if epoch % opt.saveevery == 0 then
            print("Number of iteration: "..epoch)
            print("Saving temporary model to: "..opt.storenet.."temporal")
            torch.save(opt.storenet.."temporal", mlp)
        end

        if epoch % opt.valid == 0 then
            -- validation()
            epochloss = fs[1] / dataset.outputs:size(1)
            print("\n"..epochloss*1000)
        end
    end

    print("Saving model to: "..opt.storenet)
    torch.save(opt.storenet, mlp)

For example:

    th simple-bug.lua -valid 20 -train 100000 -learningrate 2 -progress no
    Creating net for training
    nn.Sequential {
      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> output]
      (1): nn.Linear(28 -> 56)
      (2): nn.Sigmoid
      (3): nn.Linear(56 -> 58)
      (4): nn.Sigmoid
      (5): nn.Linear(58 -> 112)
      (6): nn.Sigmoid
      (7): nn.Linear(112 -> 114)
      (8): nn.Sigmoid
      (9): nn.Linear(114 -> 224)
      (10): nn.Sigmoid
      (11): nn.Linear(224 -> 226)
      (12): nn.Sigmoid
      (13): nn.Linear(226 -> 448)
      (14): nn.Sigmoid
      (15): nn.Linear(448 -> 450)
      (16): nn.Sigmoid
      (17): nn.Linear(450 -> 224)
      (18): nn.Sigmoid
      (19): nn.Linear(224 -> 112)
      (20): nn.Sigmoid
      (21): nn.Linear(112 -> 56)
      (22): nn.Sigmoid
      (23): nn.Linear(56 -> 7)
      (24): nn.Tanh
    }

Number of iteration: 20 Epochloss: -0.83493030276792

Number of iteration: 40 Epochloss: -0.83493030276792

Number of iteration: 60 Epochloss: -0.83493030276792

Number of iteration: 80 Epochloss: -0.83493030276792

Number of iteration: 100 Epochloss: -0.83493030276792

Number of iteration: 120 Epochloss: -0.83493030276792

Number of iteration: 140 Epochloss: -0.83493030276792

Number of iteration: 160 Epochloss: -0.83493030276792

Number of iteration: 180 Epochloss: -0.83493030276792

Number of iteration: 200 Epochloss: -0.83493030276792

As you can see, the epoch loss does not change at all.

I tried waiting 24 hours, but the result does not change. If I make the network smaller, it starts to learn.
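
Side note on the logging itself (an observation about the script above, not something discussed in the thread): optim.sgd returns two values, the updated parameter vector and a table whose first entry is the loss computed by feval. Capturing only the first return value, as `fs = optim.sgd(...)` does, means fs[1] is the first weight of the network rather than the loss. A minimal sketch of the conventional call:

    -- optim.sgd returns (updated params, {loss}); keep the second value for logging
    local newParams, fs = optim.sgd(feval, params, optimState)
    print("loss: " .. fs[1])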

Dataset for test: https://www.dropbox.com/s/deom263k4zk14ur/simple-bug-dataset.t7?dl=0

Many thanks for any help.

ProGamerGov commented 7 years ago

Link to the issue on the optim repository: https://github.com/torch/optim/issues/164

viktorheli commented 7 years ago

ProGamerGov, hi! I have worked with images (see my repo), but my tasks do not need big images. In that case optim worked fine, even though the network took up 7 GB of GPU memory; those were convolutional networks. The network in this issue is a trivial MLP, and it is only on the big version that I have a problem with optim. Yesterday I tried to train the network from the example above with plain nn SGD, and everything worked perfectly: the epoch loss changed every iteration. We have a strange problem.

ProGamerGov commented 7 years ago

@viktorheli If we do have a similar or even the same problem, then it's not your data set that's the issue. For me the issue occurred with adam and lbfgs, but I also use MSECriterion and the various tensor manipulation tools. What packages are you using? Are you doing anything similar to me?

I added a bunch of debugging code to my project that stored and printed the gradient every iteration, plus various important variables in between the main gradient changes, and that let me narrow down the suspects for the issue. Have you tried doing this yet?
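
For example, something along these lines (a minimal sketch, assuming the epoch, loss, and gradParams variables from the script above):

    -- Inside the training loop, right after the optim.sgd call:
    -- if the gradient norm sits at (or extremely close to) zero,
    -- the SGD update will be invisible in the weights.
    print(string.format("iter %d  loss %.6f  |grad| %.6e",
                        epoch, loss, gradParams:norm()))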

ProGamerGov commented 7 years ago

Someone talks about nn.Sequential giving them a gradient of 0 here: https://groups.google.com/forum/#!searchin/torch7/gradient$20zero%7Csort:date/torch7/ArJgZliOqf8/LTa9OR0wDgAJ

I thought the backward pass through the network was where the issue might be occurring, but maybe it's something else? Search engines turn up some interesting results for this issue.

tastyminerals commented 7 years ago

Have you tried training without optim? Remove the feval loop and track your loss. Also, please format your network code correctly.
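
For reference, training without optim could look roughly like this (a sketch assuming the mlp, criterion, and dataset from the script above, with an assumed learning rate):

    -- Plain nn-style SGD: forward, backward, then updateParameters.
    local lr = 0.01  -- assumed value, tune as needed
    for epoch = 1, 2000 do
        mlp:zeroGradParameters()
        local outputs = mlp:forward(dataset.inputs)
        local loss = criterion:forward(outputs, dataset.outputs)
        local dloss = criterion:backward(outputs, dataset.outputs)
        mlp:backward(dataset.inputs, dloss)
        mlp:updateParameters(lr)  -- w := w - lr * dL/dw
        if epoch % 200 == 0 then print(epoch, loss) end
    end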

viktorheli commented 7 years ago

tastyminerals, hi! Yes, I tried training without optim, using nn's basic SGD as in the example from the nn docs. The network trains with the default parameters and everything works perfectly.

Regarding "please format your network code correctly": sorry, I'm not a professional programmer. :( Could you please give me an example of correctly formatted network code? Thanks.

tastyminerals commented 7 years ago

@viktorheli edit your post like this:

    Problem description

    ```lua
    NETWORK CODE GOES HERE
    ```

tastyminerals commented 7 years ago

Well, with a learning rate of 100 your network definitely will not learn. The loss won't decrease if the number of layers is too high, either. You can't just add more layers and expect that, given any kind of data, the network will improve or start learning something. For any network you have to search by experimenting for the hyperparameters with which it can be successfully trained, unless you know them beforehand. I think you're sabotaging yourself by setting your hyperparameters extremely high.
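
One way to check whether the gradient actually reaches the early layers of such a deep stack (a sketch, assuming the mlp, criterion, and dataset from the script above): print the gradient norm of each Linear layer after a single forward/backward pass; if the norms of the first layers are essentially zero, no learning rate will make them move visibly.

    -- Per-layer gradient magnitudes after one forward/backward pass.
    mlp:zeroGradParameters()
    local outputs = mlp:forward(dataset.inputs)
    criterion:forward(outputs, dataset.outputs)
    mlp:backward(dataset.inputs, criterion:backward(outputs, dataset.outputs))
    for i, m in ipairs(mlp.modules) do
        if torch.type(m) == 'nn.Linear' then
            print(i, m.gradWeight:norm())
        end
    end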

viktorheli commented 7 years ago

tastyminerals, thanks for helping me with formatting the code. It looks better now. :)

Yes, I have experimented with different networks. But I have a small amount of data, and if I use a small network I get a very big error. If I increase the network size, the error decreases; in my experiments the bigger network works better. Also, as you can see, I have a problem training the network with optim, but not with simple SGD training in nn for the same network. That is what seems strange to me. :(

tastyminerals commented 7 years ago

Never train on a small dataset unless you're creating or debugging your network. With more layers your network will quickly overfit a small dataset. Adding even more layers or setting the learning rate too high might be the reason why your loss does not change. There is nothing wrong with optim; just set your hyperparameters correctly and get more data.

viktorheli commented 7 years ago

OK. But in both cases, small dataset and big dataset, the network should learn, shouldn't it? Or not? In my case I don't see any signs of learning. I will keep experimenting with the network size, because I have no way to increase the dataset. I'm very interested in investigating this issue.

viktorheli commented 7 years ago

I made several tests. The network is learning, but with the default settings very, very slowly. The problem is not in optim; my dataset is probably just too small, about 150-700 samples in total. With an increased learning rate the network can be trained. Thank you all for your help.