viktorheli closed this issue 7 years ago.
Link to the issue on the optim repository: https://github.com/torch/optim/issues/164
ProGamerGov, hi! I have worked with images (see my repo), but my tasks do not need big images. In that case optim worked fine, even though the network takes up 7 GB of GPU memory. I was working with convolutional networks. In this issue, however, the network is a trivial MLP, and with a big net I have a problem with optim. Yesterday I tried to train my network from the example described above with plain nn SGD. Everything worked perfectly: the epoch loss changed every iteration. We have a strange problem.
@viktorheli If we do have a similar problem, or even the same problem, then it's not your data set that's the issue. The issue for me occurred with adam and lbfgs. But I also use MSECriterion and various tensor manipulation tools. What packages are you using? Are you doing anything similar to me?
I added a bunch of debugging code to my project, which stored and printed the gradient every iteration, along with various important values in between the main gradient changes, and that let me narrow down the suspects for the issue. Have you tried doing this yet?
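For instance, something along these lines inside the feval closure makes it obvious whether the gradient is changing at all (a rough sketch with placeholder names like `model`, `criterion`, `inputs`, `targets`; adapt it to your own script):

```lua
-- Rough sketch: log loss and gradient norm every iteration.
-- 'model', 'criterion', 'inputs' and 'targets' are placeholders
-- for whatever your script already defines.
local params, gradParams = model:getParameters()
local iteration = 0

local function feval(x)
  if x ~= params then params:copy(x) end
  gradParams:zero()

  local outputs = model:forward(inputs)
  local loss = criterion:forward(outputs, targets)
  model:backward(inputs, criterion:backward(outputs, targets))

  iteration = iteration + 1
  -- If |grad| stays at 0, the optimizer has nothing to apply.
  print(string.format("iter %d  loss %.6f  |grad| %.6e",
                      iteration, loss, gradParams:norm()))
  return loss, gradParams
end
```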
Someone reports nn.Sequential giving them a gradient of 0 here: https://groups.google.com/forum/#!searchin/torch7/gradient$20zero%7Csort:date/torch7/ArJgZliOqf8/LTa9OR0wDgAJ
I thought running things backwards through the network was where the issue might be occurring, but maybe it's something else? Search engines turn up some interesting results for this issue.
Have you tried training without optim? Remove the feval loop and track your loss. Also, please format your network code correctly.
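For example, a minimal run with nn's built-in trainer, no optim and no feval, could look like this (a sketch; it assumes your dataset follows the nn.StochasticGradient convention of `dataset:size()` and `dataset[i] = {input, target}`):

```lua
require 'nn'

-- Sketch: plain nn SGD, no optim and no feval closure.
-- 'model' is your nn.Sequential; 'dataset' must provide dataset:size()
-- and dataset[i] = {input, target}, as in the nn docs.
local criterion = nn.MSECriterion()
local trainer = nn.StochasticGradient(model, criterion)
trainer.learningRate = 0.01
trainer.maxIteration = 50      -- number of epochs
trainer:train(dataset)         -- prints the current error every epoch
```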
tastyminerals, hi! Yes, I tried training without optim, using basic nn SGD as in the example from the nn docs. The network trains with default parameters and everything works perfectly.
> Also, please format your network code correctly

Sorry, I'm not a professional programmer. :( Please, can you give me an example of correctly formatted network code? Thanks.
@viktorheli edit your post like this:

Problem description

```lua
NETWORK CODE GOES HERE
```
Well, with a learning rate of 100 your network definitely will not learn. The loss won't decrease if the number of layers is too high, either. You can't just add more layers and assume that, given any kind of data, the network will improve or start learning something. For any network you always search for the optimal hyperparameters by experimenting, unless you know them beforehand. I think you're messing up your hyperparameters by setting them extremely high.
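As a rough illustration, a more conventional starting point for optim.sgd would be something like this (illustrative values only, not a prescription; `feval` and `params` come from your own script):

```lua
require 'optim'

-- Illustrative hyperparameters only; tune them on your own data.
-- 'feval' and 'params' are the closure and flattened parameters
-- your script already builds.
local optimState = {
  learningRate = 0.01,        -- orders of magnitude below 100
  learningRateDecay = 1e-4,
  weightDecay = 1e-5,
  momentum = 0.9,
}
optim.sgd(feval, params, optimState)
```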
tastyminerals, thanks for helping me with formatting the code. It looks better now. :)
Yes, I have experimented with different networks. But I have a small amount of data, and if I use a small network I get a very big error. If I increase the network size, the error decreases; in my experiments a bigger network works better. Also, as you can see, I have a problem training the network with optim, but no problem with plain SGD in nn for the same network. This is strange to me. :(
Never train on a small dataset unless you're creating or debugging your network. Your network will quickly overfit on a small dataset with more layers. Adding even more layers or setting the learning rate too high might be the reason why your loss does not change. There is nothing wrong with optim; just set your hyperparameters correctly and get more data.
OK. But in both cases, small dataset and big dataset, the network should learn, shouldn't it? Or not? In my case I don't see any signs of learning. But I will keep experimenting with the network size, because I have no way to increase the dataset. I'm very interested in investigating this issue.
I made several tests. The network is learning, but with the default settings very, very slowly. The problem is not in optim. Probably I have a small dataset: about 150-700 samples in total. With an increased learning rate the network can be trained. Thank you all for your help.
Hi! I'm sorry, but I wrote this issue in the optim repo and got no answer. :( I have the following problem:
I am trying to train a network for a regression task with optim.sgd, but I see a strange thing. If I add more than 12-16 layers to my MLP, optim does not change the weights and the network does not learn. The network begins learning if I decrease the number of layers. Strangely, though, a network with 16 layers starts learning with a learning rate of 2 or above, while a network with 24 layers does not learn even with a learning rate of 100 or above.
This behavior of optim is very strange to me, but maybe I am misunderstanding something simple.
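Roughly, my training step follows the standard optim.sgd recipe (a simplified sketch, not the exact code from my script; `mlp`, `criterion`, `input`, `target` and `opt` are placeholders):

```lua
require 'nn'
require 'optim'

-- Simplified sketch of the training loop; names are placeholders,
-- not the exact code from the script.
local params, gradParams = mlp:getParameters()
local optimState = { learningRate = opt.learningrate }

for iteration = 1, opt.train do
  local function feval(x)
    if x ~= params then params:copy(x) end
    gradParams:zero()
    local output = mlp:forward(input)
    local loss = criterion:forward(output, target)
    mlp:backward(input, criterion:backward(output, target))
    return loss, gradParams
  end

  local _, fs = optim.sgd(feval, params, optimState)
  if iteration % 20 == 0 then
    print("Number of iteration: " .. iteration .. " Epochloss: " .. fs[1])
  end
end
```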
This MLP does not learn (the epoch loss does not change during training), maybe because of a bug in optim:
This MLP learns with a learning rate of 2:
My program code:
For example: `th simple-bug.lua -valid 20 -train 100000 -learningrate 2 -progress no`

```
Creating net for traning
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> output]
  (1): nn.Linear(28 -> 56)
  (2): nn.Sigmoid
  (3): nn.Linear(56 -> 58)
  (4): nn.Sigmoid
  (5): nn.Linear(58 -> 112)
  (6): nn.Sigmoid
  (7): nn.Linear(112 -> 114)
  (8): nn.Sigmoid
  (9): nn.Linear(114 -> 224)
  (10): nn.Sigmoid
  (11): nn.Linear(224 -> 226)
  (12): nn.Sigmoid
  (13): nn.Linear(226 -> 448)
  (14): nn.Sigmoid
  (15): nn.Linear(448 -> 450)
  (16): nn.Sigmoid
  (17): nn.Linear(450 -> 224)
  (18): nn.Sigmoid
  (19): nn.Linear(224 -> 112)
  (20): nn.Sigmoid
  (21): nn.Linear(112 -> 56)
  (22): nn.Sigmoid
  (23): nn.Linear(56 -> 7)
  (24): nn.Tanh
}
Number of iteration: 20 Epochloss: -0.83493030276792
Number of iteration: 40 Epochloss: -0.83493030276792
Number of iteration: 60 Epochloss: -0.83493030276792
Number of iteration: 80 Epochloss: -0.83493030276792
Number of iteration: 100 Epochloss: -0.83493030276792
Number of iteration: 120 Epochloss: -0.83493030276792
Number of iteration: 140 Epochloss: -0.83493030276792
Number of iteration: 160 Epochloss: -0.83493030276792
Number of iteration: 180 Epochloss: -0.83493030276792
Number of iteration: 200 Epochloss: -0.83493030276792
```
As you can see, the epoch loss does not change at all.
I tried waiting 24 hours, but the result does not change. If I reduce the network size, it starts to learn.
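For reference, the architecture printed above corresponds to roughly the following definition (reconstructed from the print output; it may differ cosmetically from the code in simple-bug.lua):

```lua
require 'nn'

-- Reconstructed from the printed nn.Sequential above; may differ
-- cosmetically from the original script.
local sizes = {28, 56, 58, 112, 114, 224, 226, 448, 450, 224, 112, 56, 7}
local mlp = nn.Sequential()
for i = 1, #sizes - 1 do
  mlp:add(nn.Linear(sizes[i], sizes[i + 1]))
  if i < #sizes - 1 then
    mlp:add(nn.Sigmoid())
  else
    mlp:add(nn.Tanh())   -- final activation, as in the printout
  end
end
print(mlp)
```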
Dataset for test: https://www.dropbox.com/s/deom263k4zk14ur/simple-bug-dataset.t7?dl=0
Many thanks for your help.