Closed: PetrToman closed this issue 12 years ago.
I tried to reproduce this with Backprop, and am not seeing anything. By "frozen", are you saying the iteration count is stuck as well? I trained the iris data set with Backprop at a very low learning rate, so it would take a while. Everything looked fine.
I also tried a random data set (from the workbench) with 9 inputs and 1 output, and then trained it with RPROP. It will pretty much chew on that forever. It trained down to 32% and continued to make marginal improvements from there, but I did not see a freeze.
Here's my streamlined data: http://dione.zcu.cz/~toman40/encog/data1.zip
RProp gets stuck at 169.86% no matter how many more iterations run. (Using Workbench 3.0.1, it goes down to 4.47% in just 112 iterations.)
Strange! But yes, I can reproduce what you are describing. Also trains quite well in 3.0.1. I will take a look, thanks!
Ah hah! I think I figured out what it is. See my note above. I will also create a unit test to cover saving larger-format neural networks. This is also why my earlier test looked okay.
The following unit test, which I just checked in, demonstrates the issue. It will turn the build status yellow until I fix it, but the code below clearly should work; it already works on smaller networks.
public void testPersistLargeEG()
{
    // Build a 200-200-200-200 network, large enough that its flat weight
    // array spans multiple lines in the saved EG file.
    BasicNetwork network = new BasicNetwork();
    network.addLayer(new BasicLayer(null, true, 200));
    network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 200));
    network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 200));
    network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 200));
    network.getStructure().finalizeStructure();
    network.reset();

    // Save the network and load it back (EG_FILENAME is defined elsewhere in the test class).
    EncogDirectoryPersistence.saveObject(EG_FILENAME, network);
    BasicNetwork network2 = (BasicNetwork) EncogDirectoryPersistence.loadObject(EG_FILENAME);

    // The reloaded weights should match the originals almost exactly.
    double d = EngineArray.euclideanDistance(network.getStructure().getFlat().getWeights(),
            network2.getStructure().getFlat().getWeights());
    Assert.assertTrue(d < 0.01);
}
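For reference, EngineArray.euclideanDistance just measures how far apart the two flat weight arrays are, element by element. A minimal standalone version of that check (not Encog's implementation, only the math it performs) would look roughly like this:

    // Square root of the summed squared differences between corresponding weights.
    public static double euclideanDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

If most of the reloaded weights come back as zeros, this distance becomes large and the assertion above fails.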
That is not good! Encog mainline is pretty much useless with any neural network large enough to produce a multi-line weight matrix. I added that code not too long ago. I will take a look! Thanks to you both for all the info.
kk, all yours!
Okay, Seema, I checked in a fix for your unit test. All is green again. If there is an SVM issue, I believe it may be a different issue. I am assigning back to you for verification of the SVM side.
Okay, I believe this is resolved. I was able to create a neural network (119->200->TANH->1->TANH) with the data you provided and get it to converge in a few hundred iterations with RPROP. Not every set of random starting values does as well; some converge to a local minimum. I also ran an SVM search. It took a while longer, and you often don't see any updates for a range of iterations, as it is simply not finding anything better. But after around 100 iterations, the SVM error was below 100%.
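For anyone following along, here is a rough sketch of what that experiment looks like with the Encog 3 API. The layer sizes follow the 119->200->1 description above; the error threshold and iteration limit are placeholders, and the training set is assumed to be the normalized data already loaded into an MLDataSet:

    // Build a 119-200-1 network with TANH activations on the hidden and
    // output layers, then train it with RPROP until the error is low.
    static void trainWithRPROP(MLDataSet training) {
        BasicNetwork network = new BasicNetwork();
        network.addLayer(new BasicLayer(null, true, 119));
        network.addLayer(new BasicLayer(new ActivationTANH(), true, 200));
        network.addLayer(new BasicLayer(new ActivationTANH(), false, 1));
        network.getStructure().finalizeStructure();
        network.reset(); // random starting weights; some seeds converge better than others

        ResilientPropagation train = new ResilientPropagation(network, training);
        int iteration = 0;
        do {
            train.iteration();
            iteration++;
            System.out.println("Iteration " + iteration + ", error " + train.getError());
        } while (train.getError() > 0.05 && iteration < 1000); // placeholder stopping criteria
    }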
Did you try this in the Workbench using the Analyst? I updated the sources and even added debug output to EncogReadHelper, right before the line
double[] t = NumberList.fromList(CSVFormat.EG_FORMAT, line);
to be sure I was using the version that Jeff fixed, but I still cannot get RProp to work...
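For context, debug output at that point would just be a print before the parse call, something along these lines (purely illustrative, not necessarily the exact statement used):

    // Illustrative only: print the raw line before it is parsed, to confirm
    // which version of the reader is running and what it actually receives.
    System.out.println("EncogReadHelper line (" + line.length() + " chars): " + line);
    double[] t = NumberList.fromList(CSVFormat.EG_FORMAT, line);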
Okay, I had not tried it in the Analyst. When those values are normalized, which they should be, I do get the same result. I did some digging and I believe it to be the weight initialization, NWR (Nguyen-Widrow). There were some changes to that after 3.0.1. I created issue #41 to look at this, since what Jeff fixed here was a bug in its own right.
Interesting that you got it to work so easily. I can make it converge only if I use multiple methods and run the training again - see this video: http://dione.zcu.cz/~toman40/encog/encog_training_bug.zip
The main reason I think it is the weight init is this: if I use the 3.0.1 workbench, use the Analyst, and train a neural network, it converges just fine, as you reported. I then take the .eg file (the neural network) and the .egb file (the normalized training data), copy them into a new project, and fire up 3.1. If I train with just those two files (outside of the Analyst), it converges quite quickly, to exactly the error the 3.0.1 Analyst reached. YET, if I take 3.1, randomize the EG file, and retrain, it does terribly and quickly converges to a local minimum, and a very high local minimum at that.
We really need some better graphics on the trainer so that you can actually see why a training run has stalled, i.e. hidden neurons shutting down, all of the gradients going to zero (local minimum), some hybrid of the two, etc. But that is another point. In this case it is a pure local minimum it gets stuck on.
If you randomize a neural network with NWR (the default for the Analyst) in 3.0.1 and in 3.1 and look at a weight histogram, they are VERY different. Plus I can tell just by looking at it that the NWR logic is flawed; it does not touch every weight. So this is where I am going to look next. At the very least, you are causing me to find other issues on the way to what you are experiencing.
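For anyone who wants to make the same comparison, here is a rough sketch of how the flat weight array can be bucketed into a crude text histogram. This is only a diagnostic idea, not code from Encog or the Workbench:

    // Bucket a network's flat weight array into a simple text histogram so two
    // randomizations (e.g. 3.0.1 NWR vs 3.1 NWR) can be compared by eye.
    public static void printWeightHistogram(BasicNetwork network, int bins) {
        double[] w = network.getStructure().getFlat().getWeights();
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : w) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        int[] counts = new int[bins];
        double width = (max - min) / bins;
        for (double v : w) {
            int bin = Math.min(bins - 1, (int) ((v - min) / width));
            counts[bin]++;
        }
        for (int i = 0; i < bins; i++) {
            System.out.printf("%8.3f .. %8.3f : %d%n", min + i * width, min + (i + 1) * width, counts[i]);
        }
    }

A randomizer that never touches part of the weight array shows up immediately as one oversized bin.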
Training visualisation is a great idea - would you like to create an issue for it?
I would also suggest adding integration tests with (reasonably) large data, perhaps one for each training method. They would take a couple of seconds during the build, but they should prevent breaking Encog's main functionality, as happened in this case. (You can use my data, if you like.)
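Something along these lines could work; this is only a sketch, and the layer sizes, synthetic data, and error threshold are placeholders rather than a ready-made test:

    // Integration-style test idea: train a larger network on synthetic data and
    // assert that RPROP actually reduces the error, so a persistence or
    // initialization regression like this one would fail the build.
    public void testRPROPConvergesOnLargerNetwork() {
        Random rnd = new Random(42);
        double[][] input = new double[500][50];
        double[][] ideal = new double[500][1];
        for (int i = 0; i < 500; i++) {
            for (int j = 0; j < 50; j++) {
                input[i][j] = rnd.nextDouble();
            }
            ideal[i][0] = (input[i][0] + input[i][1]) / 2; // simple learnable target
        }
        MLDataSet training = new BasicMLDataSet(input, ideal);

        BasicNetwork network = new BasicNetwork();
        network.addLayer(new BasicLayer(null, true, 50));
        network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 100));
        network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 1));
        network.getStructure().finalizeStructure();
        network.reset();

        ResilientPropagation train = new ResilientPropagation(network, training);
        double initialError = 0;
        for (int i = 0; i < 200; i++) {
            train.iteration();
            if (i == 0) {
                initialError = train.getError();
            }
        }
        Assert.assertTrue(train.getError() < initialError * 0.5); // placeholder threshold
    }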
Agreed on both points. The unit tests could definitely use a more advanced test case for larger neural networks.
Thanks! Yes I will add that data set.
Just a reminder: how about that 'new training visualisation' issue?
Sure, added issue #58.
EG files store the weight array slightly differently for large-format networks, so that the weights are not all on a single ginormous line that can't be read into memory. Training is failing because these networks are not being loaded or saved correctly, and the end result is an array of zeros for most of the weight matrix. Such a neural network is not trainable.
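To illustrate the idea (this is not Encog's actual persistence code, and the chunk size and separator are invented for the example): when the flat weight array is large, the writer splits it across several lines and the reader has to concatenate the chunks back together. A rough sketch:

    // Write the weights in fixed-size chunks, one chunk per line, so that no
    // single line becomes too large to handle.
    static void writeWeights(PrintWriter out, double[] weights, int perLine) {
        for (int i = 0; i < weights.length; i += perLine) {
            StringBuilder line = new StringBuilder();
            int end = Math.min(weights.length, i + perLine);
            for (int j = i; j < end; j++) {
                if (j > i) {
                    line.append(',');
                }
                line.append(weights[j]);
            }
            out.println(line);
        }
    }

    // Read the chunks back and rebuild the full array. The failure described in
    // this issue was of this general shape: if the reader stops after the first
    // chunk, the rest of the array keeps its default value of zero.
    static double[] readWeights(BufferedReader in, int totalCount) throws IOException {
        double[] weights = new double[totalCount];
        int idx = 0;
        String line;
        while (idx < totalCount && (line = in.readLine()) != null) {
            for (String token : line.split(",")) {
                weights[idx++] = Double.parseDouble(token);
            }
        }
        return weights;
    }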
--- From the original report ---
It seems that error reporting is broken in the latest Workbench (built from the git sources) - at least for RProp and SVMSearch - "Current Error" just hangs after a couple of iterations (and the chart is also frozen, if displayed). Interestingly, it works with QProp, for example. (I have no problems with Workbench 3.0.1, using the same data.)