hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks
Mozilla Public License 2.0
865 stars 199 forks

Unknown Adadelta trainer error/issues #87

Closed merceyz closed 7 years ago

merceyz commented 7 years ago

Hello,

I was looking at the different trainers and reading some documents on them when I noticed a value called "epsilon". This value is nowhere to be seen in the API documentation, and thus I assume it's missing. (Unless it's the "anneal" option, which would be awkward for me.)

merceyz commented 7 years ago

Also, this seems to be a common "theme": it does really well for a while and then it shoots up to infinity.

[screenshots of the training output going to infinity]

(Ignore the sample difference, I'm just testing the trainers.)

hughperkins commented 7 years ago

The trainers are at https://github.com/hughperkins/DeepCL/tree/master/src/trainers , and the main is at https://github.com/hughperkins/DeepCL/blob/master/src/main/train.cpp . Where are you seeing epsilon?

As far as the training NaNs... training NaNs are a perennial problem with neural nets. There are a few possible sources, none of which are mutually exclusive; it could be a bit of all of them :-P :

There's no hard and fast rule or check to know which is which... I suppose what I would do is:

As far as 'digging a bit more', you'll almost certainly need to roll up your sleeves and get stuck into the code, so I would try the first two steps first. I think that to 'get stuck into the code', at a minimum, you'd probably want to do something like:

If it were me, I'd currently probably do this using Python. In the past, I would have done it directly in the C++.
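
For example, a rough sketch of the kind of check I mean, whether you do it from Python or C++ (the function and buffer names here are just illustrative, not actual DeepCL API):

    #include <cmath>
    #include <cstdio>

    // Scan a buffer of floats (weights or gradients you've copied back to the host)
    // and report how many NaN/Inf values it contains, plus the largest magnitude.
    void checkBuffer(const char *name, const float *data, int size) {
        int numNan = 0, numInf = 0;
        float maxAbs = 0.0f;
        for (int i = 0; i < size; i++) {
            if (std::isnan(data[i])) numNan++;
            else if (std::isinf(data[i])) numInf++;
            else if (std::fabs(data[i]) > maxAbs) maxAbs = std::fabs(data[i]);
        }
        printf("%s: %d NaN, %d Inf, max |value| = %f\n", name, numNan, numInf, maxAbs);
    }

Running something like that on the weights and gradients after each epoch (or batch) tells you which layer blows up first, and whether it blows up gradually or all at once.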

PS: one thing you could try is to assume it's a GPU kernel bug, and make everything run on the CPU, by modifying https://github.com/hughperkins/DeepCL/blob/master/src/conv/BackpropWeights.cpp#L51 to return true only for kernel 0 (i.e. the CPU kernel), and ditto for Forward and Backward. If this doesn't create NaNs, there might be a bug in one or more of the GPU kernels for the specific geometry you are using.
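
Just to illustrate the shape of that change (the function name and signature below are illustrative only, not the actual code in BackpropWeights.cpp):

    // Hypothetical sketch: whatever check at that line decides which kernel
    // implementations are eligible, change it so only index 0 (the CPU kernel) passes.
    static bool kernelEligible(int index) {
        return index == 0;   // reject all GPU kernels; everything falls back to the CPU path
    }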

hughperkins commented 7 years ago

(added a 'PS')

merceyz commented 7 years ago

The trainers are at https://github.com/hughperkins/DeepCL/tree/master/src/trainers , and the main is at https://github.com/hughperkins/DeepCL/blob/master/src/main/train.cpp . Where are you seeing epsilon?

See the eps (epsilon) parameter at https://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html

It can also be found here, just to name a few: https://keras.io/optimizers/

all is perfectly in order, but for optimal learning you need some kind of gradient truncation, normalization, or regularization

If I'm not mistaken, DeepCL already adds a normalization layer after the input layer automatically, right? Anyway, this is the config I'm currently running:

deepcl_train trainer=adadelta anneal=1e-08 rho=0.95 batchsize=128 numepochs=4000 netdef=4*(60c3z-relu-mp2)-150n-relu-150n-relu-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

The values are taken from the second link above (where the "epsilon" value can be found); I assumed the "eps" was the same as "anneal".

hughperkins commented 7 years ago

Ah. I think that epsilon is probably e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/AdadeltaState.cpp#L32 , i.e. the fuzz factor / fudge factor. It's hardcoded for now. It shouldn't affect very much, I think.

If I'm not mistaken, DeepCL already adds a normalization layer after the input layer automatically, right?

This normalizes incoming images. But I mean normalizing weights. Or gradients. Or both. Or at least truncating gradients onto the unit ball. Weight decay is fairly standard.
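
For the 'truncating gradients onto the unit ball' idea, a minimal sketch of the concept (plain C++ on a host-side float array, not DeepCL API):

    #include <cmath>

    // Rescale a gradient vector so its L2 norm is at most 1 ("project onto the unit ball").
    void clipToUnitBall(float *grad, int size) {
        float sumSquares = 0.0f;
        for (int i = 0; i < size; i++) {
            sumSquares += grad[i] * grad[i];
        }
        float norm = std::sqrt(sumSquares);
        if (norm > 1.0f) {
            float scale = 1.0f / norm;
            for (int i = 0; i < size; i++) {
                grad[i] *= scale;   // direction is preserved, magnitude is capped at 1
            }
        }
    }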

deepcl_train trainer=adadelta anneal=1e-08 rho=0.95 learningrate=1.0 batchsize=128 numepochs=4000 netdef=4*(60c3z-relu-mp2)-150n-relu-150n-relu-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

learningrate=1.0 is pretty ambitious I think, without using batch normalization, which DeepCL doesn't support. I think you will have more success with a learning rate more like 0.001 or 0.0001.

The network architecture looks reasonably standard. I'd be tempted to use tanh after the fc layers, rather than relu. You might want only two fc layers, perhaps something like -150n-tanh-2n?

hughperkins commented 7 years ago

Oh, better to make the number of feature planes a power of 2, on the whole, e.g. 64c3z. I'm not sure it will make any difference at all for DeepCL kernels, but as a general rule, GPU kernels tend to be more optimized for powers of 2 in the number of feature planes.

merceyz commented 7 years ago

Ah. I think that epsilon is probably e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/AdadeltaState.cpp#L32 , i.e. the fuzz factor / fudge factor. It's hardcoded for now. It shouldn't affect very much, I think.

Alright, that means the anneal value is not useful in this instance, right?

learningrate=1.0 is pretty ambitious I think, without using batch normalization, which DeepCL doesn't support. I think you will have more success with a learning rate more like 0.001 or 0.0001.

I just took the default value from the second link, but as I only now saw (facepalm), on the Stanford page it was a lower, perhaps more normal, number.

This normalizes incoming images. But I mean normalizing weights. Or gradients. Or both. Or at least truncating gradients onto the unit ball. Weight decay is fairly standard.

How would I achieve this kind of normalizing?

Oh, better to make the number of feature planes a power of 2, on the whole, e.g. 64c3z. I'm not sure it will make any difference at all for DeepCL kernels, but as a general rule, GPU kernels tend to be more optimized for powers of 2 in the number of feature planes.

The only difference that made was, in one instance, crashing my display driver with 1003 MB of RAM left, and in another, a BSOD.

deepcl_train trainer=adadelta rho=0.95 learningrate=0.001 batchsize=128 numepochs=4000 netdef=4*(64c3z-relu-mp2)-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

hughperkins commented 7 years ago

Alright, that means the anneal value is not useful in this instance, right?

Well, it's not related to the adadelta fudge factor, yeah. anneal basically slowly reduces the learning rate over time. It's a bit tricky to use, though. On the whole, I think a standard approach is:

I just took the default value from the second link, but as I only now saw (facepalm), on the Stanford page it was a lower, perhaps more normal, number.

ok

How would I achieve this kind of normalizing?

SGD has weight decay: https://github.com/hughperkins/DeepCL/blob/master/src/trainers/SGD.cpp#L65 . It would have to be added into each specific trainer, on a case-by-case basis.
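
Conceptually, what weight decay amounts to is roughly this (a sketch of the idea, not the actual DeepCL code):

    // Plain SGD step with L2 weight decay: each weight is pulled slightly towards zero
    // in addition to the usual gradient step.
    void sgdUpdate(float *weights, const float *gradWeights, int size,
                   float learningRate, float weightDecay) {
        for (int i = 0; i < size; i++) {
            weights[i] -= learningRate * (gradWeights[i] + weightDecay * weights[i]);
        }
    }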

The only difference that made was, in one instance, crashing my display driver with 1003 MB of RAM left, and in another, a BSOD.

Ah. Well... ok. I would guess most of the memory is going into the first few layers (since they're bigger, no mp2's yet), so you could gradually increase the number of planes, something like:

16c3z-relu-mp2-32c3z-relu-mp2-2*(64c3z-relu-mp2)

I haven't tried this, and I'm not sure if it's a good architecture; it's just demonstrating the concept of increasing the number of planes after each -mp2.
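
To see why the early layers dominate the memory, a rough back-of-envelope for the activation memory of one layer's output (ignoring weights, workspace and the backward pass; this is just a generic estimate, not DeepCL's actual accounting):

    // Rough activation memory for one layer's output, in bytes (4 bytes per float).
    // Memory scales with planes * height * width, and each -mp2 halves both height
    // and width, i.e. divides this by 4. So a 64-plane layer at full resolution costs
    // 4x what the same 64-plane layer costs after one -mp2.
    long long activationBytes(int batchSize, int planes, int height, int width) {
        return 4LL * batchSize * planes * height * width;
    }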

merceyz commented 7 years ago

SGD has weight decay: https://github.com/hughperkins/DeepCL/blob/master/src/trainers/SGD.cpp#L65 . It would have to be added into each specific trainer, on a case-by-case basis.

That would be the "l2_decay" specified on the Stanford page, correct?

I just took the default value from the second link, but as I only now saw (facepalm), on the Stanford page it was a lower, perhaps more normal, number.

Actually, I read it wrong; the Stanford page uses a learning rate of 1.0 for the adadelta trainer.

Ah. Well... ok. I would guess most of the memory is going into the first few layers (since they're bigger, no mp2's yet), so you could gradually increase the number of planes, something like:

16c3z-relu-mp2-32c3z-relu-mp2-2*(64c3z-relu-mp2)
I haven't tried this, and I'm not sure if it's a good architecture; it's just demonstrating the concept of increasing the number of planes after each -mp2.

It didn't crash my system, which is a good start, and it seems to be doing rather well.

hughperkins commented 7 years ago

That would be the "l2_decay" specified on the Stanford page, correct?

L2 decay is what you need. I'm 70% sure that what I linked to is L2, but I'd want to double-check somewhere to be sure it's not L1. (I think it's L2, because the derivative of x squared is simply x, so that's why we simply subtract some fraction of the current weight here.)
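
Spelling that reasoning out (these are the standard textbook definitions, not taken from the DeepCL source):

    % L2 penalty: gradient is proportional to the weight itself
    P_{L2}(w) = \tfrac{\lambda}{2} w^2, \qquad \frac{dP_{L2}}{dw} = \lambda w
        \;\Rightarrow\; w \leftarrow w - \eta \lambda w        % subtract a fraction of the current weight
    % L1 penalty: gradient is a constant-size step towards zero
    P_{L1}(w) = \lambda \lvert w \rvert, \qquad \frac{dP_{L1}}{dw} = \lambda \operatorname{sign}(w)
        \;\Rightarrow\; w \leftarrow w - \eta \lambda \operatorname{sign}(w)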

Actually, I read it wrong; the Stanford page uses a learning rate of 1.0 for the adadelta trainer.

Ok

It didn't crash my system, which is a good start, and it seems to be doing rather well.

cool :-)

merceyz commented 7 years ago

I just ran it like this:

deepcl_train trainer=adadelta rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

And yet again at epoch 8 it goes south

However, running this:

deepcl_train learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

seems to be fine; the loss goes up a bit at epochs 11 and 12, then back down at 13.

hughperkins commented 7 years ago

Ok. You mean, using the SGD trainer instead of the adadelta trainer?

merceyz commented 7 years ago

As SGD is the default trainer, yes. So there might be a bug somewhere in the adadelta trainer?

hughperkins commented 7 years ago

Could be... the code is at https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp . Feel free to scrutinize/tweak it. Personally, I've mostly used SGD and not used adadelta too much, so it's not impossible some buglette remains somewhere.

merceyz commented 7 years ago

I tried the adagrad trainer, which is now at epoch 27 and is constantly getting better and better: 99.5794% and a loss of 370.28.

deepcl_train trainer=adagrad rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

So I'll assume something is wrong in the adadelta trainer.

Could be... the code is at https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp . Feel free to scrutinize/tweak it. Personally, I've mostly used SGD and not used adadelta too much, so it's not impossible some buglette remains somewhere.

I sadly don't know how it's supposed to be implemented, so I can't really "proofread" it.

hughperkins commented 7 years ago

So I'll assume something is wrong in the adadelta trainer.

Ok, that's good info. I will file a bug.

edit: oh, the title of this issue, here, this thread, is adadelta error. so ... good :-)

(Note: the adadelta paper is here: www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf ; the update rule is in 'Algorithm 1'. [screenshot of Algorithm 1 from the paper])
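
For reference, writing out the update rule as I remember it from the paper (this is from memory, so double-check against Algorithm 1; g_t is the gradient, ρ the decay rate, ε the epsilon/fudge factor):

    E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2
    \Delta x_t = - \frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t
    E[\Delta x^2]_t = \rho \, E[\Delta x^2]_{t-1} + (1 - \rho) \, \Delta x_t^2
    x_{t+1} = x_t + \Delta x_t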

We'd need to compare this equation with what is written in https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp#L55-L74 . Unfortunately, it's a bit illegible, though better than it was originally. The bug could be either in this list of operations (the first thing to check), or in the underlying vector arithmetic implementations (though there are unittests for those). On the whole, I guess it's most likely that the bug is in this chunk of code, linked from this paragraph, though that's mostly a guess.

hughperkins commented 7 years ago

What the code says is:

    clWorking = clGradWeights;
    // copy the clGradWeights vector into the clWorking vector
    clWorking.squared();
    // take the per-element square of each element in clWorking, so they equal the gradient squared
    // this is probably the gt2 term in the equation above (gt is clGradWeights)
    // (writing gt squared as gt2)
    clWorking *= (1 - decay);
    // (1 - decay) is probably (1 - p)  (writing rho as p)
    // so clWorking now holds (1-p)gt2
    clSumGradSquared *= decay;
    // by comparison with equation 8 (see below),
    // it looks like clSumGradSquared holds the running average of the g2 elements over time,
    // i.e. E[g2]
    // so, from the code, we now have:
    // clSumGradSquared is:  p * E[g2]
    clSumGradSquared += clWorking;
    // now, clSumGradSquared is: p * E[g2] + (1-p)gt2
    // i.e. it looks like step 4 in the algorithm screenshot above

edit: ah, this bit is equation 8 from the paper: E[g2]_t = p E[g2]_{t-1} + (1-p) gt2, i.e. the running average of the squared gradients

edit2: next bit of code:

    clWorking = clSumGradSquared;
    // copy p * E[g2] + (1-p)gt2 into clWorking
    // so, clWorking is: p * E[g2] + (1-p)gt2
    clWorking.inv();
    // calculate 1 / clWorking, for each element, so now each element of clWorking is:
    // 1 / (p * E[g2] + (1-p)gt2)
    clWorking *= clSumUpdateSquared;
    // I guess that `update` is delta x in the equation in the screenshot, which we can
    // write maybe AX (since A looks a bit like the delta symbol, the triangle)
    // I guess that... hmmm... it seems like we are calculating equation 9 and step 5
    // equation 9 is:

Δx_t = −( RMS[Δx]_{t−1} / RMS[g]_t ) · g_t,   where RMS[v]_t = √( E[v²]_t + ε )

but... in equation 9, there is an epsilon, which is what you mentioned above... and in the code... no epsilon :-P Maybe this is the bug.

hughperkins commented 7 years ago

Maybe the code should be modified to insert the following line in between line 62 and line 63:

clWorking  += 0.0000001f;

(where 0.0000001f is epsilon)
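
In other words, the patched middle section would presumably read something like this (a sketch based on the walkthrough above, not checked against the actual file and line numbers):

    clWorking = clSumGradSquared;
    clWorking += 0.0000001f;      // epsilon: keeps the denominator away from zero before inverting
    clWorking.inv();
    clWorking *= clSumUpdateSquared;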

hughperkins commented 7 years ago

Added in 07eb0d6. We'll see if that helps. I should trigger a build, really.

hughperkins commented 7 years ago

When you get a moment, http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.2.2alpha1.zip has an updated Adadelta. We can try this version, see if it helps.

merceyz commented 7 years ago

http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.2.2alpha1.zip

I downloaded it and moved over my Data folder (images and manifest) and it is sadly getting stuck.

[screenshot of the stuck console output]

If I wait long enough and hit Ctrl+C (cancel), it outputs "Empty input file".

Its RAM usage is at 3.14 GB and CPU usage is at 0.00%.

merceyz commented 7 years ago

Any updates regarding the issue(s)?

hughperkins commented 7 years ago

I'm not sure. I think the issue was fixed. There seems to be some problem with the build process in general (I just migrated to msvc2015, a bit gratuitously), and I don't know how to diagnose/fix that. It sounds like a lot of work... I need to pluck up courage, roll up my sleeves, and look into why the new build is not working... Can you try to check whether other stuff is working on this build? Or is everything entirely broken on this build?

merceyz commented 7 years ago

The unittests seem to run fine.

hughperkins commented 7 years ago

Alright. What about running simple mnist training? Just normal sgd and so on? Is it only adadelta that is broken, or are things more broken generally? If it's just adadelta that's broken, that simplifies things, since then it's not a build issue, just some logic issue in adadelta, which should not be too hard for me to fix, hopefully, probably...

merceyz commented 7 years ago

I was about to try using another trainer on that build and noticed a bug with the loading of manifest data.

If the manifest and its data are not on the C drive, it fails to parse it; this also happens on the previous build.

Arguments:

trainer=adagrad rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-3n datadir="E:\Data" trainfile=Train.txt validatefile=Validation.txt

Error:

Something went wrong: Error reading E:\Data/Train.txt. Following line not parseable: E:\Data\0-Unknown\05d2dd67-3369-4c9c-abd3-29a0c0f83f15.jpeg 0

hughperkins commented 7 years ago

Hmmm, good spot. So, you're saying the most recent alpha build mostly works, but has a couple of bugs, e.g. it doesn't handle drives other than the C: drive for the manifest?

merceyz commented 7 years ago

I moved the data back over to the C drive and tested there; it got further, but once one path in the manifest had a space in it, it threw the same error.

hughperkins commented 7 years ago

Ok, cool. Sounds like the build is ok (i.e. it's not an msvc2015 issue, which sounds painful to debug...), just some code bug. I'll try to spin up a Windows instance and take a peek.
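
My guess is the manifest parser splits each line on whitespace, so a path containing a space gets broken in two. A more tolerant version would split at the last space instead. A rough standalone sketch of the idea (not the actual DeepCL loader code):

    #include <string>
    #include <stdexcept>

    // Parse one manifest line of the form "<image path> <label>", allowing spaces in the
    // path by splitting at the last whitespace character rather than the first.
    void parseManifestLine(const std::string &line, std::string &path, int &label) {
        std::size_t lastSpace = line.find_last_of(" \t");
        if (lastSpace == std::string::npos) {
            throw std::runtime_error("Following line not parseable: " + line);
        }
        path = line.substr(0, lastSpace);
        label = std::stoi(line.substr(lastSpace + 1));   // throws if the label isn't a number
    }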

merceyz commented 7 years ago

I now ran the same commands on the alpha build you linked, which resulted in it getting stuck.

[screenshot of the stuck console output]

hughperkins commented 7 years ago

Oh. I kind of thought you were testing on the alpha build :-P Can you find out what does/doesn't run on the alpha build? Is there anything at all working on the alpha build?

merceyz commented 7 years ago

I was testing on both the previous build and the alpha build to see what was what

The issue on both is: if one of the paths in the manifest has a space in it, it fails.

The issue on alpha is: it gets stuck; see the screenshot in the previous comment.

Is there anything at all working on the alpha build?

The unittests run fine; I haven't tested predict.

hughperkins commented 7 years ago

Understood that on alpha it gets stuck. Question: are there any training scenarios on alpha which don't get stuck? E.g. if you use sgd etc., does it still get stuck? Or only when using adadelta?

merceyz commented 7 years ago

I tested adagrad, adadelta and sgd. All got stuck at the same location. It seems to be the manifest loading, not the trainers, for this issue.

hughperkins commented 7 years ago

Ok. If there's no manifest stuff, does it work then?

hughperkins commented 7 years ago

Note that this is pending more information on your part about under what circumstances it gets stuck :-)

merceyz commented 7 years ago

Running deepcl_train datadir=c:\mnist numtrain=1280 numtest=1280 works.

Could I just add trainer=adadelta, or is sgd hardcoded for the mnist set?

hughperkins commented 7 years ago

You should be able to use any trainer. There is nothing magical about mnist particularly. Looks like the manifest loader is broken. That's going to be a lot easier to fix than an entirely broken build. I'll try to spin up a Windows box after lunch and take a peek. (If you have a moment to check that adadelta works ok-ish on mnist, at least that it doesn't get stuck, that would be very useful.)
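
For example, just adding the trainer flag to the mnist command you already used should be enough (illustrative only):

    deepcl_train datadir=c:\mnist numtrain=1280 numtest=1280 trainer=adadelta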

merceyz commented 7 years ago

The adadelta trainer got up to 99.9% accuracy on the training set and 95.78% on the test set. I let it run for 220 epochs and it didn't go to INF/IND.

I got the same result on build 8.5.1, which, when run on my data, goes to INF/IND.

hughperkins commented 7 years ago

Ok, so, just to summarize:

- manifest hangs on alpha
- manifest doesn't accept e: drive on alpha
- adadelta working ok on alpha?

merceyz commented 7 years ago

manifest hangs on alpha

Correct

manifest doesn't accept e: drive on alpha

It's the space in the path that does it; I don't think the drive letter has anything to do with it.

adadelta working ok on alpha?

Both alpha and 8.5.1 worked with adadelta on mnist. Once the manifest works, I can test it on my data set.

hughperkins commented 7 years ago

Ok. Current hypothesis: something is wrong with the build of libjpeg I'm using.

hughperkins commented 7 years ago

(switching jpeg libraries https://github.com/hughperkins/DeepCL/commit/473c11dc851d5e8d0fcd0615d1d2938e00f2d99a ... building....)

hughperkins commented 7 years ago

Haven't checked/fixed anything to do with spaces or the d: drive yet, but the deepcl-win64-v11.3.0alpha1.zip build switches jpeg libraries, and might at least no longer freeze??? http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.3.0alpha1.zip (I haven't tested it yet, but the unittests do at least run better than before, which is a good start.)

merceyz commented 7 years ago

It did not get stuck :) Loading also seems faster.

Testing adadelta now

hughperkins commented 7 years ago

cool :-)

merceyz commented 7 years ago

I noticed that if the validation set points to a file that the training set also points to, it won't get loaded for validation.

hughperkins commented 7 years ago

What do you mean by "loaded"? Like, a message on stdout?

merceyz commented 7 years ago

As a test, I made the contents of test.txt and validation.txt identical, meaning the training and test results should be identical? I assumed some files didn't load.

[screenshot of the training and testing output]

hughperkins commented 7 years ago

Ah. Well, if you look at the number of images for each (the /6696), it is the same.

The train and test accuracies are different for two reasons, in general:

Normally though, I'd expect the test accuracy to be at least as high as the train accuracy, on the whole, if the data is the same, since the weights used for test will be the most recent weights, which should be the best.