Closed merceyz closed 7 years ago
Also, this seems to be a common "theme": it does really well for a while and then it shoots up to infinity.
(Ignore the sample difference; I'm just testing the trainers.)
The loaders are at https://github.com/hughperkins/DeepCL/tree/master/src/trainers , and the main is at https://github.com/hughperkins/DeepCL/blob/master/src/main/train.cpp . Where are you seeing epsilon?
As far as the training NaNs go... NaNs are a perennial problem with neural nets. There are a few possible sources, none of which are mutually exclusive; it could be a bit of all of them :-P :
There's no hard and fast rule or check to know which is which... I suppose what I would do is:
As far as 'digging a bit more', you'll almost certainly need to roll up your sleeves and get stuck into the code, so I would try the first two steps first. I think that to 'get stuck into the code', at minimum, you'd probably want to do something like:
If it were me, I'd currently probably do this using Python. In the past, I would have done it directly in the C++.
PS one thing you could try is: assume it's a GPU kernel bug, so make everything run on the CPU, by modifying https://github.com/hughperkins/DeepCL/blob/master/src/conv/BackpropWeights.cpp#L51 to return true only for kernel 0 (i.e. the CPU kernel), and ditto for Forward, and Backward. If this doesn't create NaNs, there might be a bug in one or more of the GPU kernels, for the specific geometry you are using.
(added a 'PS')
The loaders are at https://github.com/hughperkins/DeepCL/tree/master/src/trainers , and the main is at https://github.com/hughperkins/DeepCL/blob/master/src/main/train.cpp . Where are you seeing epsilon?
See the eps (epsilon) parameter: https://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html
It can also be found here, just to name a few: https://keras.io/optimizers/
all is perfectly in order, but for optimal learning you need some kind of gradient truncation, normalization, or regularization
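To make "gradient truncation" concrete, here is a minimal NumPy sketch of clipping a gradient onto a ball of bounded radius. This is illustrative only, not DeepCL code; the function name and threshold are my own.

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Truncate a gradient onto the ball of radius max_norm.

    Illustrative sketch: if the L2 norm of grad exceeds max_norm,
    rescale it so its norm equals max_norm; otherwise return it
    unchanged. This bounds the step size and helps avoid the
    exploding updates that can produce NaNs.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad
```

Applied just before the weight update, this caps how far any single batch can move the weights.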
If I'm not mistaken, DeepCL already adds a normalization layer after the input layer automatically, right?
Anyway, this is the config I'm currently running:
deepcl_train trainer=adadelta anneal=1e-08 rho=0.95 batchsize=128 numepochs=4000 netdef=4*(60c3z-relu-mp2)-150n-relu-150n-relu-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
Values are gathered from the second link about where to find the "epsilon" value; I assumed "eps" was the same as "anneal".
Ah. I think that epsilon is probably e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/AdadeltaState.cpp#L32 , i.e. the fuzz factor/fudge factor. It's hardcoded for now. It shouldn't affect very much, I think.
If I'm not mistaken, DeepCL already adds a normalization layer after the input layer automatically, right?
This normalizes incoming images. But I mean normalizing: weights. Or gradients. Or both. Or at least truncating gradients onto the unit ball. Weight decay is fairly standard.
deepcl_train trainer=adadelta anneal=1e-08 rho=0.95 learningrate=1.0 batchsize=128 numepochs=4000 netdef=4*(60c3z-relu-mp2)-150n-relu-150n-relu-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
learningrate=1.0 is pretty ambitious I think, without using batch normalization, which DeepCL doesn't support. I think you will have more success with a learning rate more like 0.001 or 0.0001.
Network architecture looks reasonably standard. I'd be tempted to use tanh after the fc layers, rather than relu. Might want only two fc layers, like maybe -150n-tanh-2n perhaps?
oh, better to make the number of feature planes a power of 2, on the whole, e.g. 64c3z. I'm not sure if it will make any difference at all for deepcl kernels, but as a general rule, gpu kernels will tend to be more optimized for powers of 2 for the number of feature planes.
Ah. I think that epsilon is probably e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/AdadeltaState.cpp#L32 , i.e. the fuzz factor/fudge factor. It's hardcoded for now. It shouldn't affect very much, I think.
Alright, that means the anneal value is not useful in this instance, right?
learningrate=1.0 is pretty ambitious I think, without using batch normalization, which DeepCL doesn't support. I think you will have more success with a learning rate more like 0.001 or 0.0001.
I just took the default value from the second link, but as I only now saw (facepalm), the Stanford page uses a lower, perhaps more normal, number.
This normalizes incoming images. But I mean normalizing: weights. Or gradients. Or both. Or at least truncating gradients onto the unit ball. Weight decay is fairly standard.
How would I achieve this kind of normalizing?
oh, better to make the number of feature planes a power of 2, on the whole, e.g. 64c3z. I'm not sure if it will make any difference at all for deepcl kernels, but as a general rule, gpu kernels will tend to be more optimized for powers of 2 for the number of feature planes.
The only difference that made was, in one instance, crashing my display driver with 1003 MB of RAM left, and in another, a BSOD.
deepcl_train trainer=adadelta rho=0.95 learningrate=0.001 batchsize=128 numepochs=4000 netdef=4*(64c3z-relu-mp2)-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
Alright, that means the anneal value is not useful in this instance, right?
Well, it's not related to the adadelta fudge factor, yeah. anneal basically slowly reduces the learning rate over time. It's a bit tricky to use though. On the whole I think a standard approach is:
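For reference, one common "slowly reduce the learning rate over time" rule can be sketched as below. This shows the general idea only; the exact formula DeepCL's anneal option uses may differ.

```python
def annealed_lr(base_lr, anneal, epoch):
    """Inverse-time decay: lr = base_lr / (1 + anneal * epoch).

    One common annealing schedule (illustrative; DeepCL's exact
    anneal formula may differ). Note that with a tiny value such
    as anneal=1e-08, the decay is negligible for thousands of
    epochs, so the flag would have little visible effect.
    """
    return base_lr / (1.0 + anneal * epoch)
```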
I just took the default value from the second link, but as I only now saw (facepalm), the Stanford page uses a lower, perhaps more normal, number.
ok
How would I achieve this kind of normalizing?
SGD has weight decay: https://github.com/hughperkins/DeepCL/blob/master/src/trainers/SGD.cpp#L65 . It would have to be added into specific trainers, on a case-by-case basis.
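The weight-decay idea can be sketched generically like this (a NumPy illustration of the standard rule, not the actual DeepCL C++): each step subtracts a small fraction of the current weight in addition to the gradient step.

```python
import numpy as np

def sgd_step(weights, grads, lr=0.001, weight_decay=0.001):
    """One SGD step with L2 weight decay (generic sketch).

    The decay term lr * weight_decay * weights pulls every weight
    slightly toward zero on each step, which is the standard L2
    regularization behaviour.
    """
    return weights - lr * (grads + weight_decay * weights)
```

Even with zero gradient, the weights shrink slightly each step, which keeps them from drifting without bound.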
Only difference that made was in one instance crashing my display driver with 1003 MB left of ram and in another BSOD
Ah. Well... ok. I would guess most of the memory is going into the first few layers (since they're bigger, no mp2's yet), so you could gradually increase the number of planes, something like:
16c3z-relu-mp2-32c3z-relu-mp2-2*(64c3z-relu-mp2)
I haven't tried this, not sure if it's a good architecture, just demonstrating the concept of increasing the number of planes after each -mp2.
SGD has weight decay: https://github.com/hughperkins/DeepCL/blob/master/src/trainers/SGD.cpp#L65 . It would have to be added into specific trainers, on a case-by-case basis.
That would be the "l2_decay" specified on the Stanford page, correct?
I just took the default value from the second link, but as I only now saw (facepalm), the Stanford page uses a lower, perhaps more normal, number.
Actually, I read it wrong; the Stanford page uses a learning rate of 1.0 for the adadelta trainer.
Ah. Well... ok. I would guess most of the memory is going into the first few layers (since they're bigger, no mp2's yet), so you could gradually increase the number of planes, something like:
16c3z-relu-mp2-32c3z-relu-mp2-2*(64c3z-relu-mp2) I haven't tried this, not sure if it's a good architecture, just demonstrating the concept of increasing the number of planes after each -mp2.
It didn't crash my system, which is a good start, and it seems to be doing rather well.
The only difference that made was, in one instance, crashing my display driver with 1003 MB of RAM left, and in another, a BSOD.
L2 decay is what you need. I'm 70% sure that what I linked to is L2, but I'd want to double-check somewhere to be sure it's not L1. (I think it's L2, because the derivative of the squared weight is proportional to the weight itself, so that's why we simply subtract some fraction of the current weight here.)
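To see the L1 vs L2 distinction concretely, compare the penalty gradients directly. A small sketch with hypothetical weights and decay strength (not DeepCL code):

```python
import numpy as np

w = np.array([0.5, -2.0, 1.0])  # hypothetical weights
lam = 0.01                      # decay strength (illustrative)

# L2 penalty (lam/2) * w^2 has gradient lam * w:
# the update subtracts a fraction of the weight itself.
l2_grad = lam * w

# L1 penalty lam * |w| has gradient lam * sign(w):
# a constant-size push toward zero, independent of magnitude.
l1_grad = lam * np.sign(w)
```

So an update that subtracts a fraction of the current weight is the signature of L2 decay; L1 would subtract a fixed amount per weight instead.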
Actually, I read it wrong; the Stanford page uses a learning rate of 1.0 for the adadelta trainer.
Ok
It didn't crash my system, which is a good start, and it seems to be doing rather well.
cool :-)
I just ran it like this: deepcl_train trainer=adadelta rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
And yet again at epoch 8 it goes south
However running this
deepcl_train learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
Seems to be fine; loss goes up a bit at epochs 11 and 12, then back down at 13.
Ok. You mean, using SGD trainer instead of adadelta trainer?
As SGD is the default trainer, yes. So there might be a bug somewhere in the adadelta trainer?
could be... the code is at https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp Feel free to scrutinize/tweak it. Personally, I've mostly used SGD, and not used adadelta too much, so it's not impossible some buggette remains somewhere.
I tried with the adagrad trainer, which is now at epoch 27 and is constantly getting better and better: 99,5794% and a loss of 370,28.
deepcl_train trainer=adagrad rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
So I'll assume something is wrong in the adadelta trainer.
could be... the code is at https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp Feel free to scrutinize/tweak it. Personally, I've mostly used SGD, and not used adadelta too much, so it's not impossible some buggette remains somewhere.
I sadly don't know how it's supposed to be implemented, so I can't really "proofread" it.
So I'll assume something is wrong in the adadelta trainer.
Ok, that's good info. I will file a bug.
edit: oh, the title of this issue, here, this thread, is adadelta error. so ... good :-)
(note: the adadelta paper is here: www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf and the update rule is in 'algorithm 1'. Per iteration t (writing rho as p, delta-x as AX, and gt squared as gt2), the lines of the algorithm are:
3: compute gradient gt
4: accumulate gradient: E[g2]_t = p * E[g2]_{t-1} + (1-p) * gt2
5: compute update: AXt = -(RMS[AX]_{t-1} / RMS[g]_t) * gt
6: accumulate updates: E[AX2]_t = p * E[AX2]_{t-1} + (1-p) * AXt2
7: apply update: x_{t+1} = x_t + AXt
where RMS[y] = sqrt(E[y2] + epsilon))
We'd need to compare this equation with what is written in https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp#L55-L74 Unfortunately, it's a bit illegible, though better than it was originally. The bug could be either in this list of operations (first thing to check), or in the underlying vector arithmetic implementations (though there are unittests for those). On the whole, I guess it's most likely the bug is in this chunk of code, linked from this paragraph, though it's mostly a guess.
What the code says is:
clWorking = clGradWeights;
// copy clGradWeights vector into clWorking vector
clWorking.squared();
// take per-element square of each element in clWorking, so they equal gradient squared
// this is probably the gt2 terms in the equation above (gt is clGradWeights)
// (writing gt squared as gt2)
clWorking *= (1 - decay);
// (1 - decay) is probably (1 - p) (writing rho as p)
// so clWorking now holds (1-p)gt2
clSumGradSquared *= decay;
// by comparison with equation 8 (see below),
// it looks like clSumGradSquared holds the running average of the g2 elements over time,
// ie E[g2]
// so, from the code, we now have:
// clSumGradSquared is: p * E[g2]
clSumGradSquared += clWorking;
// now, clSumGradSquared is: p * E[g2] + (1-p)gt2
// ie, looks like step 4, in the algorithm screenshot above
edit: ah, this bit is equation 8: E[g2]_t = p * E[g2]_{t-1} + (1-p) * gt2
edit2: next bit of code:
clWorking = clSumGradSquared;
// copy p * E[g2] + (1-p)gt2 into clWorking
// so, clWorking is: p * E[g2] + (1-p)gt2
clWorking.inv();
// calculate 1 / clWorking, for each element, so now each element of clWorking is:
// 1 / (p * E[g2] + (1-p)gt2)
clWorking *= clSumUpdateSquared;
// I guess that `update` is delta x in the equation in the screenshot, which we can
// write maybe AX (since A looks a bit like the delta symbol, the triangle)
// I guess that ... hmmm.... seems like we are calculating equation 9 and step 5
// equation 9 is: AXt = -(RMS[AX]_{t-1} / RMS[g]_t) * gt, where RMS[y] = sqrt(E[y2] + epsilon)
but ... in equation 9, there is an epsilon, which is what you mentioned above
... and in the code ... no epsilon :-P maybe this is the bug
Maybe the code should be modified to insert the following line in between line 62 and line 63:
clWorking += 0.0000001f;
(where 0.0000001f is epsilon)
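Putting the walkthrough above together, a minimal NumPy reference of Algorithm 1, with epsilon inside both square roots, might look like this. It is a sketch for cross-checking Adadelta.cpp, not the DeepCL code itself; the class and variable names are my own.

```python
import numpy as np

class AdadeltaRef:
    """Minimal Adadelta per Zeiler (2012), Algorithm 1.

    Sketch for cross-checking the C++ trainer, not DeepCL code.
    Note epsilon appears under BOTH square roots; a missing epsilon
    in the denominator means 1 / sqrt(E[g2]) blows up whenever the
    accumulated gradient is near zero, which matches training
    suddenly going to NaN/INF.
    """

    def __init__(self, shape, rho=0.95, eps=1e-6):
        self.rho = rho
        self.eps = eps
        self.sum_grad_sq = np.zeros(shape)    # E[g2], cf. clSumGradSquared
        self.sum_update_sq = np.zeros(shape)  # E[AX2], cf. clSumUpdateSquared

    def step(self, x, g):
        # step 4: accumulate gradient
        self.sum_grad_sq = self.rho * self.sum_grad_sq + (1 - self.rho) * g * g
        # step 5: compute update (epsilon under both roots)
        dx = (-np.sqrt(self.sum_update_sq + self.eps)
              / np.sqrt(self.sum_grad_sq + self.eps) * g)
        # step 6: accumulate updates
        self.sum_update_sq = (self.rho * self.sum_update_sq
                              + (1 - self.rho) * dx * dx)
        # step 7: apply update
        return x + dx
```

On a toy quadratic, x steadily moves toward the minimum and stays finite, instead of shooting up to infinity.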
Added in 07eb0d6. We'll see if that helps. I should trigger a build, really.
When you get a moment, http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.2.2alpha1.zip has an updated Adadelta. We can try this version, see if it helps.
http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.2.2alpha1.zip
I downloaded it and moved over my Data folder (images and manifest), and it is sadly getting stuck.
If I wait long enough and hit Ctrl+C (cancel), it outputs "Empty input file".
Its RAM usage is at 3,14 GB and CPU usage is 0,00%.
Any updates regarding the issue(s)?
I'm not sure. I think the issue was fixed. There seems to be some problem with the build process in general (I just migrated to msvc2015, a bit gratuitously), and I don't know how to diagnose/fix that. It sounds like a lot of work... I need to pluck up courage, roll up my sleeves, and look into why the new build is not working... Can you try to check if other stuff is working on this build? Or is everything entirely broken on this build?
The unittests seem to run fine.
Alright. What about running simple mnist training? Just normal sgd and so on? Is it only adadelta that is broken? Or are things more broken generally? If it's just adadelta that's broken, that simplifies things, since then it's not a build issue, just some logic issue in adadelta, which should be not too hard for me to fix, hopefully, probably...
I was about to try using another trainer on that build and noticed a bug with the loading of manifest data.
If the manifest and its data are not on the C drive, it fails to parse it; this also happens on the previous build.
Arguments:
trainer=adagrad rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-3n datadir="E:\Data" trainfile=Train.txt validatefile=Validation.txt
Error:
Something went wrong: Error reading E:\Data/Train.txt. Following line not parseable: E:\Data\0-Unknown\05d2dd67-3369-4c9c-abd3-29a0c0f83f15.jpeg 0
Hmmm, good spot. So, you're saying the most recent alpha build mostly works, but there are a couple of bugs, e.g. it doesn't handle drives other than the C: drive for the manifest?
I moved the data back over to the C drive and tested there; it got further, but once one path in the manifest had a space in it, it threw the same error.
ok, cool. sounds like the build is ok (ie, it's not an msvc2015 issue, which sounds painful to debug...), but just some code bug. I'll try to spin up a windows instance and take a peek.
I now ran the same commands on the alpha build you linked, which resulted in it getting stuck.
Oh. I kind of thought you were testing on the alpha build :-P Can you find out what does/doesn't run on the alpha build? Is there anything at all working on the alpha build?
I was testing on both the previous build and the alpha build to see what was what
The issue on both is: if one of the paths in the manifest has a space in it, it fails.
The issue on alpha is: it gets stuck, see screenshot in previous comment.
Is there anything at all working on the alpha build?
The unittests run fine; I haven't tested predict.
understood that on alpha it gets stuck. Question: are there any training scenarios on alpha which don't get stuck? eg, if you use sgd etc, does it still get stuck? or only when using adadelta?
Tested adagrad, adadelta and sgd. All get stuck at the same location. It seems to be the manifest loading, not the trainers, for this issue.
ok. if there's no manifest stuff, does it work then?
Note that this is pending more information on your part about under what circumstances it gets stuck :-)
Running deepcl_train datadir=c:\mnist numtrain=1280 numtest=1280
works.
Could I just add trainer=adadelta, or is sgd hardcoded for the mnist set?
You should be able to use any trainer. There is nothing magical about mnist particularly. Looks like the manifest loader is broken. That's going to be a lot easier to fix than an entirely broken build. I'll try to spin up a Windows box after lunch and take a peek. (If you have a moment to check that adadelta works ok-ish on mnist, at least that it doesn't get stuck, that would be very useful.)
The adadelta trainer got up to 99,9% accuracy on the training set and 95,78% on the testing set. I let it run for 220 epochs and it didn't go to INF/IND.
I got the same result on build 8.5.1, which, when run on my data, goes to INF/IND.
Ok, so, just to summarize:
manifest hangs on alpha
Correct
manifest doesn't accept the E: drive on alpha
It's the space in the path that does it; I don't think the drive letter has anything to do with it.
adadelta working ok on alpha?
Both alpha and 8.5.1 worked with adadelta on mnist. Once the manifest works, I can test it on my data set.
Ok. current hypothesis: something wrong with the build of libjpeg I'm using.
(switching jpeg libraries https://github.com/hughperkins/DeepCL/commit/473c11dc851d5e8d0fcd0615d1d2938e00f2d99a ... building....)
Haven't checked/fixed anything to do with spaces or the E: drive yet, but the deepcl-win64-v11.3.0alpha1.zip build switches jpeg libraries, and might at least no longer freeze??? http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.3.0alpha1.zip (I haven't tested it yet; but the unittests do at least run better than before, which is a good start)
It did not get stuck :) Loading also seems faster.
Testing adadelta now
cool :-)
I noticed that if the validation set points to a file that the training set also points to, it won't get loaded for validation.
what do you mean by "loaded"? Like, a message on the stdout?
As a test, I had the contents of test.txt and validation.txt be identical, meaning the training and test results should be identical? I assumed some files didn't load.
Ah. Well, if you look at the number of images for each, ie the /6696, it is the same.
The train and test accuracies are different for two reasons, in general:
1. random data augmentation (randomtranslations and so on); these are turned off for test
2. the train accuracy is accumulated during the epoch, while the weights are still changing
Normally though, I'd expect the test accuracy to be at least as high as the train accuracy, on the whole, if the data is the same, since the weights used for test will be the most recent weights, which should be the best.
Hello,
I was looking at the different trainers and reading some documents on them when I noticed a value called "epsilon". This value is nowhere to be seen in the API documentation, and thus I assume it's missing. (Unless it's the "anneal" option, which would be awkward for me.)