hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks

Python Q-Learning - Add Dropout Layer causes runtime error #129

Closed: 0StackOverflow0 closed this issue 6 years ago

0StackOverflow0 commented 6 years ago

If I add a dropout layer to the Q-Learning example in Python (I'm fiddling with a larger grid and network), I get the following error.

```
Traceback (most recent call last):
  File "C:\Program Files\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "C:\Program Files\Anaconda3\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "QLearning.pyx", line 10, in PyDeepCL.QLearner._run (PyDeepCL.cpp:17740)
RuntimeError: need to copyToDevice() before calling kernel->input
```

The dropout layer is being placed after the first fully connected layer and before the subsequent activation function.
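In netdef terms (borrowing the layer-string syntax that shows up later in this thread; the sizes here are placeholders, not my actual network), the placement looks something like this:

```python
# Placeholder netdef, for illustration only: dropout sits after the first
# fully connected layer (128n) and before its activation (tanh).
netdef = '8c3z-relu-128n-drop(0.5)-tanh-4n'
```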

I'm not so familiar with OpenCL (though that's why I came to DeepCL). I'll read through the related code segments to get a better understanding of DeepCL. I can't promise that I can solve this, but I'll try to help.

I'm hoping to later create a much more complex DQN, and I'm interested in seeing how this regularization technique helps.

hughperkins commented 6 years ago

> I'm not so familiar with OpenCL (though that's why I came to DeepCL). I'll read through the related code segments to get a better understanding of DeepCL. I can't promise that I can solve this, but I'll try to help.

Ok. The error message is something like: "need to copyToDevice() before calling kernel->input".

0StackOverflow0 commented 6 years ago

It's definitely happening during backprop, but I still can't quite figure out why.

I have a feeling it has to do with the limited set of experiences in the beginning.

I did notice that, while gradInputWrapper and outputWrapper call createOnDevice() in https://github.com/hughperkins/DeepCL/blob/master/src/dropout/DropoutLayer.cpp#L86, weightsWrapper does not.

However, in DropoutLayer::backward, copyToDevice() is called on weightsWrapper, which should satisfy all input requirements, because that function will createOnDevice() if the wrapper isn't already on the device. (And gradOutputWrapper is sourced from the next layer.)

I don't really see anything different from conv/backward, or activate/backward.

The best I can think to do, with my limited troubleshooting knowledge, is to add some logging and recompile.

0StackOverflow0 commented 6 years ago

As an aside, I was testing the following setup in your Q-Learning example (Gridsize = 15, Random = True):

network: 18C9z-8C3z-128N-4N
LR = 0.0025, Reg = 0.0000001
qLambda = 0.945, maxSamples = 128, epsilon = 0.1
rewards: move = -0.03, wall = -0.1, win = 1.0

I let it run overnight (900+ rounds now), and it seems quite adept at the game for the most part; then one game (roughly one in 50) will take quite a while (10k+ actions), and then it handles the next ones with no problem.

I'm interested to see how it's doing by the end of today, but I'm curious if you have some suggestions for experimentation.

(After 1800 games, it seems to have mastered the 15x15wRand)

0StackOverflow0 commented 6 years ago

Also, this may not be the best place to ask, but if I wanted to create a Q-network that takes complex actions (simultaneous actions, and actions with variability), would I really need to create an action set combining all possible action states (i.e. hundreds)?

Do you think it would be possible to have a set of outputs where a portion is treated as action states and the rest as action variation (e.g. 5 outputs: [right, left, down, up, distance], where distance, if bounded to -1..1, could be scaled using gridSize)? I could try customizing QLearner and modifying the example scenario to test this out.

I've read that a sigmoid activation with a multinomial cross-entropy loss can allow multi-class selection in classification problems. I was thinking that if an action set contained items that should be selected simultaneously, there could be action groupings: each group gets its highest-Q selection, and each selected index is passed forward (i.e. action groupings = direction [right, left, down, up], distance [1, 2, 3... N]; a valid action would be [left, 5]). Depending on the granularity of the variability, this could still lead to many output nodes.
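To make the grouping idea concrete, here's a rough sketch of decoding grouped outputs into one composite action (plain NumPy, purely illustrative; none of these names are PyDeepCL API):

```python
import numpy as np

# Hypothetical Q outputs for one state, split into two groups.
q_values = np.array([0.1, 0.7, 0.2, 0.0,        # direction group: right, left, down, up
                     0.3, 0.9, 0.1, 0.2, 0.4])  # distance group: 1..5 cells

direction_q = q_values[:4]
distance_q = q_values[4:]

# Pick the highest-Q entry within each group, then combine them.
direction = int(np.argmax(direction_q))          # 1 -> "left"
distance = int(np.argmax(distance_q)) + 1        # index 0 means "move 1 cell"

action = (direction, distance)                   # (1, 2) -> "left, 2 cells"
print(action)
```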

What do you think?

0StackOverflow0 commented 6 years ago

All I can say further for now is that it happens during the second step of QLearner.run(), when it is training on the first experience, during backprop of the dropout layer.

0StackOverflow0 commented 6 years ago

I figured out a hack.

If I call net.setTraining(True), then it works just fine. (Edit: I later used setTraining(False) in the first act(self, index); this way the training bit was only set once. The network continued to work, but I think that, for all intents and purposes, training should only be set while actually training (I believe this affects only the random patch and dropout layers). That would give optimal action choice and experience replay while still getting the true benefits of a dropout layer.)
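Roughly, on the Python side (a sketch only: net.setTraining(), act(self, index) and qlearner.run() are the calls mentioned in this thread, while the surrounding structure is just an outline of the bundled q-learning example, not exact code):

```python
# Option 1: set the training flag once, before starting the learner, so the
# dropout masks are generated on every forward pass.
net.setTraining(True)
qlearner.run()

# Option 2 (what I ended up trying): leave training on for the learner, but
# clear it in the scenario's first act() call, so action prediction runs the
# full network.
def act(self, index):
    if not self.training_cleared:       # hypothetical flag on the scenario object
        self.net.setTraining(False)
        self.training_cleared = True
    # ...apply the chosen action to the grid and return the reward as usual...
```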

I believe the training flag should be toggled before and after this line: https://github.com/hughperkins/DeepCL/blob/master/src/qlearning/QLearner.cpp#L99

And here, I believe, is the culprit (though not necessarily the underlying issue), but I can't yet confirm: https://github.com/hughperkins/DeepCL/blob/master/src/dropout/DropoutLayer.cpp#L208

hughperkins commented 6 years ago

Ok. Thinking about it, without looking at the code: the dropout layer needs to keep the dropout tensor around for backprop. If turning training off fixes a problem during backprop, it sounds like that tensor is somehow being recreated? You might want to check carefully which Wrapper object holds the dropout tensor, i.e. the 1s and 0s choosing which inputs to forward to the output, and what happens to it, its lifecycle, etc.

Your analysis sounds excellent! Very much appreciate your looking into this :)

0StackOverflow0 commented 6 years ago

From what I see, masks is declared as a new unsigned char[]. This is the wrapped variable.

However, its contents are not assigned until generateMasks() is called, which only happens during forward prop when training is set to true. This declared but uninitialized array is what backprop attempts to copy.

This is probably the issue.

Also, from my understanding (and I may be wrong), training shouldn't be set during predictions, because you want the full network. However, in your alternate 'forward' path (not training), you multiply the inputs by dropRatio (depressing every node), whereas I believe it is supposed to be a one-to-one copy: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf page 1931, Figure 2.

(edit: I read the document further and found your implementation to be perfect... my bad)
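For anyone following along, the scheme from the paper, as a tiny NumPy sketch (keep_prob is a generic name here, not a claim about what DeepCL's dropRatio means):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)       # some layer's input
keep_prob = 0.5              # probability that a unit is kept

# Training-time forward: sample a binary mask and zero out the dropped units.
# This is the mask that has to stick around for the backward pass.
mask = (rng.random(x.shape) < keep_prob).astype(x.dtype)
train_out = x * mask

# Inference-time forward: no mask; scale by the keep probability so the
# expected activation matches training (JMLR dropout paper, Figure 2).
infer_out = x * keep_prob
```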

I was having issues compiling, but I figured it out now. I'll make a pull request after I can test the changes. I'm running a couple experiments, so I'll wait for those first.

0StackOverflow0 commented 6 years ago

My reruns failed >.> The dropout run got twice as far as the one without, but the weights vanished (after 150 games).

With some testing of my modifications, I can say that it doesn't crash.

The non-training forward implementation is up to you.

However, my idea of only enabling dropout during the actual training step, rather than across the whole learnFromPast(), seems not to work.

I get vanishing weights fairly early... (Python reports nan for every weight.) (I've been writing the weights out to a file now, to review their dispersion and to see whether a partially trained network, after losing its experiences, can build better ones.)

I'm now testing setting the training flag for the whole of learnFromPast(), while letting action prediction use the full network.

This seems to be working as intended, but I'll wait til tomorrow to confirm.

0StackOverflow0 commented 6 years ago

I think there might be some conceptual issues with the experience replay. (edit: nope, wrong again; your implementation is standard)

I'll try digging into the white papers to see if I'm right. If so, I'll code up a proposal.

(edit: The ideas I was having seem to be referred to as prioritized (ranked) experience replay (possibly not that easy to implement, though alternatively I could experiment with removing uninteresting old experiences), preventing loops from doping the experiences, and enhanced experience replay (tracking valuable sequences of actions and replaying them backwards).)

The last piece seems pretty important for a complex environment. Notably, I've been bothered by the standard experience replay implementation because it doesn't take the full future reward into account.

Q(s, a) = r + decay * Q(s', a')

The consequence of this definition is that you don't know the second half until later, but an experience's reward value isn't updated when the subsequent reward is eventually found. So during replay, the estimate of future reward is made poorly, using our own (then-current) predictions.

By replaying a rewardful sequence backwards, you could effectively compensate for this disparity, giving a real decay to the reward and helping out the possible state/actions that led to it.
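A rough sketch of what I mean by replaying a rewardful sequence backwards, using a tabular Q function for clarity (the episode, rewards and decay value are made up):

```python
from collections import defaultdict

ACTIONS = ('right', 'left', 'down', 'up')
decay = 0.945                       # the qLambda-style decay factor
q = defaultdict(float)              # Q(s, a), defaulting to 0

# One rewardful episode as (state, action, reward, next_state) tuples, in the
# order it was experienced; the terminal win is the last entry.
episode = [('s0', 'right', -0.03, 's1'),
           ('s1', 'right', -0.03, 's2'),
           ('s2', 'down',   1.00, None)]

# Replay it backwards: by the time an earlier step is updated, its successor
# state already reflects the terminal reward, so the reward decays cleanly
# back along the whole sequence in a single pass.
for state, action, reward, next_state in reversed(episode):
    future = 0.0 if next_state is None else max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] = reward + decay * future    # Q(s,a) = r + decay * Q(s',a')

print(dict(q))
```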

0StackOverflow0 commented 6 years ago

Sorry for making this a microblog.

I found that, given an environment this large, a distance-based reward was helpful.

I attempted to reload a network (wiping all experience) after it had completed 50 and 500 rounds: 2x15x15-8C5z-drop(0.5)-relu-8C7-relu-128N-tanh-mse-4x1x1

Each of these had two reloads: one with the full network, and one with a perceptron (static conv network) plus the loaded FC layers: 2x15x15-8C5z-drop(0.5)-relu-8C7-relu-2x8x8 --> 128N-sgmd-sftmx-4x1x1

I found that I kept getting trapped in a local minimum (a very common one, where the entire Q field becomes homogeneous), until I changed the last activations to sigmoid/softmax for the dual network, or scaledTanh/softmax for the full one.

It's not quite there yet, but the perceptron (trained 500 rounds) with the changed final activations seems to behave about as well as the first successful network did after 1200+ rounds (though that one had no distance reward, was a larger network, and was incredibly lucky to converge at all; it took 1800+ rounds, no less).

edit: LOL, I realized my network had blurry vision but nice performance, so I looked at my second layer: 2x15x15-8C5z-drop(0.5)-elu-18C5-elu-128N-scaledtanh-mse-4x1x1

I changed the second conv layer's relu to elu, and the tanh to scaledTanh. Decay was reduced to 0.8. It very nearly converged after only 85 rounds, in less than 30 minutes.

After about 350 rounds, the weights disappeared again...

2x15x15-8C5z-drop(0.5)-relu-18C5-linear-128N-tanh-mse-4x1x1

That seems to do the job nicely.

For a run without a distance-based reward, a negative reward for a non-winning move seems to hold the network back; without a prioritized replay (doping it with terminal moves) it'll take a long time to converge, but now it shouldn't get caught in a local minimum at least (and should thus be reproducible).