maym2104 / keras-pysc2

StarCraft II DRL agents using keras and pysc2

Keras agent not learning #1

Open maym2104 opened 6 years ago

maym2104 commented 6 years ago

For now, the agent does not seem to learn anything, or at least not the right thing.

The loss is non-zero and the weights change after each training pass, but the policy appears to stay the same (i.e. random) even after several training steps.

For my implementation of A2C, I took inspiration from keras-rl, which uses a third model (for DQN at least), called trainable_model, to compute the total loss instead of letting the Keras engine do it. This has the advantage of computing the loss from arguments other than y_true and y_pred, and of giving finer control over what is computed. Note that I've tried other approaches that did not use a third or even a second model, without success either.
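For reference, here is a minimal sketch of that trainable-model pattern applied to A2C. Everything in it (state size, `n_actions`, the 0.5 and 0.01 loss weights, layer sizes) is an illustrative assumption, not code from this repo:

```python
# Minimal sketch of the keras-rl "trainable model" trick for A2C
# (Keras 2.x functional API, TF 1.x backend). All names/values illustrative.
import keras.backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

n_actions = 4          # hypothetical action-space size
state_dim = 8          # hypothetical observation size

# Base network: state -> (policy, value). This is the model used to act.
state = Input(shape=(state_dim,))
h = Dense(64, activation='relu')(state)
policy = Dense(n_actions, activation='softmax', name='policy')(h)
value = Dense(1, name='value')(h)
model = Model(inputs=state, outputs=[policy, value])

# Extra inputs that the standard (y_true, y_pred) loss signature cannot carry.
action = Input(shape=(n_actions,))    # one-hot encoding of the chosen action
advantage = Input(shape=(1,))         # advantage estimate A(s, a)
returns = Input(shape=(1,))           # discounted return, value-head target

def a2c_loss(args):
    policy, value, action, advantage, returns = args
    log_prob = K.log(K.sum(policy * action, axis=-1, keepdims=True) + 1e-8)
    policy_loss = -log_prob * advantage
    value_loss = K.square(returns - value)
    entropy = -K.sum(policy * K.log(policy + 1e-8), axis=-1, keepdims=True)
    # Per-sample total loss; the 0.5 / 0.01 weights are common but arbitrary.
    return policy_loss + 0.5 * value_loss - 0.01 * entropy

loss_out = Lambda(a2c_loss, output_shape=(1,), name='a2c_loss')(
    [policy, value, action, advantage, returns])

# The "3rd model": it shares the base model's layers, but its output *is*
# the loss, so compiling with an identity loss makes Keras minimize it.
trainable_model = Model(inputs=[state, action, advantage, returns],
                        outputs=loss_out)
trainable_model.compile(optimizer='rmsprop',
                        loss=lambda y_true, y_pred: y_pred)
```

Training then calls something like `trainable_model.train_on_batch([states, actions, advantages, returns], np.zeros((batch_size, 1)))` with a dummy target, since the real loss is already baked into the graph.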

maym2104 commented 6 years ago

I managed to have a working version by changing the following elements:

Of all the changes, only the first two seem important. Using a split instead of [] indexing is particularly surprising; bracket indexing is a tensor operation like any other, and I would have thought that back-propagation was implemented for it as well.
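For context, a minimal sketch of the two variants, assuming an old Keras 2 / TF 1.x setup where raw slicing of a layer output can fall outside the Keras graph; `combined` and `n_actions` are hypothetical names:

```python
# Illustrative: splitting a combined (policy, value) output tensor.
# In older Keras versions, raw [] slicing produces a plain TF tensor that
# is not registered as a Keras layer output, which can silently detach it
# from the model graph:
#   policy = combined[:, :n_actions]   # may break the Keras graph
# Wrapping the operation in a Lambda layer keeps it inside the graph:
import tensorflow as tf
from keras.layers import Input, Dense, Lambda
from keras.models import Model

n_actions = 4                              # hypothetical
state = Input(shape=(8,))
combined = Dense(n_actions + 1)(state)     # policy logits + value, one tensor

policy = Lambda(lambda t: t[:, :n_actions])(combined)
value = Lambda(lambda t: t[:, n_actions:])(combined)
# or with tf.split inside a single Lambda (returning a list of tensors
# works in recent Keras 2 releases):
policy, value = Lambda(lambda t: tf.split(t, [n_actions, 1], axis=-1))(combined)

model = Model(inputs=state, outputs=[policy, value])
```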

Once I've cleaned up the working version, I will commit it here.

maym2104 commented 6 years ago

See this issue on optimizers: https://github.com/keras-team/keras/issues/5564. In my case, when I pass an (RMSProp) object, it rapidly degenerates into a no-op. With a string/dictionary carrying the exact same parameters (which is only serialized once the optimizer is passed as an argument to the compile function), I get random behaviour, but I'm not sure yet whether it's going to learn anything.
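To make the distinction concrete, a hedged sketch of the two ways of passing the optimizer; the model and the hyperparameter values are illustrative stand-ins, not the repo's:

```python
# Illustrative: optimizer passed as an instance vs. by identifier/config.
from keras.layers import Dense
from keras.models import Sequential
from keras import optimizers

model = Sequential([Dense(1, input_shape=(8,))])  # stand-in model

# 1) Passing a pre-built instance (the variant reported above to
#    degenerate into a no-op in this setup):
model.compile(optimizer=optimizers.RMSprop(lr=7e-4, rho=0.99, epsilon=1e-5),
              loss='mse')

# 2) Passing an identifier, which Keras deserializes at compile time:
model.compile(optimizer='rmsprop', loss='mse')

# 2b) Same, but with explicit parameters via a config dictionary:
opt = optimizers.get({'class_name': 'RMSprop',
                      'config': {'lr': 7e-4, 'rho': 0.99, 'epsilon': 1e-5}})
model.compile(optimizer=opt, loss='mse')
```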

maym2104 commented 6 years ago

I managed to use RMSprop from Keras. I updated all the libraries (I don't know if that changed anything). It learns, but apparently more slowly than with the TF optimizer, and significantly more slowly than a TF agent with the TF RMSProp optimizer. On minigames other than MoveToBeacon, it gets stuck after a while. For example, it never (or rarely) gets a score above 40 in CollectMineralShards (roughly 2 boards of minerals). The performance just plateaus at that point and never reaches 100 like other implementations do.
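For comparison, the native TF optimizer mentioned above can also be wrapped for use inside Keras; a hedged sketch, assuming Keras 2 on the TF 1.x backend with illustrative hyperparameters and a stand-in model:

```python
# Illustrative: driving a Keras model with the native TF RMSProp optimizer.
import tensorflow as tf
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import TFOptimizer

model = Sequential([Dense(1, input_shape=(8,))])   # stand-in model

# Wrap the TF optimizer so model.compile()/fit() can use it:
tf_opt = TFOptimizer(tf.train.RMSPropOptimizer(learning_rate=7e-4,
                                               decay=0.99,
                                               epsilon=1e-5))
model.compile(optimizer=tf_opt, loss='mse')
```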