maym2104 opened this issue 6 years ago
I managed to get a working version by changing the following elements:
Of all the changes, only the first two seem important. Using a split instead of [] is particularly surprising: bracket indexing is a tensor operation like any other, and I would have thought that back-propagation was implemented for it as well.
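To illustrate the split-vs-brackets point, here is a hedged sketch with made-up shapes and names (not the actual model code), showing the two equivalent ways of carving up a tensor:

```python
import tensorflow as tf

x = tf.reshape(tf.range(24, dtype=tf.float32), (8, 3))  # stand-in tensor, e.g. concatenated outputs

# bracket/slice indexing: an ordinary tensor op, which does support backprop
policy_part, value_part = x[:, :2], x[:, 2:]

# the tf.split form that ended up working for me
policy_part, value_part = tf.split(x, [2, 1], axis=-1)
```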
Once I have cleaned up the working version, I will commit it here.
See this issue on optimizers: https://github.com/keras-team/keras/issues/5564. In my case, when I pass an optimizer (RMSProp) object, it rapidly degrades to a no_op behaviour. With a string/dictionary carrying the exact same parameters (the optimizer is only serialized once it is passed as an argument to the compile function), I get random behaviour, but I am not sure yet whether it will learn anything.
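To make the comparison concrete, here is a minimal sketch of the two compile() calls being contrasted (Keras 2-era API; the layer sizes and hyperparameters are placeholders, not the ones from this repo):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

model = Sequential([Dense(1, input_shape=(4,))])

# (a) passing an optimizer *instance* -- the case that rapidly became a no_op for me
model.compile(optimizer=RMSprop(lr=7e-4, rho=0.99, epsilon=1e-5), loss='mse')

# (b) passing a dict (or a plain string such as 'rmsprop') with the same
# parameters; it is only deserialized into an optimizer inside compile()
model.compile(optimizer={'class_name': 'RMSprop',
                         'config': {'lr': 7e-4, 'rho': 0.99, 'epsilon': 1e-5}},
              loss='mse')
```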
I managed to use RMSProp from Keras. I updated all the libraries (I don't know if that changed anything). It learns, but apparently more slowly than with the TF optimizer, and significantly more slowly than a TF agent with the TF RMSProp optimizer. On minigames other than MoveToBeacon, it gets stuck after a while. For example, it never (or rarely) gets a score above 40 in CollectMineralShards (which is about 2 boards of minerals). The performance just plateaus at that point and never reaches 100 like other implementations do.
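For reference, the "native" route I am comparing against looks roughly like this (a TF1-era sketch with a stand-in loss; the hyperparameters are placeholders, not the ones from either agent):

```python
import tensorflow as tf  # TF 1.x style (use tf.compat.v1 under TF 2.x)

obs = tf.placeholder(tf.float32, [None, 16])
w = tf.Variable(tf.random_normal([16, 4]))
loss = tf.reduce_mean(tf.square(tf.matmul(obs, w)))  # stand-in for the A2C loss

# the TF optimizer minimizes the loss tensor directly, bypassing compile()/fit()
train_op = tf.train.RMSPropOptimizer(
    learning_rate=7e-4, decay=0.99, epsilon=1e-5).minimize(loss)
```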
For now, the agent does not seem to learn anything, or at least not the right thing.
The loss is non-zero and the weights change after each training pass, but the policy seems to stay the same (i.e. random) even after several training steps.
For my implementation of A2C, I took inspiration from keras-rl, which uses a third model (for DQN at least), called trainable_model, to compute the total loss instead of letting the Keras engine do it. This has the advantage of computing the loss from arguments other than y_true and y_pred, and of giving finer control over what is computed. Note that I have tried other approaches that did not use a third (or even a second) model, without success either.
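For illustration, here is a minimal sketch of that pattern in Keras 2-era code (hypothetical layer sizes, names and loss coefficients, not the actual model from this repo): a Lambda layer computes the per-sample A2C loss from extra inputs, and the wrapping "trainable" model is compiled with a loss that simply returns that output.

```python
from keras.layers import Input, Dense, Lambda
from keras.models import Model
import keras.backend as K

# base actor-critic network (toy sizes); this is the model used for acting
obs = Input(shape=(16,), name='obs')
h = Dense(32, activation='relu')(obs)
policy = Dense(4, activation='softmax', name='policy')(h)
value = Dense(1, name='value')(h)
base_model = Model(inputs=obs, outputs=[policy, value])

# extra tensors the loss needs but that Keras would never pass to a loss function
actions = Input(shape=(4,), name='actions_onehot')
advantages = Input(shape=(1,), name='advantages')
returns = Input(shape=(1,), name='returns')

def a2c_loss(args):
    pi, v, act, adv, ret = args
    log_prob = K.log(K.sum(pi * act, axis=-1, keepdims=True) + 1e-8)
    policy_loss = -log_prob * adv                                    # (batch, 1)
    value_loss = K.square(ret - v)                                   # (batch, 1)
    entropy = -K.sum(pi * K.log(pi + 1e-8), axis=-1, keepdims=True)  # (batch, 1)
    return policy_loss + 0.5 * value_loss - 0.01 * entropy

loss_out = Lambda(a2c_loss, output_shape=(1,), name='a2c_loss')(
    [policy, value, actions, advantages, returns])

# the extra "trainable" model: its output *is* the per-sample loss, so compile()
# just minimizes y_pred and ignores the dummy y_true passed to train_on_batch()
trainable_model = Model(inputs=[obs, actions, advantages, returns],
                        outputs=loss_out)
trainable_model.compile(optimizer='rmsprop',
                        loss=lambda y_true, y_pred: y_pred)

# usage: trainable_model.train_on_batch(
#     [obs_batch, act_batch, adv_batch, ret_batch],
#     a zero array of shape (batch_size, 1) as the dummy target)
```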