maym2104 / keras-pysc2

StarCraft II DRL agents using keras and pysc2

Keras agent not learning #1

Open maym2104 opened 6 years ago

maym2104 commented 6 years ago

For now, the agent does not seem to learn anything, or at least not the right thing.

The loss is non-zero and the weights change after each training pass, but the policy appears to stay the same (i.e. random) even after several training steps.

For my implementation of A2C, I took inspiration from keras-rl, which uses a third model (for DQN at least), called trainable_model, to compute the total loss instead of letting the Keras engine do it. This has the advantage of computing the loss from arguments other than y_true and y_pred, and of giving finer control over what is computed. Note that I've tried other approaches that did not use a third or even a second model, without success either.
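For reference, here is a minimal sketch of that trainable-model pattern applied to A2C. Everything in it (state size, `n_actions`, the 0.5 and 0.01 loss weights, layer sizes) is an illustrative assumption, not code from this repo:

```python
# Minimal sketch of the keras-rl "trainable model" trick for A2C
# (Keras 2.x functional API, TF 1.x backend). All names/values illustrative.
import keras.backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

n_actions = 4          # hypothetical action-space size
state_dim = 8          # hypothetical observation size

# Base network: state -> (policy, value). This is the model used to act.
state = Input(shape=(state_dim,))
h = Dense(64, activation='relu')(state)
policy = Dense(n_actions, activation='softmax', name='policy')(h)
value = Dense(1, name='value')(h)
model = Model(inputs=state, outputs=[policy, value])

# Extra inputs that the standard (y_true, y_pred) loss signature cannot carry.
action = Input(shape=(n_actions,))    # one-hot encoding of the chosen action
advantage = Input(shape=(1,))         # advantage estimate A(s, a)
returns = Input(shape=(1,))           # discounted return, value-head target

def a2c_loss(args):
    policy, value, action, advantage, returns = args
    log_prob = K.log(K.sum(policy * action, axis=-1, keepdims=True) + 1e-8)
    policy_loss = -log_prob * advantage
    value_loss = K.square(returns - value)
    entropy = -K.sum(policy * K.log(policy + 1e-8), axis=-1, keepdims=True)
    # Per-sample total loss; the 0.5 / 0.01 weights are common but arbitrary.
    return policy_loss + 0.5 * value_loss - 0.01 * entropy

loss_out = Lambda(a2c_loss, output_shape=(1,), name='a2c_loss')(
    [policy, value, action, advantage, returns])

# The "3rd model": it shares the base model's layers, but its output *is*
# the loss, so compiling with an identity loss makes Keras minimize it.
trainable_model = Model(inputs=[state, action, advantage, returns],
                        outputs=loss_out)
trainable_model.compile(optimizer='rmsprop',
                        loss=lambda y_true, y_pred: y_pred)
```

Training then calls something like `trainable_model.train_on_batch([states, actions, advantages, returns], np.zeros((batch_size, 1)))` with a dummy target, since the real loss is already baked into the graph.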

maym2104 commented 6 years ago

I managed to have a working version by changing the following elements:

Of all the changes, only the first two seem important. Using a split instead of [] indexing is particularly surprising; bracket indexing is a tensor operation like any other, and I would have thought that back-propagation was implemented for it as well.
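For context, a minimal sketch of the two variants, assuming an old Keras 2 / TF 1.x setup where raw slicing of a layer output can fall outside the Keras graph; `combined` and `n_actions` are hypothetical names:

```python
# Illustrative: splitting a combined (policy, value) output tensor.
# In older Keras versions, raw [] slicing produces a plain TF tensor that
# is not registered as a Keras layer output, which can silently detach it
# from the model graph:
#   policy = combined[:, :n_actions]   # may break the Keras graph
# Wrapping the operation in a Lambda layer keeps it inside the graph:
import tensorflow as tf
from keras.layers import Input, Dense, Lambda
from keras.models import Model

n_actions = 4                              # hypothetical
state = Input(shape=(8,))
combined = Dense(n_actions + 1)(state)     # policy logits + value, one tensor

policy = Lambda(lambda t: t[:, :n_actions])(combined)
value = Lambda(lambda t: t[:, n_actions:])(combined)
# or with tf.split inside a single Lambda (returning a list of tensors
# works in recent Keras 2 releases):
policy, value = Lambda(lambda t: tf.split(t, [n_actions, 1], axis=-1))(combined)

model = Model(inputs=state, outputs=[policy, value])
```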

Once I've cleaned up the working version, I will commit it here.

maym2104 commented 6 years ago

See this issue on optimizers: https://github.com/keras-team/keras/issues/5564. In my case, when I pass an (RMSProp) object, it rapidly degenerates into a no-op. With a string/dictionary carrying the exact same parameters (which is only serialized once the optimizer is passed as an argument to the compile function), I get random behaviour, but I'm not sure yet whether it's going to learn anything.
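To make the distinction concrete, a hedged sketch of the two ways of passing the optimizer; the model and the hyperparameter values are illustrative stand-ins, not the repo's:

```python
# Illustrative: optimizer passed as an instance vs. by identifier/config.
from keras.layers import Dense
from keras.models import Sequential
from keras import optimizers

model = Sequential([Dense(1, input_shape=(8,))])  # stand-in model

# 1) Passing a pre-built instance (the variant reported above to
#    degenerate into a no-op in this setup):
model.compile(optimizer=optimizers.RMSprop(lr=7e-4, rho=0.99, epsilon=1e-5),
              loss='mse')

# 2) Passing an identifier, which Keras deserializes at compile time:
model.compile(optimizer='rmsprop', loss='mse')

# 2b) Same, but with explicit parameters via a config dictionary:
opt = optimizers.get({'class_name': 'RMSprop',
                      'config': {'lr': 7e-4, 'rho': 0.99, 'epsilon': 1e-5}})
model.compile(optimizer=opt, loss='mse')
```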

maym2104 commented 6 years ago

I managed to use RMSprop from Keras. I updated all the libraries (I don't know if that changed anything). It learns, but apparently more slowly than with the TF optimizer, and significantly more slowly than a TF agent with the TF RMSProp optimizer. On minigames other than MoveToBeacon, it gets stuck after a while. For example, it never (or rarely) gets a score above 40 in CollectMineralShards (roughly 2 boards of minerals). The performance just plateaus at that point and never reaches 100 like other implementations do.
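For comparison, the native TF optimizer mentioned above can also be wrapped for use inside Keras; a hedged sketch, assuming Keras 2 on the TF 1.x backend with illustrative hyperparameters and a stand-in model:

```python
# Illustrative: driving a Keras model with the native TF RMSProp optimizer.
import tensorflow as tf
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import TFOptimizer

model = Sequential([Dense(1, input_shape=(8,))])   # stand-in model

# Wrap the TF optimizer so model.compile()/fit() can use it:
tf_opt = TFOptimizer(tf.train.RMSPropOptimizer(learning_rate=7e-4,
                                               decay=0.99,
                                               epsilon=1e-5))
model.compile(optimizer=tf_opt, loss='mse')
```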