google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.
Apache License 2.0

Methods of evaluating RL agents #95

Closed tcfuji closed 4 years ago

tcfuji commented 4 years ago

Hi,

This might be more of an RL theory question, but I came across some interesting behavior after making slight adjustments to open_spiel/python/examples/breakthrough_dqn.py. I wanted to see how well the two DQN agents played against each other during evaluation, so I edited eval_against_random_bots so that they only faced each other instead of the random agents (by commenting out lines 56 and 57). I'm relatively new to RL, so it was difficult for me to explain the output beyond handwaving:

I1016 13:09:37.395194 4360885696 breakthrough_dqn.py:112] [1000] Mean episode rewards [ 1. -1.]
I1016 13:09:52.865281 4360885696 breakthrough_dqn.py:112] [2000] Mean episode rewards [ 1. -1.]
I1016 13:10:06.920001 4360885696 breakthrough_dqn.py:112] [3000] Mean episode rewards [-1.  1.]
I1016 13:10:21.182298 4360885696 breakthrough_dqn.py:112] [4000] Mean episode rewards [-1.  1.]
I1016 13:10:31.817948 4360885696 breakthrough_dqn.py:112] [5000] Mean episode rewards [ 1. -1.]
I1016 13:10:52.753594 4360885696 breakthrough_dqn.py:112] [6000] Mean episode rewards [-1.  1.]
W1016 13:10:52.762625 4360885696 deprecation.py:323] From ~/anaconda3/envs/openspiel/lib/python3.7/site-packages/tensorflow/python/training/saver.py:960: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
I1016 13:11:04.019631 4360885696 breakthrough_dqn.py:112] [7000] Mean episode rewards [-1.  1.]
I1016 13:11:21.075985 4360885696 breakthrough_dqn.py:112] [8000] Mean episode rewards [ 1. -1.]
I1016 13:11:37.402832 4360885696 breakthrough_dqn.py:112] [9000] Mean episode rewards [-1.  1.]
I1016 13:11:52.738426 4360885696 breakthrough_dqn.py:112] [10000] Mean episode rewards [ 1. -1.]
I1016 13:12:11.506350 4360885696 breakthrough_dqn.py:112] [11000] Mean episode rewards [ 1. -1.]
I1016 13:12:34.547983 4360885696 breakthrough_dqn.py:112] [12000] Mean episode rewards [ 1. -1.]
I1016 13:12:53.641427 4360885696 breakthrough_dqn.py:112] [13000] Mean episode rewards [-1.  1.]
I1016 13:13:09.464611 4360885696 breakthrough_dqn.py:112] [14000] Mean episode rewards [-1.  1.]
I1016 13:13:31.431576 4360885696 breakthrough_dqn.py:112] [15000] Mean episode rewards [-1.  1.]
I1016 13:13:52.370819 4360885696 breakthrough_dqn.py:112] [16000] Mean episode rewards [ 1. -1.]
I1016 13:14:11.459871 4360885696 breakthrough_dqn.py:112] [17000] Mean episode rewards [-1.  1.]
I1016 13:14:33.822476 4360885696 breakthrough_dqn.py:112] [18000] Mean episode rewards [-1.  1.]
I1016 13:14:54.612814 4360885696 breakthrough_dqn.py:112] [19000] Mean episode rewards [-1.  1.]
I1016 13:15:16.479357 4360885696 breakthrough_dqn.py:112] [20000] Mean episode rewards [ 1. -1.]
...

In other words, one agent wins all 1000 games in each evaluation. Is this expected? I was wondering whether this is simply greedy behavior because is_evaluation=True. If so, can anyone provide a rigorous explanation?
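
For context, my modified evaluation boils down to a self-play loop roughly like this (a sketch; the helper name is mine and the real breakthrough_dqn.py differs in the details):

import numpy as np

def eval_agents_vs_each_other(env, trained_agents, num_episodes):
  """Plays the trained agents against each other (no random opponents)."""
  num_players = len(trained_agents)
  sum_episode_rewards = np.zeros(num_players)
  for _ in range(num_episodes):
    time_step = env.reset()
    episode_rewards = np.zeros(num_players)
    while not time_step.last():
      player_id = time_step.observations["current_player"]
      # is_evaluation=True: the agent acts greedily and does no learning update.
      agent_output = trained_agents[player_id].step(time_step, is_evaluation=True)
      time_step = env.step([agent_output.action])
      episode_rewards += np.asarray(time_step.rewards)
    sum_episode_rewards += episode_rewards
  return sum_episode_rewards / num_episodes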

lanctot commented 4 years ago

Hi @tcfuji, yes, it would be because is_evaluation=True. DQN is fully deterministic in this case, since its epsilon is set to zero:

https://github.com/deepmind/open_spiel/blob/05f860a68db7821ccbad370ccd90a9825e5b1b3d/open_spiel/python/algorithms/dqn.py#L324

So each agent is maximizing over its individual Q-network. Since Breakthrough has no randomness in the environment and the Q-networks are not being changed during the evaluation, every episode will be exactly the same over the 1000 evaluation games and hence give you exactly the same return: zero variance! :)
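
To make the determinism concrete, here is a sketch of epsilon-greedy action selection (an illustration of the idea, not the actual code in dqn.py):

import numpy as np

def epsilon_greedy(q_values, legal_actions, epsilon, rng=np.random):
  """Sketch of epsilon-greedy selection restricted to legal actions."""
  legal_actions = np.asarray(legal_actions)
  if rng.rand() < epsilon:
    return int(rng.choice(legal_actions))        # explore: random legal action
  legal_q = np.asarray(q_values)[legal_actions]
  return int(legal_actions[np.argmax(legal_q)])  # exploit: deterministic argmax

# With epsilon = 0.0 (the is_evaluation=True setting), the explore branch never
# fires, so a frozen Q-network in a deterministic game replays the same episode.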

jamesdfrost commented 4 years ago

It's an interesting point: how do you evaluate the best agent accurately in this case? Would you retain some element of randomness in play through a positive value of epsilon? I've spent a lot of time playing with RL on backgammon, which doesn't suffer from this issue because of the randomness introduced by the dice, so you can easily evaluate fully greedy agent performance.
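
One pragmatic option, since the stock agent is fully greedy at evaluation, is to inject a little randomness around the greedy step yourself. A rough sketch (the helper name and eval_epsilon are mine, not part of the example):

import numpy as np

def noisy_eval_step(agents, time_step, eval_epsilon, rng=np.random):
  """Greedy step for the current player, but with probability eval_epsilon
  play a uniformly random legal move instead."""
  player_id = time_step.observations["current_player"]
  output = agents[player_id].step(time_step, is_evaluation=True)
  if rng.rand() < eval_epsilon:
    legal = time_step.observations["legal_actions"][player_id]
    return rng.choice(legal)
  return output.action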

lanctot commented 4 years ago

There's no one clear answer; proper evaluation in multiagent RL is very difficult in general. The easiest thing is to compare against a fixed reference player (such as uniform random). Another easy option is to checkpoint your agents every so often and compare against them all (e.g., expected utility when playing against a uniform distribution over all previous checkpoints).
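
A minimal sketch of the checkpoint-pool idea, assuming you keep a list of earlier frozen copies of the agent around (the helper name is mine):

import random

def eval_vs_checkpoint_pool(env, agent, checkpoint_pool, num_episodes):
  """Average return for `agent` (seated as player 0) against opponents drawn
  uniformly from `checkpoint_pool`. For a fairer estimate you would also
  alternate seats between episodes."""
  total = 0.0
  for _ in range(num_episodes):
    players = [agent, random.choice(checkpoint_pool)]  # uniform over checkpoints
    time_step = env.reset()
    while not time_step.last():
      pid = time_step.observations["current_player"]
      out = players[pid].step(time_step, is_evaluation=True)
      time_step = env.step([out.action])
    total += time_step.rewards[0]
  return total / num_episodes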

Specifically for games such as Breakthrough, you can just compute Elo by playing tournaments among all the previous checkpoints. It should mostly work because Breakthrough is in the right class of games for that metric (two-player, perfect information), but we showed that even with agents based on learned value functions and search it can still be non-transitive (see https://arxiv.org/abs/1803.06376). Still, it should be a good enough indicator of general progress.
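
For example, a plain Elo tournament over checkpoints could look like this (standard Elo update; play_game is a hypothetical callback that plays one game between two checkpoints, e.g. with a loop like the one above):

import itertools

def update_elo(rating_a, rating_b, score_a, k=32.0):
  """One standard Elo update; score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
  expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
  delta = k * (score_a - expected_a)
  return rating_a + delta, rating_b - delta

def round_robin_elo(checkpoint_names, play_game, games_per_pair=100):
  """Elo ratings from a round-robin tournament among checkpoints.
  `play_game(a, b)` is assumed to return a's score as 1.0 / 0.5 / 0.0."""
  ratings = {name: 1000.0 for name in checkpoint_names}
  for a, b in itertools.combinations(checkpoint_names, 2):
    for _ in range(games_per_pair):
      ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], play_game(a, b))
  return ratings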

Nash averaging from Balduzzi et al. is even better. You can also do a lot of other things with the checkpoints, such as empirical game-theoretic analysis, looking at the learning dynamics based only on pairwise utilities (see https://arxiv.org/abs/1803.06376 and https://arxiv.org/abs/1909.09849), or alpha-rank (https://arxiv.org/abs/1903.01373). It's very much still an active research area. :)
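
For reference, a rough sketch of the Nash-averaging idea on an empirical payoff matrix (just an illustration using scipy, not the OpenSpiel code; it returns whichever Nash equilibrium the LP solver finds rather than the maximum-entropy one the paper uses):

import numpy as np
from scipy.optimize import linprog

def nash_averaging_scores(payoffs):
  """`payoffs[i, j]` is agent i's average score against agent j (antisymmetric,
  e.g. win rate minus loss rate). Solves the zero-sum meta-game for a Nash
  mixture p, then rates each agent by its expected payoff against p."""
  n = payoffs.shape[0]
  # Variables: p_1..p_n (the mixture) and v (the game value); minimize -v.
  c = np.concatenate([np.zeros(n), [-1.0]])
  # For every opponent column j: v - sum_i p_i * payoffs[i, j] <= 0.
  a_ub = np.hstack([-payoffs.T, np.ones((n, 1))])
  b_ub = np.zeros(n)
  a_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
  b_eq = np.array([1.0])
  bounds = [(0.0, 1.0)] * n + [(None, None)]
  res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq, bounds=bounds)
  p = res.x[:n]
  return payoffs @ p  # each agent's skill against the Nash mixture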

tcfuji commented 4 years ago

@lanctot I'm kicking myself for not thinking of that simple explanation. Also, I appreciate the articles you mentioned. Any other thoughts on this discussion would be great, but feel free to close the issue whenever you want.

Thank you!

lanctot commented 4 years ago

Don't kick yourself: you suggested it quite precisely with your is_evaluation=True comment... I just knew exactly where to look to prove you right! :)