Closed: tcfuji closed this issue 4 years ago.
Hi @tcfuji , yes it would be because is_evaluation=True
. DQN is fully deterministic in this case, since its epsilon is set to zero:
So each agent is maximizing over its individual Q-network. Since breakthrough has no randomness in the environment and the Q-networks are not being changed during evaluation, every episode will play out exactly the same across the 1000 evaluation games and hence give you the exact same return: zero variance! :)
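To see why zero epsilon means zero variance, here is a minimal sketch of epsilon-greedy action selection (not the actual OpenSpiel DQN code, just an illustration): when epsilon is 0, the random branch is never taken and the argmax is the same every time.

```python
import random

def select_action(q_values, legal_actions, epsilon, rng=random):
    """Epsilon-greedy over legal actions.

    With epsilon == 0 (as in is_evaluation=True), the random branch is
    never taken, so the same state always yields the same greedy action.
    """
    if epsilon > 0 and rng.random() < epsilon:
        return rng.choice(legal_actions)
    # Fully greedy: pick the legal action with the highest Q-value.
    return max(legal_actions, key=lambda a: q_values[a])
```

With fixed Q-values and a deterministic game, every episode is therefore an identical trajectory.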
It's an interesting point - how do you evaluate the best agent accurately in this case? Would you retain some element of randomness in play through a positive value of epsilon? I've spent a lot of time playing with RL on backgammon, which doesn't suffer from this issue due to the randomness introduced by the dice, so you can easily evaluate fully greedy agent performance.
There's no one clear answer; proper evaluation in multiagent RL is very difficult in general. The easiest thing is to compare against a fixed reference player (such as uniform random). Another easy one is to checkpoint your agents every so often and compare against them all (e.g., expected utility when playing against a uniform distribution over all previous checkpoints).
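The checkpoint idea can be sketched in a few lines. Here `play_match` is a hypothetical callback (not an OpenSpiel function) that plays one game and returns the focal agent's return:

```python
def expected_utility_vs_checkpoints(play_match, agent, checkpoints,
                                    games_per_opponent=100):
    """Average return of `agent` against a uniform distribution over checkpoints.

    `play_match(agent, opponent)` is assumed to play a single game and
    return `agent`'s utility; `checkpoints` is a list of saved opponents.
    """
    per_opponent = []
    for opp in checkpoints:
        total = sum(play_match(agent, opp) for _ in range(games_per_opponent))
        per_opponent.append(total / games_per_opponent)
    # Uniform mixture over opponents: just the mean of the per-opponent means.
    return sum(per_opponent) / len(per_opponent)
```

A rising value over training is a rough but useful progress signal, even without a fixed external reference player.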
Specifically for games such as breakthrough, you can just compute Elo by playing tournaments among all the previous checkpoints. It should mostly work because breakthrough is in the right class of games for that metric (two-player perfect information), but we showed that when using agents based on learned value functions and search it can still be non-transitive (see https://arxiv.org/abs/1803.06376). But this should still be a good enough indicator of general progress.
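For reference, the standard Elo update applied after each tournament game looks like this (a generic sketch, not tied to any particular library):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update after a game between players a and b.

    `score_a` is 1.0 for an a-win, 0.5 for a draw, 0.0 for a loss.
    Returns the new (rating_a, rating_b) pair.
    """
    # Expected score of a under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Running this over a round-robin among checkpoints gives a single rating per checkpoint, which is where the non-transitivity caveat above matters: a scalar rating cannot capture rock-paper-scissors-like cycles between agents.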
Nash averaging from Balduzzi et al. is even better. You can also do a lot of other things between the checkpoints like empirical game-theoretic analysis, look at the learning dynamics based only on pairwise utility (see https://arxiv.org/abs/1803.06376 and https://arxiv.org/abs/1909.09849), or alpha-rank (https://arxiv.org/abs/1903.01373). It's very much still an active research area. :)
@lanctot I'm kicking myself for not thinking of that simple explanation. Also, I appreciate the articles you mentioned. Any other thoughts on this discussion would be great, but feel free to close the issue whenever you want.
Thank you!
Don't kick yourself; you suggested it quite precisely with your is_evaluation=True
comment... I just knew exactly where to look to prove you right! :)
Hi,
This might be more of an RL theory question, but I came across some interesting behavior after making slight adjustments to
open_spiel/python/examples/breakthrough_dqn.py
. I wanted to see how well the two DQN agents played against each other during evaluation, so I edited eval_against_random_bots
so that they only faced each other instead of random agents (by commenting out lines 56 and 57). I'm relatively new to RL, so it was difficult for me to explain the output beyond handwaving. In other words, one agent is winning all 1000 games during each evaluation. Is this behavior expected? I was wondering if this is expected greedy behavior since
is_evaluation=True
. If so, can anyone provide a rigorous explanation for this?