henrycharlesworth / big2_PPOalgorithm

Application of the proximal policy optimization algorithm to the card game Big 2 using TensorFlow
72 stars, 28 forks

Why does avail_pi equal pi + available_moves? #4

Closed: boscotsang closed this issue 6 years ago

boscotsang commented 6 years ago

As far as I understand, available_moves is a mask where 1 identifies the moves that are available and 0 identifies the unavailable ones. In that case (pi + available_moves) just adds 1 to the logits of the available moves, which cannot prevent unavailable moves from being sampled. Besides, the unavailable moves would also be involved in the optimization process. Did I understand correctly? Thank you!

henrycharlesworth commented 6 years ago

Yeah, I can see why this is confusing. The way I've defined available_moves, it is a mask which takes the value 0 if the move is available, and -infinity if it isn't. There's a comment in PPONetwork.py: "#available_moves takes form [0, 0, -inf, 0, -inf...], 0 if action is available, -inf if not." So pi + available_moves leaves the logits of available moves unchanged and sets the others to -inf. The move is then selected with a softmax, which means anything with -inf has zero probability. Hopefully this makes sense now!
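To make the difference between the two mask conventions concrete, here is a minimal sketch in plain Python rather than TensorFlow. It is not the code from PPONetwork.py: the helper `masked_softmax` and the example logits are hypothetical, chosen only to illustrate the 0 / -inf mask described above next to the 0 / 1 mask assumed in the question.

```python
import math

NEG_INF = float("-inf")


def masked_softmax(logits, mask):
    """Softmax over logits + mask (hypothetical helper, not from the repo).

    Assumes at least one entry of the result is finite, i.e. at least
    one move is available.
    """
    masked = [logit + m for logit, m in zip(logits, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw logits for four moves; move 2 is unavailable.
pi = [1.0, 2.0, 0.5, -1.0]

# Convention described in the reply: 0 if available, -inf if not.
available_moves = [0.0, 0.0, NEG_INF, 0.0]
probs = masked_softmax(pi, available_moves)

# Convention assumed in the question: 1 if available, 0 if not.
binary_mask = [1.0, 1.0, 0.0, 1.0]
probs_binary = masked_softmax(pi, binary_mask)

print(probs[2])             # exactly zero: the move can never be sampled
print(probs_binary[2] > 0)  # still positive: the move could be sampled
```

With the 0 / -inf mask the unavailable move gets probability exactly zero, while the 0 / 1 mask only shifts the logits and leaves it with a small positive probability, which is the problem the question was worried about.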

boscotsang commented 6 years ago

Thanks for your reply. Sorry for not noticing the comment.