mokemokechicken / reversi-alpha-zero

Reversi reinforcement learning by AlphaGo Zero methods.

About MCTS #17

Open apollo-time opened 6 years ago

apollo-time commented 6 years ago

https://github.com/mokemokechicken/reversi-alpha-zero/blob/f1cfa6c7177ec5f76a89e20fd97eb4c5d678611d/src/reversi_zero/agent/player.py#L165-L168

I see that N and W are updated with a virtual loss when a node is selected, in order to discourage other threads from simultaneously exploring the identical variation (as in the paper).

  1. Why isn't Q updated with W/N at this time?
  2. Shouldn't it be W = W + virtual loss when the player is white?
  3. Why isn't the tree shared between the two players?
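
For reference, a minimal sketch of the virtual-loss bookkeeping described above, loosely following the AlphaGo Zero paper. The names here (`Node`, `apply_virtual_loss`, `revert_virtual_loss`, `n_vlos`) are illustrative, not the repository's actual code, and the sign convention assumes W is stored from black's point of view.

```python
class Node:
    def __init__(self):
        self.N = 0.0   # visit count
        self.W = 0.0   # total action value
        self.Q = 0.0   # mean action value, Q = W / N


def apply_virtual_loss(node, is_black_turn, n_vlos=3):
    """Discourage other threads from exploring the same variation (selection)."""
    node.N += n_vlos
    # Assuming W is stored from black's point of view, the virtual "loss" must
    # be subtracted on black's turn but added on white's turn (question 2).
    node.W += -n_vlos if is_black_turn else n_vlos
    # Keep Q consistent with the modified N and W (question 1).
    node.Q = node.W / node.N


def revert_virtual_loss(node, value, is_black_turn, n_vlos=3):
    """On backup, undo the virtual loss and add the real simulation result.

    `value` is assumed to already be expressed in the same perspective as W.
    """
    node.N += 1 - n_vlos
    node.W += (n_vlos if is_black_turn else -n_vlos) + value
    node.Q = node.W / node.N
```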
mokemokechicken commented 6 years ago

Hi @apollo-time

  1. Why isn't Q updated with W/N at this time?
  2. Shouldn't it be W = W + virtual loss when the player is white?

Thank you, very good point! That is a serious bug in the virtual loss handling (the virtual loss on W didn't work).

  3. Why isn't the tree shared between the two players?

Because if the models for black and white are different, the MCTS results are also different.

apollo-time commented 6 years ago

https://github.com/mokemokechicken/reversi-alpha-zero/blob/527ce6ce1b83175c8b2c34c6b51334a67b02c9b1/src/reversi_zero/worker/self_play.py#L63-L64

I see that the two players use the same model in self-play mode.

mokemokechicken commented 6 years ago

Yes, that's right. Although it is a little difficult to implement, sharing tree search results may be useful to save computation costs.
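
For illustration only, a hypothetical sketch of how one search tree could be shared by both self-play players when they use the same model. The names below (`new_stats`, `Player`, `make_self_play_players`) are made up for this sketch and do not correspond to the repository's classes.

```python
from collections import defaultdict


def new_stats():
    # Per-(state, action) statistics: visit count, total value, prior.
    return {"N": 0.0, "W": 0.0, "P": 0.0}


class Player:
    def __init__(self, model, shared_stats):
        self.model = model
        # Both players hold a reference to the SAME dict, so simulations run
        # on black's turn are still visible when white reaches the same
        # positions later in the game.
        self.stats = shared_stats


def make_self_play_players(model):
    shared_stats = defaultdict(new_stats)
    return Player(model, shared_stats), Player(model, shared_stats)
```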

gooooloo commented 6 years ago

Just for your reference, I am sharing the search tree between the 2 players; see the code here: https://github.com/gooooloo/alpha-zero-in-python/blob/master/src/reversi_zero/agent/player.py

But I don't think this makes a big difference. Many other settings are much more important, such as the number of simulations, the resignation threshold, the performance trade-off between the self/opt/eval modules, etc.

apollo-time commented 6 years ago

I see DeepMind backs up the reward to parent nodes without modification. Why not use a discount rate γ?

mokemokechicken commented 6 years ago

Why not use a discount rate γ?

It is a difficult question.

Conversely, I think the reasons to use a discount rate are:

  1. Distant steps have a weaker causal relationship with the reward.
  2. Generally, it is better to get rewards early.

Thinking that way, the reasons not to use a discount rate are:

  1. In games with perfect information like Go and Reversi, all moves are related to the final reward.
  2. There is not much benefit in winning the game quickly.
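
As a small illustration of the difference, here is a sketch (not from the repository) contrasting the undiscounted value targets used by AlphaZero-style training with a discounted variant. It simplifies by treating the final result z as already expressed from a fixed point of view.

```python
def undiscounted_targets(game_length, z):
    # Every move shares full responsibility for the final result.
    return [z for _ in range(game_length)]


def discounted_targets(game_length, z, gamma=0.99):
    # Early moves get a weaker signal: gamma ** (steps remaining to the end).
    return [z * gamma ** (game_length - 1 - t) for t in range(game_length)]


if __name__ == "__main__":
    # For a 60-move Reversi game won with z = +1:
    print(undiscounted_targets(60, 1.0)[0])   # 1.0
    print(discounted_targets(60, 1.0)[0])     # ~0.55, first move credited less
```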
apollo-time commented 6 years ago

But I think the first move is much less related to the final result than the final move is, when the game is long.

mokemokechicken commented 6 years ago

In Reversi there is essentially only one kind of first move, so it does not matter there, but in Go and Chess there is a possibility that the first move turns out to be a bad move.