Updated thoughts on tree policy, playout policy, returned policy, and opponent policy.
With such a large state-action space, we will not start by persisting the tree between real game turns. Instead, we'll build the tree anew each turn.
Since the transition model is stochastic, the game tree is going to get very large. It is unclear how expansion should work with a stochastic transition model.
Updated thoughts on tree policy:
Including nodes that represent opponent players' turns would allow for a minimax-like policy. That could be an improvement to the policy later, but for now, the game tree will only contain nodes for the acting player's turns. The simple tree policy is then to choose max_a Q(s, a).
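A minimal sketch of that greedy tree policy, assuming the search maintains a Q estimate per legal action (`greedy_action` and `q_values` are hypothetical names, not part of the current code):

```python
from typing import Dict, Hashable

Action = Hashable  # whatever hashable representation the game uses for actions

def greedy_action(q_values: Dict[Action, float]) -> Action:
    """Pick the action with the highest estimated action-value Q(s, a)."""
    return max(q_values, key=q_values.get)
```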
Going to follow the expectimax approximation from https://arxiv.org/pdf/0909.0801.pdf to deal with stochasticity. This will require two types of nodes: decision and chance.
Decision Nodes: represent states where the acting player chooses an action; a decision node's value is the maximum of its children's values.
Chance Nodes: represent stochastic transitions (e.g. drawing from the deck); a chance node's value is the expectation over its children's values, estimated by sampling.
This will require the "flattening" of decision nodes, so that chance nodes are always children. For example, the decision node that starts a turn at "choose action" would have had "play a bird" as a decision-node child. Instead, the "play a bird" action should get flattened to "play bird i" for i in |hand|. Similarly, "draw a bird" gets flattened to "draw tray bird i" for i in |tray|, plus "draw from deck".
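A sketch of that flattening for the "choose action" node, assuming `hand` and `tray` are hypothetical sequences of the cards currently in hand and face up in the tray:

```python
def flatten_actions(hand, tray):
    """Enumerate concrete children of the "choose action" decision node."""
    actions = [("play_bird", i) for i in range(len(hand))]        # one child per bird in hand
    actions += [("draw_tray_bird", i) for i in range(len(tray))]  # one child per tray slot
    actions.append(("draw_from_deck",))                           # blind draw leads to a chance node
    return actions

# With 2 birds in hand and 3 in the tray this yields 6 concrete actions:
# [('play_bird', 0), ('play_bird', 1),
#  ('draw_tray_bird', 0), ('draw_tray_bird', 1), ('draw_tray_bird', 2),
#  ('draw_from_deck',)]
```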
There is no need to enforce that decision nodes and chance nodes strictly alternate. Traversing the tree recursively, with decision, chance, leaf, and terminal nodes each handled in their own way, will allow for any arbitrary construction of the game tree.
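A sketch of that recursive traversal under expectimax, with hypothetical node classes; in a real implementation, the unexpanded-leaf case is where expansion and a playout would happen instead of returning a default value:

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class TerminalNode:
    value: float  # final score for the acting player

@dataclass
class DecisionNode:
    children: List["Node"] = field(default_factory=list)

@dataclass
class ChanceNode:
    outcomes: List[Tuple[float, "Node"]] = field(default_factory=list)  # (probability, child) pairs

Node = Union[TerminalNode, DecisionNode, ChanceNode]

def evaluate(node: Node) -> float:
    """Recursively estimate a node's value, dispatching on node type."""
    if isinstance(node, TerminalNode):
        return node.value
    if isinstance(node, DecisionNode):
        if not node.children:
            return 0.0  # unexpanded leaf: placeholder for expansion + playout
        return max(evaluate(child) for child in node.children)
    if isinstance(node, ChanceNode):
        # expectimax: weight each stochastic outcome by its probability
        return sum(p * evaluate(child) for p, child in node.outcomes)
    raise TypeError(f"unexpected node type: {type(node)!r}")
```

Because traversal dispatches on the node's type rather than its depth, decision and chance nodes can nest in any order.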
MCTS will allow the agent to "playout" a game from the current state to generate a distribution over action-values. This will be used to generate a policy: state -> action.
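A minimal sketch of turning playout returns into action-value estimates and then into the returned policy; `legal_actions` and `simulate_playout(state, action)` are hypothetical stand-ins for the move generator and the playout policy, and a real implementation would select actions with the tree policy and grow the tree rather than sampling root actions uniformly:

```python
import random
from collections import defaultdict

def search(root_state, legal_actions, simulate_playout, n_playouts=1000):
    """Estimate Q(s, a) from playout returns and return the greedy action."""
    q = defaultdict(float)  # running-mean action-value estimate per root action
    n = defaultdict(int)    # playout count per root action
    for _ in range(n_playouts):
        action = random.choice(legal_actions)         # stand-in for the tree policy
        value = simulate_playout(root_state, action)  # stand-in for the playout policy
        n[action] += 1
        q[action] += (value - q[action]) / n[action]  # incremental mean update
    return max(legal_actions, key=lambda a: q[a])     # returned policy: argmax_a Q(s, a)
```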