Updated thoughts on tree policy, playout policy, returned policy, and opponent policy.
With such a large state-action space, we will not start by persisting the tree between real game turns. Instead, we'll build the tree anew each turn.
Since the transition model is stochastic, the game tree is going to get very large. It is unclear how expansion should work with a stochastic transition model.
Updated thoughts on tree policy:
Including nodes that represent opponent players' turns would allow for a minimax-like policy. That could be an improvement to the policy later, but for now, the game tree will only contain nodes for the acting player's turns. The simple tree policy is then to choose max_a Q(s, a).
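A minimal sketch of that greedy tree policy, assuming the search maintains a Q estimate per legal action (`greedy_action` and `q_values` are hypothetical names, not part of the current code):

```python
from typing import Dict, Hashable

Action = Hashable  # whatever hashable representation the game uses for actions

def greedy_action(q_values: Dict[Action, float]) -> Action:
    """Pick the action with the highest estimated action-value Q(s, a)."""
    return max(q_values, key=q_values.get)
```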
Going to follow the expectimax approximation from https://arxiv.org/pdf/0909.0801.pdf to deal with stochasticity. This will require two types of nodes: decision and chance.
Decision Nodes: represent states where the acting player chooses an action; a decision node's value is the maximum of its children's values.
Chance Nodes: represent stochastic transitions (e.g. drawing from the deck); a chance node's value is the expectation over its children's values, estimated by sampling.
This will require the "flattening" of decision nodes, so that chance nodes are always children. For example, the decision node that starts a turn at "choose action" would have had "play a bird" as a decision-node child. Instead, the "play a bird" action should get flattened to "play bird i" for i in |hand|. Similarly, "draw a bird" gets flattened to "draw tray bird i" for i in |tray|, plus "draw from deck".
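A sketch of that flattening for the "choose action" node, assuming `hand` and `tray` are hypothetical sequences of the cards currently in hand and face up in the tray:

```python
def flatten_actions(hand, tray):
    """Enumerate concrete children of the "choose action" decision node."""
    actions = [("play_bird", i) for i in range(len(hand))]        # one child per bird in hand
    actions += [("draw_tray_bird", i) for i in range(len(tray))]  # one child per tray slot
    actions.append(("draw_from_deck",))                           # blind draw leads to a chance node
    return actions

# With 2 birds in hand and 3 in the tray this yields 6 concrete actions:
# [('play_bird', 0), ('play_bird', 1),
#  ('draw_tray_bird', 0), ('draw_tray_bird', 1), ('draw_tray_bird', 2),
#  ('draw_from_deck',)]
```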
There is no need to enforce that decision nodes and chance nodes strictly alternate. Traversing the tree recursively, with decision, chance, leaf, and terminal nodes each handled in their own way, will allow for any arbitrary construction of the game tree.
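A sketch of that recursive traversal under expectimax, with hypothetical node classes; in a real implementation, the unexpanded-leaf case is where expansion and a playout would happen instead of returning a default value:

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class TerminalNode:
    value: float  # final score for the acting player

@dataclass
class DecisionNode:
    children: List["Node"] = field(default_factory=list)

@dataclass
class ChanceNode:
    outcomes: List[Tuple[float, "Node"]] = field(default_factory=list)  # (probability, child) pairs

Node = Union[TerminalNode, DecisionNode, ChanceNode]

def evaluate(node: Node) -> float:
    """Recursively estimate a node's value, dispatching on node type."""
    if isinstance(node, TerminalNode):
        return node.value
    if isinstance(node, DecisionNode):
        if not node.children:
            return 0.0  # unexpanded leaf: placeholder for expansion + playout
        return max(evaluate(child) for child in node.children)
    if isinstance(node, ChanceNode):
        # expectimax: weight each stochastic outcome by its probability
        return sum(p * evaluate(child) for p, child in node.outcomes)
    raise TypeError(f"unexpected node type: {type(node)!r}")
```

Because traversal dispatches on the node's type rather than its depth, decision and chance nodes can nest in any order.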
MCTS will allow the agent to "playout" a game from the current state to generate a distribution over action-values. This will be used to generate a policy: state -> action.
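A minimal sketch of turning playout returns into action-value estimates and then into the returned policy; `legal_actions` and `simulate_playout(state, action)` are hypothetical stand-ins for the move generator and the playout policy, and a real implementation would select actions with the tree policy and grow the tree rather than sampling root actions uniformly:

```python
import random
from collections import defaultdict

def search(root_state, legal_actions, simulate_playout, n_playouts=1000):
    """Estimate Q(s, a) from playout returns and return the greedy action."""
    q = defaultdict(float)  # running-mean action-value estimate per root action
    n = defaultdict(int)    # playout count per root action
    for _ in range(n_playouts):
        action = random.choice(legal_actions)         # stand-in for the tree policy
        value = simulate_playout(root_state, action)  # stand-in for the playout policy
        n[action] += 1
        q[action] += (value - q[action]) / n[action]  # incremental mean update
    return max(legal_actions, key=lambda a: q[a])     # returned policy: argmax_a Q(s, a)
```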