The policy network might need a different design than AlphaGo's.
In the game of Go, if we define an action as "place a stone on a given grid point", then the action has a similar impact even across different game states.
However, in Hearthstone, if we define an action as "play the 3rd card in hand", the same action has a dramatically different impact in different game states.
In this sense, a better action definition might be "play the card with id CARD_ID_XXX"; then such an action has a similar impact across different game states.
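As a rough illustration of the two encodings (the struct and field names below are hypothetical, not the engine's actual types):

```cpp
#include <cstdint>

// Encoding by hand position: the same index means very different things in
// different game states ("the 3rd hand card" could be any card).
struct PlayHandSlotAction {
    int hand_index;  // e.g. 2 == play the 3rd card in hand
};

// Encoding by card id: the action keeps a similar meaning across game states,
// since a given card id always refers to the same card.
struct PlayCardIdAction {
    std::uint32_t card_id;  // e.g. the id of one specific card
};
```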
But this introduces some issues, for example that only a subset of the card-id actions is valid in any given state.
Thus, the policy network might need a total re-design (see the sketch below).
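One way to picture the problem: a policy head over card ids covers every card in the game, while only a few of them are playable in any given state, so the invalid ids have to be masked out before sampling. A minimal sketch, assuming a plain logits vector indexed by card id (nothing here comes from the existing code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Turn per-card-id logits into a probability distribution over the valid
// card ids only; invalid ids keep probability zero.
std::vector<double> MaskedSoftmax(const std::vector<double>& logits,
                                  const std::vector<bool>& is_valid) {
    std::vector<double> probs(logits.size(), 0.0);
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        if (!is_valid[i]) continue;        // mask out card ids not playable now
        probs[i] = std::exp(logits[i]);
        sum += probs[i];
    }
    if (sum > 0.0) {
        for (double& p : probs) p /= sum;  // normalize over the valid ids only
    }
    return probs;
}
```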
But even under this situation, a value network might still do.
As for the value network:
But if we use a value network to quickly judge a board situation, we can defer those detailed game interactions to the MCTS search phase. This is somewhat similar to the fast rollout policy used in AlphaGo.
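A minimal sketch of that idea, with `Board` and the value-network callable as hypothetical placeholders: the MCTS simulation step scores the leaf position directly instead of playing the game out, and the detailed interactions are handled by the tree search itself.

```cpp
#include <functional>

struct Board { /* game state; placeholder */ };

// A value network reduced to a callable that returns an estimated win
// probability in [0, 1] for the side to move.
using ValueNet = std::function<double(const Board&)>;

// MCTS simulation step: judge the leaf position with the value network instead
// of running a full playout; selection/expansion/backpropagation are unchanged.
double EvaluateLeaf(const Board& board, const ValueNet& value_net) {
    return value_net(board);
}
```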
If we use a value network to generate a promising action, this action should not be mapped again through the valid-action-analyzer; it should be passed to the board directly. So a re-mapping of the actions might be alright.
There's a valid-action-getter that records the valid actions. We should pass it to the simulation policy, so the neural network (or something else) can take advantage of it, as in the sketch below.
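A sketch of that wiring, with all names hypothetical: the valid actions recorded by the valid-action-getter are handed to the simulation policy, which only ever chooses among them (here uniformly at random; a neural network could score them instead).

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct Action { int id; };  // placeholder action type

class SimulationPolicy {
 public:
    // Choose one of the valid actions recorded by the valid-action-getter.
    // Assumes valid_actions is non-empty. A neural network could rank the
    // candidates here; this sketch just picks uniformly at random.
    Action Choose(const std::vector<Action>& valid_actions) {
        std::uniform_int_distribution<std::size_t> dist(
            0, valid_actions.size() - 1);
        return valid_actions[dist(rng_)];
    }

 private:
    std::mt19937 rng_{std::random_device{}()};
};
```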
In some states, only a subset of the actions is valid.
In the current implementation
Since later, a policy network might be used to
Some thoughts