peter1591 / hearthstone-ai

A Hearthstone AI based on Monte Carlo tree search and neural nets written in modern C++.

Rethink the way to process invalid actions #45

Closed peter1591 closed 7 years ago

peter1591 commented 7 years ago

In some states, only a subset of the actions is valid:

  1. Minions cannot attack (just summoned, or already attacked)
  2. No hand card can be played (not enough resources, no required target)
  3. etc.

In the current implementation:

  1. All actions are numbered from 1
  2. Invalid actions are pre-filtered out as much as possible
  3. The remaining (hopefully valid) actions are re-numbered from 1

Since a policy network might be used later, this re-numbering could become a problem: a network with a fixed number of outputs needs a stable action numbering.

Some thoughts:

  1. Do not re-number the valid actions. Just filter out an invalid action when it is picked (see the sketch after this list).
  2. Make state::State support finding valid actions more thoroughly.
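A rough sketch of what "do not re-number" could look like. The names here (Action, ActionType, the is_valid callback) are illustrative placeholders, not the actual classes in this repo:

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <random>
#include <vector>

enum class ActionType { PlayHandCard, MinionAttack, HeroPower, EndTurn };

struct Action {
    ActionType type;
    std::size_t slot;  // hand slot or minion slot; meaning depends on type
};

// Pick an action by its fixed global numbering; reject and retry when an
// invalid action is picked, instead of re-numbering the valid subset from 1.
// `is_valid` stands in for the deeper validity support in state::State.
// Assumes all_actions is non-empty.
std::optional<Action> PickAction(
    const std::vector<Action>& all_actions,
    const std::function<bool(const Action&)>& is_valid,
    std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> dist(0, all_actions.size() - 1);
    for (int attempt = 0; attempt < 100; ++attempt) {
        const Action& candidate = all_actions[dist(rng)];
        if (is_valid(candidate)) return candidate;  // filter only on pick-up
    }
    return std::nullopt;  // caller can fall back to end-turn
}
```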
peter1591 commented 7 years ago

The policy network might need a different design than AlphaGo's.

In the game of Go, if we define an action as "put a stone on grid XXX", then an action has a similar impact even across different game states.

However, in Hearthstone, if we define an action as "play the 3rd hand card", such an action has a dramatically different impact across different game states.

In this sense, a better action definition might be "play the card with id CARD_ID_XXX"; then such an action has a similar impact across different game states.

But, this introduces some issues:

  1. The possible actions are too many, since they include all playable cards.
    • The branching factor of Go is at most 19*19 = 361. There are roughly 2000 playable cards.
    • AlphaGo trained a policy network to predict expert moves.
    • So, we need to narrow the possible actions down to the most promising ones.
    • If the policy network produces a value for each possible action, the final layer of the neural network consists of that many nodes.
    • Any performance issue?
    • Will the neural network be easy to train? Will it generalize well?
  2. Only a small subset of those possible actions is actually valid.
    • Valid cards are those held in hand, with enough resources to play them.
    • Maybe we can apply an action filter on top of the policy network.
    • But we can only apply this filter after the policy network, since the policy network has a fixed number of output neurons.
    • We apply the filter to those outputs, so we can choose the most promising action from the available actions (see the sketch after this list).
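A rough sketch of such a filter over a fixed-size output layer. kNumCards and the validity mask are illustrative, not actual code from this repo:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kNumCards = 2000;  // roughly all playable cards

// policy_output: one score per card id, produced by the network.
// valid: true only for cards held in hand with enough resources to play them.
// Assumes valid.size() >= policy_output.size().
// Returns the most promising *valid* card id, or kNumCards if none is valid.
std::size_t MostPromisingValidCard(const std::vector<float>& policy_output,
                                   const std::vector<bool>& valid) {
    std::size_t best = kNumCards;
    float best_score = 0.0f;
    for (std::size_t card_id = 0; card_id < policy_output.size(); ++card_id) {
        if (!valid[card_id]) continue;  // mask applied after the network
        if (best == kNumCards || policy_output[card_id] > best_score) {
            best = card_id;
            best_score = policy_output[card_id];
        }
    }
    return best;
}
```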

Thus, the policy network might need a total re-design.

But even in this situation, a value network might still work.

peter1591 commented 7 years ago

Regarding the value network:

But, if we use a value network to quickly judge a board situation, we can defer those detailed game interactions to the MCTS search phase. This is somewhat similar to the fast rollout policy used in AlphaGo.
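A rough sketch of what that leaf evaluation could look like. The names and the mixing weight are illustrative; AlphaGo mixed its value network with fast rollouts in a similar way:

```cpp
#include <functional>

struct Board { /* encoded board features, placeholder */ };

// value_net: estimated win probability for the side to move, in [0, 1].
// rollout:   a cheap simulation policy played out to the end of the game.
// lambda:    mixing weight; lambda = 0 uses the value network alone.
double EvaluateLeaf(const Board& leaf,
                    const std::function<double(const Board&)>& value_net,
                    const std::function<double(const Board&)>& rollout,
                    double lambda = 0.5) {
    double v = value_net(leaf);  // quick judgement, no detailed interactions
    if (lambda <= 0.0) return v;
    return (1.0 - lambda) * v + lambda * rollout(leaf);  // backed up by MCTS
}
```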

peter1591 commented 7 years ago

If we use a value network to generate a promising action, this action should not be mapped again through the valid-action-analyzer. It should be passed to the board directly. So, a re-mapping of the actions might be all right.

peter1591 commented 7 years ago

There's a valid-action-getter that records the valid actions. We should pass it to the simulation policy, so the neural network (or something else) can take advantage of it.
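A rough sketch of such an interface. The class and method names here are illustrative placeholders, not the actual ones in this repo:

```cpp
#include <cstddef>
#include <vector>

class ValidActionGetter {
 public:
    // Records which action ids are currently valid for the board.
    void Record(std::size_t action_id) { valid_ids_.push_back(action_id); }
    const std::vector<std::size_t>& GetValidActionIds() const { return valid_ids_; }

 private:
    std::vector<std::size_t> valid_ids_;
};

class SimulationPolicy {
 public:
    // The policy only considers actions the getter reports as valid,
    // so a neural network can score exactly that subset.
    virtual std::size_t ChooseAction(const ValidActionGetter& getter) = 0;
    virtual ~SimulationPolicy() = default;
};

// Example: a trivial policy that just takes the first recorded valid action.
// Assumes at least one valid action was recorded.
class FirstValidPolicy : public SimulationPolicy {
 public:
    std::size_t ChooseAction(const ValidActionGetter& getter) override {
        return getter.GetValidActionIds().front();
    }
};
```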