peter1591 / hearthstone-ai

A Hearthstone AI based on Monte Carlo tree search and neural nets written in modern C++.

Rethink the way to process invalid actions #45

Closed peter1591 closed 7 years ago

peter1591 commented 7 years ago

In some states, only a subset of the actions is valid:

  1. Minions cannot attack (just summoned, or already attacked)
  2. No hand card can be played (not enough resources, no required target)
  3. etc.

In the current implementation:

  1. All actions are numbered from 1
  2. Invalid actions are pre-filtered out as much as possible
  3. The remaining (hopefully valid) actions are re-numbered from 1

Since a policy network might be used later, this re-numbering could become a problem: a network with a fixed number of outputs needs a stable action numbering.

Some thoughts:

  1. Do not re-number the valid actions. Just filter out an invalid action when it is picked (see the sketch after this list).
  2. Make state::State support finding valid actions more thoroughly.
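A rough sketch of what "do not re-number" could look like. The names here (Action, ActionType, the is_valid callback) are illustrative placeholders, not the actual classes in this repo:

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <random>
#include <vector>

enum class ActionType { PlayHandCard, MinionAttack, HeroPower, EndTurn };

struct Action {
    ActionType type;
    std::size_t slot;  // hand slot or minion slot; meaning depends on type
};

// Pick an action by its fixed global numbering; reject and retry when an
// invalid action is picked, instead of re-numbering the valid subset from 1.
// `is_valid` stands in for the deeper validity support in state::State.
// Assumes all_actions is non-empty.
std::optional<Action> PickAction(
    const std::vector<Action>& all_actions,
    const std::function<bool(const Action&)>& is_valid,
    std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> dist(0, all_actions.size() - 1);
    for (int attempt = 0; attempt < 100; ++attempt) {
        const Action& candidate = all_actions[dist(rng)];
        if (is_valid(candidate)) return candidate;  // filter only on pick-up
    }
    return std::nullopt;  // caller can fall back to end-turn
}
```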
peter1591 commented 7 years ago

The policy network might need a different design than AlphaGo's.

In the game of Go, if we define an action as "put a stone on grid XXX", then an action has a similar impact even across different game states.

However, in Hearthstone, if we define an action as "play the 3rd hand card", such an action has a dramatically different impact across different game states.

In this sense, a better action definition might be "play the card with id CARD_ID_XXX"; then such an action has a similar impact across different game states.

But, this introduces some issues:

  1. The possible actions are too many, since they include all playable cards.
    • The branching factor of Go is at most 19*19 = 361. There are roughly 2000 playable cards.
    • AlphaGo trained a policy network to predict expert moves.
    • So, we need to narrow the possible actions down to the most promising ones.
    • If the policy network produces a value for each possible action, the final layer of the neural network consists of that many nodes.
    • Any performance issue?
    • Will the neural network be easy to train? Will it generalize well?
  2. Only a small subset of those possible actions is actually valid.
    • Valid cards are those held in hand, with enough resources to play them.
    • Maybe we can apply an action filter on top of the policy network.
    • But we can only apply this filter after the policy network, since the policy network has a fixed number of output neurons.
    • We apply the filter to those outputs, so we can choose the most promising action from the available actions (see the sketch after this list).
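A rough sketch of such a filter over a fixed-size output layer. kNumCards and the validity mask are illustrative, not actual code from this repo:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kNumCards = 2000;  // roughly all playable cards

// policy_output: one score per card id, produced by the network.
// valid: true only for cards held in hand with enough resources to play them.
// Assumes valid.size() >= policy_output.size().
// Returns the most promising *valid* card id, or kNumCards if none is valid.
std::size_t MostPromisingValidCard(const std::vector<float>& policy_output,
                                   const std::vector<bool>& valid) {
    std::size_t best = kNumCards;
    float best_score = 0.0f;
    for (std::size_t card_id = 0; card_id < policy_output.size(); ++card_id) {
        if (!valid[card_id]) continue;  // mask applied after the network
        if (best == kNumCards || policy_output[card_id] > best_score) {
            best = card_id;
            best_score = policy_output[card_id];
        }
    }
    return best;
}
```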

Thus, the policy network might need a total re-design.

But even in this situation, a value network might still work.

peter1591 commented 7 years ago

Regarding the value network:

But, if we use a value network to quickly judge a board situation, we can defer those detailed game interactions to the MCTS search phase. This is somewhat similar to the fast rollout policy used in AlphaGo.
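A rough sketch of what that leaf evaluation could look like. The names and the mixing weight are illustrative; AlphaGo mixed its value network with fast rollouts in a similar way:

```cpp
#include <functional>

struct Board { /* encoded board features, placeholder */ };

// value_net: estimated win probability for the side to move, in [0, 1].
// rollout:   a cheap simulation policy played out to the end of the game.
// lambda:    mixing weight; lambda = 0 uses the value network alone.
double EvaluateLeaf(const Board& leaf,
                    const std::function<double(const Board&)>& value_net,
                    const std::function<double(const Board&)>& rollout,
                    double lambda = 0.5) {
    double v = value_net(leaf);  // quick judgement, no detailed interactions
    if (lambda <= 0.0) return v;
    return (1.0 - lambda) * v + lambda * rollout(leaf);  // backed up by MCTS
}
```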

peter1591 commented 7 years ago

If we use a value network to generate a promising action, this action should not be mapped again through the valid-action-analyzer. It should be passed to the board directly. So, a re-mapping of the actions might be all right.

peter1591 commented 7 years ago

There's a valid-action-getter that records the valid actions. We should pass it to the simulation policy, so the neural network (or something else) can take advantage of it.
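A rough sketch of such an interface. The class and method names here are illustrative placeholders, not the actual ones in this repo:

```cpp
#include <cstddef>
#include <vector>

class ValidActionGetter {
 public:
    // Records which action ids are currently valid for the board.
    void Record(std::size_t action_id) { valid_ids_.push_back(action_id); }
    const std::vector<std::size_t>& GetValidActionIds() const { return valid_ids_; }

 private:
    std::vector<std::size_t> valid_ids_;
};

class SimulationPolicy {
 public:
    // The policy only considers actions the getter reports as valid,
    // so a neural network can score exactly that subset.
    virtual std::size_t ChooseAction(const ValidActionGetter& getter) = 0;
    virtual ~SimulationPolicy() = default;
};

// Example: a trivial policy that just takes the first recorded valid action.
// Assumes at least one valid action was recorded.
class FirstValidPolicy : public SimulationPolicy {
 public:
    std::size_t ChooseAction(const ValidActionGetter& getter) override {
        return getter.GetValidActionIds().front();
    }
};
```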