google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.
Apache License 2.0

Issues with Hanabi rewards and self-play. #68

Closed veviurko closed 4 years ago

veviurko commented 4 years ago

Hello, I was checking the policy gradient algorithm on HLE. Firstly, I think there is something wrong with how Hanabi handles rewards (or I am missing something). I ran this simple code (similar to the PG-for-poker example) and looked at how the agents save rewards. I set discount = 0, so agent._dataset['returns'] is a list of immediate rewards.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from open_spiel.python import policy
from open_spiel.python import rl_environment
from open_spiel.python.algorithms import policy_gradient

game = "hanabi"
num_players = 2
discount = 0
env_configs = {"players": num_players, "max_life_tokens": 1, "colors": 2,
               "ranks": 5, "hand_size": 2, "max_information_tokens": 3, "discount": discount}
env = rl_environment.Environment(game, **env_configs)
info_state_size = env.observation_spec()["info_state"][0]
num_actions = env.action_spec()["num_actions"]

with tf.Session() as sess:
    agents = [policy_gradient.PolicyGradient(sess, idx, info_state_size, num_actions, 
                                             hidden_layers_sizes=(128,)) for idx in range(num_players)]

    sess.run(tf.global_variables_initializer())
    for ep in range(1):
        time_step = env.reset()
        while not time_step.last():
            player_id = time_step.observations["current_player"]
            print('Player %d' % player_id)
            fireworks_before = env._state.observation()[30:47]  # the part of the observation showing the fireworks

            agent_output = agents[player_id].step(time_step)
            action_list = [agent_output.action]
            time_step = env.step(action_list)

            if not time_step.last():
                fireworks_after = env._state.observation()[30:47]
            else:
                fireworks_after = 'lost'

            print(fireworks_before, '-->', fireworks_after, '\n')
        for agent in agents:
            agent.step(time_step)  # let every agent process the terminal step
print('\n')
print('Agent 0 rewards history:')
print(agents[0]._dataset['returns'])
print('Agent 1 rewards history:')
print(agents[1]._dataset['returns'])

This is an example of weird output.

P0 gets 0 points, then P1 gets 1 point, and then P0 loses. As I understand it, P0 should have rewards [1, -1], which is correct. However, P1 should have [0], since it scored a point and P0 lost it right after.

Player 0 Fireworks: R0 Y0 --> Fireworks: R0 Y0

Player 1 Fireworks: R0 Y0 --> Fireworks: R0 Y1

Player 0 Fireworks: R0 Y1 --> lost

Agent 0 rewards history: [1.0, -1.0]
Agent 1 rewards history: [-1.0]

My second question concerns self-play. I would like to run a PG agent in self-play, where it plays with itself (not with a similar agent). All the examples I found in open_spiel have several copies of agents playing with each other and learning separately. As I understand it, using one agent to play with itself will not work in the current implementation, since the episode experience of the different players will get mixed up.

elkhrt commented 4 years ago

Thanks for the report and the clear reproduction! We agree the behaviour isn't right and we are discussing how best to fix it. This may take us a couple of days, but we'll be as quick as we can.

veviurko commented 4 years ago

Thanks for the report and the clear reproduction! We agree the behaviour isn't right and we are discussing how best to fix it. This may take us a couple of days, but we'll be as quick as we can.

Thank you!

elkhrt commented 4 years ago

Our chosen solution is to require agent.step() to be called for every environment step on which there may be a reward, not just the ones at which it is the agent's turn to play. This requires a slight modification to the agent implementation, coming later today.
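For concreteness, a minimal sketch of that calling pattern, reusing the env and agents from the snippet above:

time_step = env.reset()
while not time_step.last():
    current_player = time_step.observations["current_player"]
    # Every agent steps on every environment step, so each one observes every
    # reward; only the current player's chosen action is passed to the environment.
    outputs = [agent.step(time_step) for agent in agents]
    time_step = env.step([outputs[current_player].action])
for agent in agents:
    agent.step(time_step)  # terminal step, so the final rewards are recorded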

veviurko commented 4 years ago

Our chosen solution is to require agent.step() to be called for every environment step on which there may be a reward, not just the ones at which it is the agent's turn to play. This requires a slight modification to the agent implementation, coming later today.

Hello, it seems that your solution is still not working. After a few tests I got the following example:

Player 0 Fireworks: R0 Y0 --> Fireworks: R0 Y0

Player 1 Fireworks: R0 Y0 --> Fireworks: R0 Y0

Player 0 Fireworks: R0 Y0 --> Fireworks: R0 Y1

Player 1 Fireworks: R0 Y1 --> Fireworks: R0 Y1

Player 0 Fireworks: R0 Y1 --> Fireworks: R0 Y1

Player 1 Fireworks: R0 Y1 --> Fireworks: R0 Y1

Player 0 Fireworks: R0 Y1 --> lost

Agent 0 rewards history: [0.0, 0.0, 0.0, -1.0]
Agent 1 rewards history: [1.0, 0.0, -1.0]

In general, the player who plays a card does not save the reward into its buffer. Also, the new implementation of PolicyGradient contains a bug: line 256 is self.player_id == time_step.current_player()): and it raises an error saying that time_step does not have a .current_player() method. I guess the line should be self.player_id == time_step.observations['current_player']):

elkhrt commented 4 years ago
  1. Did you update your main loop to call step() for every player on every timestep? e.g.

     agent_output = [agent.step(time_step) for agent in agents]
     time_step = env.step([agent_output[current_player].action])

  2. Are you sure you have updated the whole repository? current_player has been added here: https://github.com/deepmind/open_spiel/blob/71d76631feb2e62ee1b90421acf981ee11a28fce/open_spiel/python/rl_environment.py#L97

veviurko commented 4 years ago

You are right, I did not update the main loop. Now I did, and there is another problem. I am running the code below, and it looks like each agent now saves a reward at every step it takes -- even at the steps where its action was None.

# env and num_players are set up as in the first snippet above.
info_state_size = env.observation_spec()["info_state"][0]
num_actions = env.action_spec()["num_actions"]
with tf.Session() as sess:
    agents = [policy_gradient.PolicyGradient(sess, player_id=player_id,
                                             info_state_size=info_state_size,
                                             num_actions=num_actions,
                                             hidden_layers_sizes=[8])
              for player_id in range(num_players)]

    sess.run(tf.global_variables_initializer())
    time_step = env.reset()
    while not time_step.last():
        current_player = time_step.observations["current_player"]
        fireworks_before = env._state.observation()[30:47]
        agent_output = [agent.step(time_step) for agent in agents]  # every agent steps on every turn
        time_step = env.step([agent_output[current_player].action])
        if not time_step.last():
            fireworks_after = env._state.observation()[30:47]
        else:
            fireworks_after = 'lost'
        print(fireworks_before, '-->', fireworks_after, '\n')
    for agent in agents:
        agent.step(time_step)

print('\n')
print('Agent 0 rewards history:')
print(agents[0]._dataset['returns'])
print(agents[0]._dataset['actions'])
print('Agent 1 rewards history:')
print(agents[1]._dataset['returns'])
print(agents[1]._dataset['actions'])

Fireworks: R0 Y0 --> Fireworks: R0 Y0

Fireworks: R0 Y0 --> Fireworks: R0 Y0

Fireworks: R0 Y0 --> Fireworks: R0 Y0

Fireworks: R0 Y0 --> Fireworks: R1 Y0

Fireworks: R1 Y0 --> Fireworks: R1 Y1

Fireworks: R1 Y1 --> lost

Agent 0 rewards history:
[0.0, 0.0, 0.0, 1.0, 1.0, -2.0]
[5, None, 4, None, 3, None]
Agent 1 rewards history:
[0.0, 0.0, 0.0, 1.0, 1.0, -2.0]
[None, 6, None, 2, None, 3]

elkhrt commented 4 years ago

Great! Glad it's now running okay. The behaviour you describe is working as intended. Both agents see the same (complete) reward history. Does this cause a problem?

veviurko commented 4 years ago

It seems very weird. Maybe I misunderstand how you approach learning in multiplayer games in general. As far as I understand, there are two options:

  1. Each player is essentially an individual network and learns only from its own experience. I thought that this is the case in open_spiel.
  2. There is one network for all players, and each player just collects a batch of experience. Then all players' experience is used to make a gradient step. As I understood it, the original Hanabi paper (and github/HLE actually has an example of a Rainbow agent which supports my idea) used the 2nd approach, but I think the 1st is also plausible.

Now, it seems that what the agent saves in its experience is not what it should be. For an agent, all other players are basically part of the environment, so its interaction with the environment looks as follows (sketched below):

-- The agent makes an action (a real one, not a None action).
-- Say the immediate reward after the action is r0.
-- If the episode terminates, the agent just saves that reward.
-- If the episode does not terminate, the agent waits and accumulates rewards.
-- Each next player i makes an action and receives an immediate reward r_i; the original agent 0's reward is then updated as r0 += r_i.
-- If the episode finishes after player i's action, the original agent 0 saves the accumulated reward.
-- If the episode does not finish before agent 0's next turn, it accumulates the rewards from all other agents and saves the total at the beginning of its next turn.

Currently, the agent just saves all differences in the game's score and I do not really understand why. Also, learning does not work with None actions. To check it, change batch_size to 2, for example, in policy_gradient_test. Correct me if I misunderstand something.
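A rough sketch of this bookkeeping, kept outside the agents for illustration; env and num_players are the ones set up above, and some_policy is a hypothetical stand-in for however the acting player picks its move:

# Each player credits the rewards that arrive between two of its own turns
# to its previous action, as described in the list above.
acted = [False] * num_players                 # has this player acted yet?
pending = [0.0] * num_players                 # reward accumulated since its last action
credited = [[] for _ in range(num_players)]   # per-action rewards

time_step = env.reset()
while not time_step.last():
    current_player = time_step.observations["current_player"]
    if acted[current_player]:
        # Before acting again, flush the reward accumulated since the last action.
        credited[current_player].append(pending[current_player])
        pending[current_player] = 0.0
    action = some_policy(time_step, current_player)  # hypothetical policy call
    time_step = env.step([action])
    acted[current_player] = True
    for pid in range(num_players):
        pending[pid] += time_step.rewards[pid]  # rewards is a per-player list
# Flush what is left, including the terminal reward.
for pid in range(num_players):
    if acted[pid]:
        credited[pid].append(pending[pid])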

lanctot commented 4 years ago

I replied a bit longer on the reddit thread that I won't copy here, but basically these are two separate paradigms (both are used in different contexts in MARL).

1. Each player is essentially an individual network and learns only from its own experience. I thought that this is the case in open_spiel.

2. There is one network for all players, and each player just collects a batch of experience. Then all players' experience is used to make a gradient step. As I understood it, the original Hanabi paper (and github/HLE actually has an example of a Rainbow agent which supports my idea) used the 2nd approach, but I think the 1st is also plausible.

These are two known paradigms in the field of multiagent RL (MARL): (1) is known as independent RL and (2) is known as self-play learning. (1) is perhaps more common in the MARL literature in general, whereas (2), the setup the Hanabi paper took (as did AlphaZero and TD-Gammon), is more common in games.

The RL agents, such as policy gradient and DQN, interact through the rl_environment (this is partly for historical reasons). When you do that, you normally assume the independent RL setup, because each player is a separate agent and each thinks of the environment from its own personal perspective (treating everything else, including the other players, as the environment).

Currently, the agent just saves all differences in the game's score and I do not really understand why. Also, learning does not work with None actions. To check it, change batch_size to 2, for example, in policy_gradient_test.

The definition of the reward was a choice that we made in the Hanabi Challenge paper and the corresponding Hanabi Learning Environment. The reason is that the environment should give a reward signal when a card is scored (since the agent knows this as it happens).

Correct me if I misunderstand something.

You're not wrong. The example you listed exactly highlights the difference between independent RL and self-play. It is certainly a different setting. To get self-play, you should be able to simply sample data from a single policy, then go back, loop through all the players, and train a single model on the data from all of them.
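For illustration, a very rough sketch of that data flow; policy_net, sample_action, and update are hypothetical placeholders for a single shared model, its action sampler, and its training step:

# Self-play: one policy plays every seat, experience from all players goes
# into a single batch, and a single model is trained on it.
batch = []
time_step = env.reset()
while not time_step.last():
    pid = time_step.observations["current_player"]
    info_state = time_step.observations["info_state"][pid]
    legal_actions = time_step.observations["legal_actions"][pid]
    action = sample_action(policy_net, info_state, legal_actions)  # hypothetical
    time_step = env.step([action])
    # Hanabi is common-payoff, so every seat sees the same reward here.
    batch.append((info_state, action, time_step.rewards[pid]))
# Compute returns over the whole episode and take one gradient step on the
# single shared model, using data from all seats.
update(policy_net, batch)  # hypothetical training step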

fzvinicius commented 4 years ago

Hi @griprox, here is an alternative solution you might consider to get self-play working, which involves only a couple of changes to the main loop (a rough sketch follows the list):

  1. After creating the agents, sync their networks. You can use tf.assign for that.
  2. At the beginning of each episode, sample a player position to be the learner (in a 2-player game it's 0 or 1); this agent should .step as usual. The other agents should use .step with is_evaluation=True, so they don't update their networks or buffers.
  3. Periodically sync the other agents' networks with the learner network.
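A rough sketch of steps 1-3, reusing sess, env, agents, and num_players from the snippets above (inside the same tf.Session block) and assuming each PolicyGradient agent builds its variables under a distinguishable TensorFlow variable scope; the scope names "agent_0"/"agent_1" and the sync_every value below are hypothetical, so inspect tf.trainable_variables() to find the real names:

import numpy as np
import tensorflow as tf

def build_sync_ops(src_scope, dst_scope):
    # Ops that copy trainable variables from one agent's network to another's,
    # matched by sorted name; assumes the two networks are built identically.
    src_vars = sorted(tf.trainable_variables(src_scope), key=lambda v: v.name)
    dst_vars = sorted(tf.trainable_variables(dst_scope), key=lambda v: v.name)
    return [tf.assign(dst, src) for src, dst in zip(src_vars, dst_vars)]

sync_ops = {(0, 1): build_sync_ops("agent_0", "agent_1"),   # hypothetical scope names
            (1, 0): build_sync_ops("agent_1", "agent_0")}
sess.run(sync_ops[(0, 1)])  # step 1: start both agents from identical weights

sync_every = 10  # assumed hyperparameter
for ep in range(1000):  # number of episodes, arbitrary for the sketch
    learner_id = np.random.randint(num_players)  # step 2: sample which seat learns
    time_step = env.reset()
    while not time_step.last():
        current_player = time_step.observations["current_player"]
        # Only the learner updates its buffers/network; the others act frozen.
        outputs = [agent.step(time_step, is_evaluation=(pid != learner_id))
                   for pid, agent in enumerate(agents)]
        time_step = env.step([outputs[current_player].action])
    for pid, agent in enumerate(agents):  # terminal step records final rewards
        agent.step(time_step, is_evaluation=(pid != learner_id))
    if (ep + 1) % sync_every == 0:
        sess.run(sync_ops[(learner_id, 1 - learner_id)])  # step 3: copy learner -> other seat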

Some considerations:

veviurko commented 4 years ago

@fzvinicius @lanctot Thank you very much for your responses! I am currently playing with a self-play setup in my own implementation, but I hope to try independent agents using the open_spiel implementation soon.

akileshbadrinaaraayanan commented 4 years ago

Hi @griprox, did you have any success training A2C agents in a self-play setting on Hanabi, either in your own implementation or using Open Spiel? If so, do you mind sharing the results?

akileshbadrinaaraayanan commented 4 years ago

Hi @lanctot, is there someone who has trained A2C in a self-play setting on Hanabi using Open Spiel? If so, what kind of scores should one expect, and after how many steps? I have a version that's running, but it doesn't seem to be learning anything at all.

lanctot commented 3 years ago

Hi @akileshbadrinaaraayanan , my apologies for the delayed reply -- I got the notification while I was on vacation and then completely forgot about it.

As far as I know, nobody has been able to get this to work using vanilla A2C. But I only know of one other person who tried, from a conversation on a reddit thread (unless that was you?), and as far as I remember they experienced similar trouble.

I think for actor-critic to work in Hanabi, you probably need more than just vanilla A2C: a decent network architecture, probably recurrent, and population-based training. Take a look at the results in our Hanabi Challenge paper regarding ACHA, for example: https://arxiv.org/pdf/1902.00506.pdf. The population-based training made a huge difference (but even the basic architecture was recurrent and trained on batches as well; in addition it used V-trace and batched A2C from IMPALA). ACHA was later outperformed by a distributed recurrent DQN in the SAD paper: https://arxiv.org/pdf/1912.02288.pdf, but similarly there seem to be a lot of enhancements (distributed replay, recurrent nets, auxiliary tasks, double Q-learning, dueling networks, etc.).

Have you tried just standard DQN? When you say "not learning anything", do you mean like... 0ish points, 0-5 points, 5-10 points, or <= 15 points?

lanctot commented 3 years ago

Ah, the person from the reddit thread was indeed @griprox from this thread. :)