veviurko closed this issue 4 years ago
Thanks for the report and the clear reproduction! We agree the behaviour isn't right and we are discussing how best to fix it. This may take us a couple of days, but we'll be as quick as we can.
On Mon, Sep 23, 2019 at 3:47 PM griprox notifications@github.com wrote:
Hello, I was checking the policy gradient algorithm on the HLE. Firstly, I think there is something wrong with how Hanabi handles rewards (or I am missing something). I ran this simple code (similar to the PG-for-poker example) and looked at how the agents save rewards. I set discount = 0, so agent._dataset['returns'] is a list of immediate rewards.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

from open_spiel.python import policy
from open_spiel.python import rl_environment
from open_spiel.python.algorithms import policy_gradient

game = "hanabi"
num_players = 2
discount = 0
env_configs = {"players": num_players, 'max_life_tokens': 1, 'colors': 2,
               'ranks': 5, 'hand_size': 2, 'max_information_tokens': 3,
               'discount': discount}
env = rl_environment.Environment(game, **env_configs)
info_state_size = env.observation_spec()["info_state"][0]
num_actions = env.action_spec()["num_actions"]

with tf.Session() as sess:
    agents = [policy_gradient.PolicyGradient(sess, idx, info_state_size,
                                             num_actions,
                                             hidden_layers_sizes=(128,))
              for idx in range(num_players)]
    sess.run(tf.global_variables_initializer())
    for ep in range(1):
        time_step = env.reset()
        while not time_step.last():
            player_id = time_step.observations["current_player"]
            print('Player %d' % player_id)
            fireworks_before = env._state.observation()[30:47]
            agent_output = agents[player_id].step(time_step)
            action_list = [agent_output.action]
            time_step = env.step(action_list)
            if not time_step.last():
                fireworks_after = env._state.observation()[30:47]
            else:
                fireworks_after = 'lost'
            print(fireworks_before, '-->', fireworks_after, '\n')
        for agent in agents:
            agent.step(time_step)
    print('\n')
    print('Agent 0 rewards history:')
    print(agents[0]._dataset['returns'])
    print('Agent 1 rewards history:')
    print(agents[1]._dataset['returns'])
This is an example of weird output.
P0 gets 0 points, then P1 gets 1 point, and then P0 loses. As I understand it, P0 should have rewards [1, -1], which is correct. However, P1 should have [0], since he scored a point and right after that P0 lost it.
Player 0 Fireworks: R0 Y0 --> Fireworks: R0 Y0
Player 1 Fireworks: R0 Y0 --> Fireworks: R0 Y1
Player 0 Fireworks: R0 Y1 --> lost
Agent 0 rewards history: [1.0, -1.0]
Agent 1 rewards history: [-1.0]
My second question concerns self-play. I would like to run a PG agent in self-play, where it plays with itself (not with a similar agent). All examples I found in open_spiel have several copies of agents playing with each other and learning separately. As I understand it, using one agent to play with itself in the current implementation will not work, since the episode experience of the different players will mix up.
Thank you!
Our chosen solution is to require agent.step() to be called for every environment step on which there may be a reward, not just the ones at which it is the agent's turn to play. This requires a slight modification to the agent implementation, coming later today.
Hello, it seems that your solution is still not working. After a few tests I got the following example:

Player 0 Fireworks: R0 Y0 --> Fireworks: R0 Y0
Player 1 Fireworks: R0 Y0 --> Fireworks: R0 Y0
Player 0 Fireworks: R0 Y0 --> Fireworks: R0 Y1
Player 1 Fireworks: R0 Y1 --> Fireworks: R0 Y1
Player 0 Fireworks: R0 Y1 --> Fireworks: R0 Y1
Player 1 Fireworks: R0 Y1 --> Fireworks: R0 Y1
Player 0 Fireworks: R0 Y1 --> lost
Agent 0 rewards history: [0.0, 0.0, 0.0, -1.0]
Agent 1 rewards history: [1.0, 0.0, -1.0]
In general, the player who plays a card does not save the resulting reward into his buffer. Also, the new implementation of PolicyGradient contains a bug: line 256 is
self.player_id == time_step.current_player()):
and it raises an error saying that time_step does not have a .current_player() method. I guess the line should be
self.player_id == time_step.observations['current_player']):
Did you update your main loop to call step() for every player on every timestep? E.g.:

agent_output = [agent.step(time_step) for agent in agents]
time_step = env.step([agent_output[current_player].action])
Are you sure you have updated the whole repository? current_player has been added here: https://github.com/deepmind/open_spiel/blob/71d76631feb2e62ee1b90421acf981ee11a28fce/open_spiel/python/rl_environment.py#L97
You are right, I did not update the main loop. However, now I did, and there is another problem. I am running the code below, and it looks like the agent now saves a reward on every step it takes -- even on those where its action was None.
info_state_size = env.observation_spec()["info_state"][0]
num_actions = env.action_spec()["num_actions"]

with tf.Session() as sess:
    agents = [policy_gradient.PolicyGradient(sess, player_id=player_id,
                                             info_state_size=info_state_size,
                                             num_actions=num_actions,
                                             hidden_layers_sizes=[8])
              for player_id in range(num_players)]
    sess.run(tf.global_variables_initializer())
    time_step = env.reset()
    while not time_step.last():
        current_player = time_step.observations["current_player"]
        fireworks_before = env._state.observation()[30:47]
        agent_output = [agent.step(time_step) for agent in agents]
        time_step = env.step([agent_output[current_player].action])
        if not time_step.last():
            fireworks_after = env._state.observation()[30:47]
        else:
            fireworks_after = 'lost'
        print(fireworks_before, '-->', fireworks_after, '\n')
    for agent in agents:
        agent.step(time_step)
    print('\n')
    print('Agent 0 rewards history:')
    print(agents[0]._dataset['returns'])
    print(agents[0]._dataset['actions'])
    print('Agent 1 rewards history:')
    print(agents[1]._dataset['returns'])
    print(agents[1]._dataset['actions'])
Fireworks: R0 Y0 --> Fireworks: R0 Y0
Fireworks: R0 Y0 --> Fireworks: R0 Y0
Fireworks: R0 Y0 --> Fireworks: R0 Y0
Fireworks: R0 Y0 --> Fireworks: R1 Y0
Fireworks: R1 Y0 --> Fireworks: R1 Y1
Fireworks: R1 Y1 --> lost
Agent 0 rewards history: [0.0, 0.0, 0.0, 1.0, 1.0, -2.0] [5, None, 4, None, 3, None]
Agent 1 rewards history: [0.0, 0.0, 0.0, 1.0, 1.0, -2.0] [None, 6, None, 2, None, 3]
Great! Glad it's now running okay. The behaviour you describe is working as intended. Both agents see the same (complete) reward history. Does this cause a problem?
It seems very weird. Maybe I misunderstand how you approach learning in multiplayer games in general. As far as I understand, there are two options:

1. Each player is essentially an individual network and learns only according to its own experience. I thought that this is the case in open_spiel.
2. There is one network for all players, and each player just collects a batch of experience. Then all players' experience is used to make a gradient step.

As I understood, the original Hanabi paper used the 2nd approach (and github/HLE actually has an example of a Rainbow agent, which supports my idea), but I think the 1st is also plausible.

Now, it seems that what the agent saves in its experience is not what it should. For an agent, all other players are basically part of the environment, so its interaction with it looks as follows:

-- The agent makes an action (not a None action, the real one)
-- Say, the immediate reward after the action is r0
-- If the episode is terminated, then the agent just saves the reward
-- If the episode is not terminated, the agent waits and accumulates the rewards
-- Each next player i makes an action and receives an immediate reward r_i. The original agent 0's reward r0 is then changed as r0 += r_i
-- If the episode finishes after player i's action, then the original agent 0 saves the accumulated reward
-- If the episode does not finish before agent 0's next turn, then it accumulates the rewards from all other agents and saves the total at the beginning of its next turn

Currently, the agent just saves all differences in the game's score and I do not really understand why. Also, learning does not work with None actions. To check it, you can change batch_size to 2, for example, in policy_gradient_test. Correct me if I misunderstand something,

I replied a bit longer on the reddit thread that I won't copy here, but basically these are two separate paradigms (both are used in different contexts in MARL).
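The turn-based bookkeeping described above can be sketched with a small helper. This is illustrative only; `TurnRewardBuffer` is a hypothetical class, not part of open_spiel:

```python
class TurnRewardBuffer:
    """Sketch of per-player reward accumulation: each player banks the
    rewards that accrued between its own moves, saving the sum either at
    its next move or at the end of the episode."""

    def __init__(self, num_players):
        # None until the player has made its first move.
        self.acc = [None] * num_players
        self.saved = [[] for _ in range(num_players)]

    def record(self, acting_player, immediate_reward):
        # The acting player saves whatever it accumulated since its last
        # move, then starts a fresh accumulator for this move.
        if self.acc[acting_player] is not None:
            self.saved[acting_player].append(self.acc[acting_player])
        self.acc[acting_player] = 0.0
        # Every player who has already moved accrues this transition's reward.
        for p in range(len(self.acc)):
            if self.acc[p] is not None:
                self.acc[p] += immediate_reward

    def finish(self):
        # At episode end, flush all outstanding accumulators.
        for p in range(len(self.acc)):
            if self.acc[p] is not None:
                self.saved[p].append(self.acc[p])
```

On the three-step example from earlier in the thread (P0 scores nothing, P1 scores a point, P0's move loses a point), this yields [1.0, -1.0] for player 0 and [0.0] for player 1, matching the expectation stated there.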
These are the two known paradigms in the field of multiagent RL (MARL): (1) is known as independent RL and (2) is known as self-play learning. (1) is perhaps more common in the MARL literature generally, whereas (2), the setup the Hanabi paper took (like AlphaZero and TD-Gammon before it), is more common in games.
The RL agents, such as policy gradient and DQN, interact through the rl_environment (this is partly for historical reasons). When you do that, you normally assume the independent-RL setup, because each player is a separate agent and each thinks of the game from its own perspective (treating everything else, including the other players, as the environment).
Currently, the agent just saves all differences in the game's score and I do not really understand why. Also, learning does not work with None actions. To check it, you can change batch_size to 2, for example, in policy_gradient_test.
The definition of the reward was a choice that we made in the Hanabi Challenge paper and corresponding Hanabi Learning Environment. The reason is that the environment should give a reward signal when a card is scored (since the agent knows this as it happens).
Correct me if I misunderstand something,
You're not wrong. The example you listed exactly highlights the difference between independent RL and self-play. It is certainly a different setting. To get self-play, you should be able to simply sample data from a single policy, then loop through all the players and train a single model on the data from all of them.
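The data-pooling idea could look roughly like the following sketch. This assumes interfaces, not a fixed open_spiel API: `env` follows the `rl_environment` conventions used elsewhere in this thread (`time_step.last()`, `.observations`, `.rewards`, `env.step([action])`), and `policy` is any callable mapping an info state to an action:

```python
def collect_selfplay_episode(env, policy):
    """Play one episode with `policy` acting for every seat and return
    (info_state, action, reward) tuples pooled across all players, so a
    single shared model can be trained on everyone's experience."""
    batch = []
    time_step = env.reset()
    while not time_step.last():
        player = time_step.observations["current_player"]
        info_state = time_step.observations["info_state"][player]
        action = policy(info_state)
        time_step = env.step([action])
        # Record the acting player's immediate reward for this transition.
        batch.append((info_state, action, time_step.rewards[player]))
    return batch
```

A training step would then update one shared model on the whole pooled batch, instead of keeping a separate buffer per seat.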
Hi @griprox , here is an alternative solution you might consider to get self-play working, which involves only a couple of changes to the main loop: the learning agent calls .step as usual, while the other agents should use .step with is_evaluation=True, so they don't update their networks or buffers. Some considerations:
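Concretely, that loop might be sketched as follows, assuming the open_spiel RL-agent convention that `step(time_step, is_evaluation=True)` acts without updating networks or buffers; `learner_id` is an assumed variable naming the single training seat:

```python
def selfplay_episode(env, agents, learner_id):
    """Run one episode in which only agents[learner_id] trains; the
    other seats act in evaluation mode (no network/buffer updates)."""
    time_step = env.reset()
    while not time_step.last():
        current_player = time_step.observations["current_player"]
        # Every agent steps on every timestep so no rewards are missed,
        # but only the learner updates its networks and buffers.
        outputs = [
            agent.step(time_step, is_evaluation=(pid != learner_id))
            for pid, agent in enumerate(agents)
        ]
        time_step = env.step([outputs[current_player].action])
    # Terminal step so the learner records the final reward.
    for pid, agent in enumerate(agents):
        agent.step(time_step, is_evaluation=(pid != learner_id))
```

If all seats share one set of network weights, this gives self-play with a single learner; rotating `learner_id` across episodes is one way to gather experience from every seat.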
@fzvinicius @lanctot Thank you very much for your responses! I am currently playing with a self-play setup in my own implementation, but I hope to try independent agents using the open_spiel implementation soon.
Hi @griprox , did you have any success in training A2C agents in a self-play setting on Hanabi, either in your own implementation or using Open Spiel? If so, do you mind sharing the results?
Hi @lanctot , has anyone trained A2C in a self-play setting on Hanabi using Open Spiel? If so, what kind of scores should one expect, and after how many steps? I have a version that's running but doesn't seem to be learning anything at all.
Hi @akileshbadrinaaraayanan , my apologies for the delayed reply -- I got the notification while I was on vacation and then completely forgot about it.
As far as I know, nobody has been able to get this to work using vanilla A2C. But I only know of one other person who tried, from a conversation on a reddit thread (unless that was you?), and as far as I remember they experienced similar trouble.
I think for actor-critic to work in Hanabi you probably need more than just vanilla A2C: a decent network architecture, probably recurrent, and population-based training. Take a look at the results in our Hanabi Challenge paper, e.g. regarding ACHA: https://arxiv.org/pdf/1902.00506.pdf; the population-based training made a huge difference (but even the basic architecture was recurrent and trained on batches as well, and in addition it used V-trace and batched A2C from IMPALA). ACHA was later outperformed by a distributed recurrent DQN in the SAD paper: https://arxiv.org/pdf/1912.02288.pdf, but similarly there seem to be a lot of enhancements (distributed replay, recurrent nets, auxiliary tasks, double Q-learning, dueling networks, etc.).
Have you tried just standard DQN? When you say "not learning anything", do you mean like... 0ish points, 0-5 points, 5-10 points, or <= 15 points?
Ah, the person from the reddit thread was indeed @griprox from this thread. :)