shariqiqbal2810 / MAAC

Code for "Actor-Attention-Critic for Multi-Agent Reinforcement Learning" ICML 2019
MIT License

Critic encoders as shared modules? #28

Closed jeanibarz closed 3 years ago

jeanibarz commented 4 years ago

Dear Shariq,

In your article "Actor-Attention-Critic for Multi-Agent Reinforcement Learning", Figure 1 shows an MLP which, according to the legend (blue background color), is unique per agent and realizes a state-action encoding from (o_i, a_i) to e_i.

I believe this corresponds to what is named "critic_encoders" in your code. However, these encoders are listed in the shared_modules here: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L73

Is this intended? From my understanding, the consequence of being listed in shared_modules is that the gradients are scaled by 1/n_agents, so I believe this would have only minor consequences on the behavior of the algorithm.
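
For reference, the scaling I am referring to would look roughly like this (a minimal sketch of my understanding, not the actual code from this repository; the class and dimensions are made up):

```python
import torch.nn as nn

class CriticSketch(nn.Module):
    """Sketch: per-agent (o_i, a_i) encoders treated as 'shared' for gradient scaling."""

    def __init__(self, n_agents, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.n_agents = n_agents
        # one encoder per agent, mapping (o_i, a_i) to e_i
        self.critic_encoders = nn.ModuleList(
            [nn.Linear(obs_dim + act_dim, hidden_dim) for _ in range(n_agents)]
        )
        # listing them here is what puzzles me: their gradients get rescaled below
        self.shared_modules = [self.critic_encoders]

    def shared_parameters(self):
        for module in self.shared_modules:
            yield from module.parameters()

    def scale_shared_grads(self):
        # every agent's loss backpropagates into the shared modules,
        # so their accumulated gradients are divided by the number of agents
        for p in self.shared_parameters():
            if p.grad is not None:
                p.grad.data.mul_(1.0 / self.n_agents)
```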

Best regards

yuchen-x commented 3 years ago

Dear Shariq,

I have the same question as jeanibarz.

Also, in Figure 1, the 2nd MLP layer receives the output of the 1st MLP. However, in your code, I assume the 2nd MLP layer is what is called self.critics:

[Screenshot from 2020-12-18 10-31-14]

The input of self.critics is an encoding of only the observations (called s_encoding) plus other values, rather than the output of the 1st MLP layer:

[Screenshot from 2020-12-18 10-34-50]
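
In other words, what I think the 2nd MLP receives is roughly the following (my own pseudocode with an illustrative name for the attended values, not the repo's exact code):

```python
import torch

def critic_head_input(s_encoding, attended_values):
    # s_encoding:      encoding of agent i's observation only (output of the state encoder)
    # attended_values: one attended value per attention head, computed from the other agents
    # the 2nd MLP (self.critics[i]) sees their concatenation,
    # not the output of the 1st (o_i, a_i) MLP shown in Figure 1
    return torch.cat((s_encoding, *attended_values), dim=1)
```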

Is my understanding correct?

Thanks!

aklein1995 commented 3 years ago

By analyzing the code, I intuitively came to this conclusion...

Each agent has a unique encoding function, state_encoder, which, contrary to what is said in the paper, computes the embedding e_i based only on the state: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L52-L59

However, as explained in the paper, we also need the embeddings of the other agents, e_j. These are computed with self.critic_encoders, which takes both the state and the action of each agent, and which is what is listed as shared: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L35-L44

Later, in the forward function, the key and value outputs for the current agent are discarded: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L128-L130

Thus, looking only at the code, I would say that Figure 1 is not completely correct @jeanibarz

Moreover, @yuchen-x, I think you are right: the 2nd MLP is called self.critics. The input of this second MLP is correct, as s_encodings refers to those e_i values and *other_all_values refers to the x_i: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L111-L112

That is, state_encoder corresponds to the first MLP, and its output would be e_i, which only takes the state into account. self.critic_encoders takes both state and action into account, and is used to get the e_j.
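
Putting it together, this is my simplified, single-head reading of the data flow. The function, names, and the plain dot-product attention here are mine for illustration, not necessarily the repo's exact implementation:

```python
import torch
import torch.nn.functional as F

def forward_sketch(i, obs, acts, state_encoders, critic_encoders,
                   query_proj, key_proj, value_proj, critics):
    """Single-head sketch of agent i's critic forward pass, as I read the code."""
    n_agents = len(obs)
    # 1st MLP: per-agent, observation only -> s_i, projected into a query
    s_i = state_encoders[i](obs[i])
    query = query_proj(s_i)
    # (o_j, a_j) encoders -> e_j for every agent
    e = [critic_encoders[j](torch.cat((obs[j], acts[j]), dim=1)) for j in range(n_agents)]
    # keys/values only for the other agents (agent i's own key/value is discarded)
    keys = torch.stack([key_proj(e[j]) for j in range(n_agents) if j != i], dim=1)
    values = torch.stack([value_proj(e[j]) for j in range(n_agents) if j != i], dim=1)
    # scaled dot-product attention gives the contribution x_i from the other agents
    attn = F.softmax(torch.bmm(keys, query.unsqueeze(2)).squeeze(2)
                     / keys.shape[2] ** 0.5, dim=1)
    x_i = (attn.unsqueeze(2) * values).sum(dim=1)
    # 2nd MLP: Q-values from the observation encoding plus the attended values
    return critics[i](torch.cat((s_i, x_i), dim=1))
```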

This is not hard evidence, just my own conclusion after taking a look at the code and the paper's explanation.

shariqiqbal2810 commented 3 years ago

Hi all, sorry for the confusion. The deviation of the code from the figure is described in the section of the paper entitled "Multi-Agent Advantage Function." Essentially, we want to calculate a Q-value for each possible action such that we can compute an advantage function, so we remove actions from the input and feed the state alone to a separate encoder from which we compute queries. I guess my intention was that the simplified version (with no advantage function) was easier to understand visually in a figure, but I realize how that can be misleading. Hope this clears things up!
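
Concretely, the idea is something like the following (an illustrative sketch rather than the exact code in this repo; the function and argument names are made up):

```python
import torch

def multiagent_advantage(all_q, pi, a_i):
    """Sketch of the multi-agent advantage described above.

    all_q: (batch, n_actions) tensor of Q_i(o, (a_i', a_-i)) for every possible action a_i',
           obtainable in one pass because agent i's own action is not fed to its critic
    pi:    (batch, n_actions) tensor of agent i's current policy probabilities
    a_i:   (batch, 1) long tensor with the action agent i actually took
    """
    q_taken = all_q.gather(1, a_i)                    # Q_i(o, a)
    baseline = (pi * all_q).sum(dim=1, keepdim=True)  # E over a_i' ~ pi_i of Q_i(o, (a_i', a_-i))
    return q_taken - baseline                         # A_i(o, a)
```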