Closed: jeanibarz closed this issue 3 years ago
Dear Shariq,
I have the same question as jeanibarz.
Also, in Figure 1 the 2nd MLP layer receives the output of the 1st MLP; however, in your code I assume the 2nd MLP layer is what is called self.critics:
The input of self.critics is an encoding of the observations only (called s_encodings in the code) plus the attention outputs (other_all_values), rather than the output of the 1st MLP layer:
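To make this concrete, here is how I picture the input of self.critics (a rough sketch with made-up shapes, not the actual code from the repo):

```python
import torch
import torch.nn as nn

# Hypothetical shapes, just to illustrate what I think self.critics receives.
batch, hidden_dim, attend_heads, n_actions = 32, 128, 4, 5

s_encoding = torch.randn(batch, hidden_dim)                    # state-only embedding of agent i
other_values = [torch.randn(batch, hidden_dim // attend_heads)
                for _ in range(attend_heads)]                  # attention outputs, one per head

# My guess at what self.critics[i] does: map (state embedding, attention outputs) to one Q per action.
critic_i = nn.Sequential(
    nn.Linear(2 * hidden_dim, hidden_dim),
    nn.LeakyReLU(),
    nn.Linear(hidden_dim, n_actions),
)

critic_in = torch.cat((s_encoding, *other_values), dim=1)      # shape: (batch, 2 * hidden_dim)
all_q = critic_i(critic_in)                                    # one Q-value per possible action
```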
Is my understanding correct?
Thanks!
By analyzing the code, I intuitively came to the following conclusion...
Each agent has a unique encoding function, state_encoder, which, contrary to what is said in the paper, produces an embedding e_i based only on the state: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L52-L59
However, as explained in the paper, we also need the embeddings of the other agents, e_j; these are computed with self.critic_encoders, which take both the state and the action of each agent and are shared: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L35-L44
Later, in the forward function, the current agent's own key and value are excluded, so each agent only attends to the others: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L128-L130
Thus, judging only from the code, I would say that Figure 1 is not completely accurate, @jeanibarz.
Moreover, @yuchen-x, I think you are right: the 2nd MLP is self.critics. The input of this second MLP is consistent with the paper, as the s_encodings correspond to the e_i values and the *other_all_values correspond to the x_i: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L111-L112
That is, state_encoder corresponds to the first MLP and its output would be e_i, which only takes the state into account, while self.critic_encoders take both state and action into account and are used to obtain the e_j.
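Putting my reading together, the structure I am describing looks roughly like the following simplified, single-head sketch (my own paraphrase with made-up dimensions, not the actual repo code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_agents, batch, obs_dim, act_dim, hidden = 3, 32, 10, 5, 64

# Per-agent, NOT shared: state-only encoder -> e_i, used to build agent i's query.
state_encoders = nn.ModuleList([nn.Linear(obs_dim, hidden) for _ in range(n_agents)])
# Shared across agents: state-action encoder -> the embeddings used for keys/values (the e_j).
critic_encoder = nn.Linear(obs_dim + act_dim, hidden)
# Shared attention projections (one head for simplicity).
key_proj, sel_proj, val_proj = (nn.Linear(hidden, hidden, bias=False) for _ in range(3))

obs = [torch.randn(batch, obs_dim) for _ in range(n_agents)]
acts = [F.one_hot(torch.randint(act_dim, (batch,)), act_dim).float() for _ in range(n_agents)]

s_enc = [state_encoders[i](obs[i]) for i in range(n_agents)]            # state-only embeddings
sa_enc = [critic_encoder(torch.cat((obs[j], acts[j]), dim=1)) for j in range(n_agents)]

i = 0                                                    # current agent
query = sel_proj(s_enc[i])                               # query from the state-only embedding
# The current agent's own key/value are excluded: agent i attends only to the other agents.
keys = torch.stack([key_proj(sa_enc[j]) for j in range(n_agents) if j != i], dim=1)
vals = torch.stack([val_proj(sa_enc[j]) for j in range(n_agents) if j != i], dim=1)

attn = F.softmax(torch.bmm(keys, query.unsqueeze(2)) / hidden ** 0.5, dim=1)  # (batch, n-1, 1)
x_i = (vals * attn).sum(dim=1)                           # attention output x_i for agent i
```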
This is not definitive evidence, just my own conclusion after taking a look at the code and the explanation in the paper.
Hi all, sorry for the confusion. The deviation of the code from the figure is described in the section of the paper entitled "Multi-Agent Advantage Function." Essentially, we want to calculate a Q-value for each possible action such that we can compute an advantage function, so we remove actions from the input and feed the state alone to a separate encoder from which we compute queries. I guess my intention was that the simplified version (with no advantage function) was easier to understand visually in a figure, but I realize how that can be misleading. Hope this clears things up!
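In code terms, the idea is roughly the following (a simplified illustration of the computation, not an excerpt from the repository):

```python
import torch
import torch.nn.functional as F

# Toy example: a batch of 32 samples, agent i has 5 discrete actions.
batch, n_actions = 32, 5

all_q = torch.randn(batch, n_actions)                    # Q_i(o, (a_i', a_-i)) for every a_i'
probs = F.softmax(torch.randn(batch, n_actions), dim=1)  # pi_i(a_i' | o_i) from agent i's policy
taken = torch.randint(n_actions, (batch, 1))             # the action agent i actually took

q_taken = all_q.gather(1, taken)                         # Q_i(o, a)
baseline = (probs * all_q).sum(dim=1, keepdim=True)      # expectation over agent i's own policy
advantage = q_taken - baseline                           # A_i(o, a)
```

Because the queries come from the state alone, all of agent i's Q-values can be read off in a single forward pass, which is what makes this baseline cheap to compute.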
Dear Shariq,
In your article "Actor-Attention-Critic for Multi-Agent Reinforcement Learning", figure 1., there is a MLP, which is said to be unique per agent according to the legend (blue background color), that realize a state-action encoding from (o_i, a_i) to e_i.
I believed this corresponds to what is named "critic_encoders" in your code. However, these encoders are listed in the shared_modules here: https://github.com/shariqiqbal2810/MAAC/blob/6174a01251251e6778c4ada26bc8d9cd930e3856/utils/critics.py#L73
Is it normal ? From my understanding, the consequence from being listed in shared_modules is that the gradients are scaled by (1/agents_numbers), so I believe this would have only minor consequences on the behavior of the algorithm.
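For clarity, my understanding of that scaling is something along these lines (my own reconstruction of the effect, not a quote from your code):

```python
import torch
import torch.nn as nn

# Parameters of modules listed in shared_modules accumulate gradients from every
# agent's critic loss, so they are rescaled once by 1 / n_agents before the
# optimizer step (a reconstruction of the effect, not the repo code).
n_agents = 4
shared_encoder = nn.Linear(16, 32)            # stands in for one shared critic encoder

loss = sum(shared_encoder(torch.randn(8, 16)).pow(2).mean() for _ in range(n_agents))
loss.backward()                               # gradients from all n_agents terms accumulate

for p in shared_encoder.parameters():         # the rescaling applied only to shared modules
    p.grad.data.mul_(1.0 / n_agents)
```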
Best regards