shariqiqbal2810 / MAAC

Code for "Actor-Attention-Critic for Multi-Agent Reinforcement Learning" ICML 2019
MIT License

How does the gradient back-propagate from Q to the action $a_i$? #26

Open xihuai18 opened 4 years ago

xihuai18 commented 4 years ago

I wonder how the gradient back-propagates from Q to $a_i$.

Tracing from Q:
https://github.com/shariqiqbal2810/MAAC/blob/105d60ede9a3d935fcc82bcb644421626b5d6493/utils/critics.py#L149-L150

Then tracing `critic_in`:
https://github.com/shariqiqbal2810/MAAC/blob/105d60ede9a3d935fcc82bcb644421626b5d6493/utils/critics.py#L148

Since `s_encoding` does not contain any input from $a_i$, I then traced `other_all_values[i]`:
https://github.com/shariqiqbal2810/MAAC/blob/105d60ede9a3d935fcc82bcb644421626b5d6493/utils/critics.py#L125-L141

The keys and values do not take agent $i$'s action as input, and the selector uses only observations as input:
https://github.com/shariqiqbal2810/MAAC/blob/105d60ede9a3d935fcc82bcb644421626b5d6493/utils/critics.py#L118-L119

So, is there a gradient from Q to the action $a_i$?
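For reference, a minimal sketch of the data flow traced above. The names `s_encoding`, `sa_encoding`, keys/values/selector mirror the linked code; everything else (module names, sizes, a single attention head) is an illustrative assumption, not the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only; dimensions and module names are assumptions.
obs_dim, act_dim, hid, n_agents = 8, 4, 32, 3

state_encoders = nn.ModuleList([nn.Linear(obs_dim, hid) for _ in range(n_agents)])            # -> s_encoding (obs only)
critic_encoders = nn.ModuleList([nn.Linear(obs_dim + act_dim, hid) for _ in range(n_agents)]) # -> sa_encoding (obs + action)
key_proj, val_proj, sel_proj = nn.Linear(hid, hid), nn.Linear(hid, hid), nn.Linear(hid, hid)
critic_heads = nn.ModuleList([nn.Linear(2 * hid, act_dim) for _ in range(n_agents)])          # outputs all_q per agent

obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]
acs = [F.one_hot(torch.randint(act_dim, (1,)), act_dim).float() for _ in range(n_agents)]

s_enc = [state_encoders[i](obs[i]) for i in range(n_agents)]                                   # obs only
sa_enc = [critic_encoders[i](torch.cat([obs[i], acs[i]], dim=1)) for i in range(n_agents)]     # obs + action

i = 0  # compute the value for agent 0
query = sel_proj(s_enc[i])                                                                     # selector: obs only
keys = torch.stack([key_proj(sa_enc[j]) for j in range(n_agents) if j != i], dim=1)            # other agents' (o_j, a_j)
vals = torch.stack([val_proj(sa_enc[j]) for j in range(n_agents) if j != i], dim=1)
attn = F.softmax(torch.bmm(keys, query.unsqueeze(2)).squeeze(2) / hid ** 0.5, dim=1)
other_values = (attn.unsqueeze(2) * vals).sum(dim=1)                                           # other_all_values[i]

critic_in = torch.cat([s_enc[i], other_values], dim=1)        # a_i does not enter critic_in
all_q = critic_heads[i](critic_in)                            # one Q estimate per possible action of agent i
q = all_q.gather(1, acs[i].argmax(dim=1, keepdim=True))       # the chosen action selects an entry of all_q
                                                              # (no gradient flows through a gather index)
```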

DokinCui commented 1 year ago

The keys and values do contain agent $i$'s action, since their input is `sa_encoding`, but the selector uses only observations as input; I can't understand this. I also can't understand the role of `s_encoding`, because only `sa_encoding` appears in the paper, not `s_encoding`.
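Loosely, the structure in the code can be summarized as follows, where $g^{s}$, $g^{sa}$, and the projection matrices are illustrative stand-ins for the state encoders, critic encoders, and key/value/selector extractors (this is my notation, not the paper's):

$$q_i = W_q\, g^{s}_i(o_i), \qquad k_j = W_k\, g^{sa}_j(o_j, a_j), \qquad v_j = W_v\, g^{sa}_j(o_j, a_j), \qquad x_i = \sum_{j \neq i} \operatorname{softmax}_j\!\left(q_i^{\top} k_j\right) v_j$$

So when computing agent $i$'s value, the keys and values carry the *other* agents' actions through `sa_encoding`, while agent $i$'s own query comes from `s_encoding` only, which seems to be what the two comments above are each observing from a different side.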

zhl606 commented 1 year ago

> The keys and values do contain agent $i$'s action, since their input is `sa_encoding`, but the selector uses only observations as input; I can't understand this. I also can't understand the role of `s_encoding`, because only `sa_encoding` appears in the paper, not `s_encoding`.

I also have the same question; have you figured it out? What I also want to know is: in the PPO algorithm, when estimating the advantage function, do we only need state information and not action information, so that we could use `s_encoding` without `sa_encoding`?
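For reference, in standard PPO the advantage is built from a state-only value function $V(s)$ via GAE, so no action input is needed there. A minimal sketch (the network sizes and names below are my own assumptions, not from this repo):

```python
import torch
import torch.nn as nn

# State-only value network: V(s) takes observations, never actions (assumed sizes).
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))

def gae_advantages(rewards, obs, next_obs, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: only V(s) is required, not Q(s, a)."""
    with torch.no_grad():
        values = value_net(obs).squeeze(-1)            # V(s_t)
        next_values = value_net(next_obs).squeeze(-1)  # V(s_{t+1})
    deltas = rewards + gamma * (1 - dones) * next_values - values  # TD residuals
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros(())
    for t in reversed(range(len(rewards))):            # backward recursion over the trajectory
        gae = deltas[t] + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    return advantages, advantages + values             # advantages and value targets
```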