Open xihuai18 opened 4 years ago
The keys and values contain agent i's action, since their input is `sa_encoding`, yet the selector uses only observations as input; I can't understand why. I also can't understand the purpose of `s_encoding`, because the paper only uses `sa_encoding`, never `s_encoding`.
I have the same question; have you figured it out? What I also want to know is: in PPO, when estimating the advantage function, do we only need state information and not action information? If so, can we use `s_encoding` without `sa_encoding`?
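For what it's worth, here is a minimal sketch of why advantage estimation in PPO only needs a state-value function: generalized advantage estimation (GAE) is built entirely from rewards and $V(s)$ bootstraps, with no action-conditioned $Q$ term. The function name and the toy numbers below are my own, not from this repo.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation.

    `values` holds V(s_t) for each step plus one bootstrap value at the
    end, so len(values) == len(rewards) + 1. Note that only state values
    appear here -- no action input is needed anywhere.
    """
    advantages = []
    adv = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: depends on r_t, V(s_t), V(s_{t+1}) only
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        adv = delta + gamma * lam * adv
        advantages.append(adv)
    advantages.reverse()
    return advantages

print(gae_advantages([1.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0))
# → [2.0, 1.0]
```

So in principle a state-only encoding suffices for the advantage baseline; whether that maps onto this repo's `s_encoding` path is exactly the question above.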
I wonder how the gradient backpropagates from Q to $a_i$. Tracing from Q: https://github.com/shariqiqbal2810/MAAC/blob/105d60ede9a3d935fcc82bcb644421626b5d6493/utils/critics.py#L149-L150

Then tracing `critic_in`: https://github.com/shariqiqbal2810/MAAC/blob/105d60ede9a3d935fcc82bcb644421626b5d6493/utils/critics.py#L148

Since `s_encoding` doesn't take $a_i$ as input, I then traced `other_all_values[i]`: https://github.com/shariqiqbal2810/MAAC/blob/105d60ede9a3d935fcc82bcb644421626b5d6493/utils/critics.py#L125-L141

`keys` and `values` don't contain agent i's action as input, and `selector` uses only observations as input: https://github.com/shariqiqbal2810/MAAC/blob/105d60ede9a3d935fcc82bcb644421626b5d6493/utils/critics.py#L118-L119

So, is there any gradient from Q to action $a_i$?