openai / maddpg

Code for the MADDPG algorithm from the paper "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments"
https://arxiv.org/pdf/1706.02275.pdf
MIT License

Two problems about the update function #32

Open YuanyeMa opened 5 years ago

YuanyeMa commented 5 years ago

I have two problems about the update function after reading your code. Could anyone explain them to me? I would really appreciate it. First, I can't understand what role the variable "num_sample" plays when training the q network:

# train q network
num_sample = 1
target_q = 0.0
for i in range(num_sample):
    target_act_next_n = [agents[i].p_debug['target_act'](obs_next_n[i]) for i in range(self.n)]
    target_q_next = self.q_debug['target_q_values'](*(obs_next_n + target_act_next_n))
    target_q += rew + self.args.gamma * (1.0 - done) * target_q_next
target_q /= num_sample

Second, why should the loss of p be loss = pg_loss + p_reg * 1e-3, and what role does p_reg play in the loss?

Ah31 commented 5 years ago

p_reg regularizes the p_values obtained from the following:
p = p_func(p_input, int(act_pdtype_n[p_index].param_shape()[0]), scope="p_func", num_units=num_units). Notice that these p_values are also used to compute the pg_loss.

I don't know yet why they used num_sample, because setting it to 1 does not seem to do anything useful.
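
For what it's worth, here is a minimal NumPy sketch (mine, not the repository code) of how such a regularizer enters the actor loss. Only loss = pg_loss + p_reg * 1e-3 and the role of p_values are taken from the discussion above; the function name, the reg_coef argument, and the example shapes are made up, and the assumption that p_reg is the mean squared magnitude of the raw policy outputs is mine.

import numpy as np

# Sketch: a small L2 penalty on the raw policy outputs (p_values), added to
# the policy-gradient loss, as in loss = pg_loss + p_reg * 1e-3.
def actor_loss(q_values, p_values, reg_coef=1e-3):
    pg_loss = -np.mean(q_values)          # maximize the critic's estimate
    p_reg = np.mean(np.square(p_values))  # keep the raw policy outputs small
    return pg_loss + reg_coef * p_reg

# Toy numbers: critic values for a batch and raw policy-network outputs.
q_values = np.array([1.2, 0.8, 1.0])
p_values = np.random.randn(3, 5)
print(actor_loss(q_values, p_values))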

KK666-AI commented 4 years ago

@kevin-y-ma @Ah31 Could you reproduce the results? Also, it seems num_sample = 1 is a bug? I think it should be num_sample = len(obs_next_n).

Justin-Yuan commented 4 years ago

I think it's there to take the expectation of the Bellman error target, since you need to marginalize over next actions when evaluating the next q value.
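
To make that concrete, here is a small standalone sketch (mine, not the repository code) of how averaging num_sample Monte Carlo draws of the target action would approximate that expectation; sample_target_q_next is a hypothetical callable standing in for sampling target actions and evaluating the target critic, and the numbers are made up.

import numpy as np

# Sketch: with a stochastic target policy, the Bellman target is an
# expectation over next actions; averaging num_sample draws approximates it,
# and num_sample = 1 reduces to the single-sample estimate in the code above.
def bellman_target(rew, done, gamma, sample_target_q_next, num_sample=1):
    target_q = 0.0
    for _ in range(num_sample):
        target_q += rew + gamma * (1.0 - done) * sample_target_q_next()
    return target_q / num_sample

# Toy example: a noisy target critic whose true value is 10.
rng = np.random.default_rng(0)
print(bellman_target(rew=1.0, done=0.0, gamma=0.95,
                     sample_target_q_next=lambda: 10.0 + rng.normal(0.0, 0.5),
                     num_sample=8))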

YuanBoXie commented 2 years ago

> @kevin-y-ma @Ah31 Could you reproduce the results? Also, it seems num_sample = 1 is a bug? I think it should be num_sample = len(obs_next_n).

Sorry, I didn't notice it when I first read the code. Now I see the problem: in this code there are two variables with the same name i. num_sample = 1 means experience replay uses only 1 sample.

YuanBoXie commented 2 years ago

> p_reg regularizes the p_values obtained from the following: p = p_func(p_input, int(act_pdtype_n[p_index].param_shape()[0]), scope="p_func", num_units=num_units). Notice that these p_values are also used to compute the pg_loss.
>
> I don't know yet why they used num_sample, because setting it to 1 does not seem to do anything useful.

There are two variables, both named i.
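
A tiny standalone sketch (not the repository code) of that name reuse, with made-up observations and n standing in for self.n:

# The list comprehension reuses the name i, but it iterates over agents,
# not over num_sample; in Python 3 the comprehension's i has its own scope
# and does not overwrite the outer loop's i.
num_sample = 1
n = 3                                   # stand-in for self.n (number of agents)
obs_next_n = [[0.1], [0.2], [0.3]]      # made-up next observations

for i in range(num_sample):             # outer i: sample index
    target_act_next_n = [obs_next_n[i] for i in range(n)]   # inner i: agent index
    print("outer i =", i, "collected", len(target_act_next_n), "items")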