YuanyeMa opened this issue 5 years ago
`p_reg` is regularizing `p_values`, which are obtained from the following:

```python
p = p_func(p_input, int(act_pdtype_n[p_index].param_shape()[0]), scope="p_func", num_units=num_units)
```

You can see that `p_values` are then used to compute `pg_loss`. I don't know yet why they used `num_sample`, because setting it to 1 does not seem to do anything useful.
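For context, here is a condensed sketch of how that loss is put together in `p_train`. This is reconstructed from memory of the openai/maddpg trainer, so treat the intermediate lines as approximate; only the `p = p_func(...)` line and `loss = pg_loss + p_reg * 1e-3` are quoted in this thread.

```python
import tensorflow as tf  # TF1.x-style graph code, as in the repo

# p parameterizes agent p_index's action distribution
p = p_func(p_input, int(act_pdtype_n[p_index].param_shape()[0]),
           scope="p_func", num_units=num_units)
act_pd = act_pdtype_n[p_index].pdfromflat(p)

# p_reg is an L2 penalty on the raw distribution parameters (keeps logits small)
p_reg = tf.reduce_mean(tf.square(act_pd.flatparam()))

# the agent's freshly sampled action is fed into the centralized critic
act_input_n[p_index] = act_pd.sample()
q = q_func(tf.concat(obs_ph_n + act_input_n, 1), 1,
           scope="q_func", reuse=True, num_units=num_units)[:, 0]

pg_loss = -tf.reduce_mean(q)   # maximize Q by minimizing -Q
loss = pg_loss + p_reg * 1e-3  # small weight, so the penalty only nudges the policy
```

Under this reading, `p_reg` just discourages the policy network's outputs (the action-distribution parameters) from blowing up while `pg_loss` pushes Q upward.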
@kevin-y-ma @Ah31 Could you reproduce the results? Also, it seems `num_sample = 1` is a bug? I think it should be `num_sample = len(obs_next_n)`?
I think it's there to take the expectation in the Bellman error target, since you need to marginalize over next actions when evaluating the next Q-value.
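To make that concrete, here is a minimal, self-contained sketch of the idea. This is not the repo's code; `target_policies` and `target_q` are hypothetical callables standing in for the target networks.

```python
def td_target(rew, done, obs_next_n, target_policies, target_q,
              gamma=0.95, num_sample=1):
    """Monte Carlo estimate of r + gamma * E_{a' ~ pi'}[ Q'(o', a') ].

    Each draw of next actions from the (stochastic) target policies gives one
    sample of the next Q-value; averaging over num_sample draws marginalizes
    over a'. With num_sample = 1 you get a single, higher-variance sample,
    which is why setting it to 1 looks like it does nothing.
    """
    target = 0.0
    for _ in range(num_sample):
        # one joint next action: each agent samples from its own target policy
        act_next_n = [pi(obs) for pi, obs in zip(target_policies, obs_next_n)]
        target += rew + gamma * (1.0 - done) * target_q(obs_next_n, act_next_n)
    return target / num_sample
```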
> @kevin-y-ma @Ah31 Could you reproduce the results? Also, it seems `num_sample = 1` is a bug? I think it should be `num_sample = len(obs_next_n)`?
Sorry, I didn't catch it when I first read the code. Now I see the problem: in this code there is a second variable also named `i`, as shown in the sketch below.
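For reference, the target-computation loop in question looks roughly like this (reconstructed from memory of the openai/maddpg update code, so details may differ). Note that the `i` inside the list comprehension is a different variable that indexes agents, not Monte Carlo samples:

```python
num_sample = 1
target_q = 0.0
for i in range(num_sample):  # outer i: which Monte Carlo sample
    # the comprehension's i shadows the outer one and indexes agents instead
    target_act_next_n = [agents[i].p_debug['target_act'](obs_next_n[i])
                         for i in range(self.n)]
    target_q_next = self.q_debug['target_q_values'](*(obs_next_n + target_act_next_n))
    target_q += rew + self.args.gamma * (1.0 - done) * target_q_next
target_q /= num_sample
```

So `num_sample = len(obs_next_n)` would conflate the number of averaging draws with the number of agents; the two `i`s are unrelated.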
`num_sample = 1` means the experience replay uses only 1 sample.
> `p_reg` is regularizing `p_values`, which are obtained from the following: `p = p_func(p_input, int(act_pdtype_n[p_index].param_shape()[0]), scope="p_func", num_units=num_units)`. You can see that `p_values` are then used to compute `pg_loss`. I don't know yet why they used `num_sample`, because setting it to 1 does not seem to do anything useful.
There are two variables both named `i`.
After reading your code I have two questions about the update function, and I would really appreciate it if anyone could explain them. First, I can't understand what role the variable `num_sample` plays when training the Q network. Second, why should the loss of p be `loss = pg_loss + p_reg * 1e-3`, and what role does `p_reg` play in the loss?