Closed boscotsang closed 5 years ago
I understand that the code is a bit hard to understand.
First of all, the derivative of 1) 0.5 * (V - R)^2
w.r.t. V is (V - R), which is exactly the derivative of 2) V * stop_gradient(V - R)
w.r.t. V, since the stop_gradient factor is treated as a constant during differentiation. So the two losses produce the same gradient.
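A quick numeric sketch of that equivalence (not code from the repo): differentiate both forms with a central finite difference, holding the stop_gradient factor fixed at its current value for form 2).

```python
import numpy as np

def squared_loss(v, r):
    # form 1): 0.5 * (V - R)^2
    return 0.5 * (v - r) ** 2

v, r, eps = 2.0, 0.5, 1e-6

# finite-difference gradient of form 1) w.r.t. V
grad_1 = (squared_loss(v + eps, r) - squared_loss(v - eps, r)) / (2 * eps)

# form 2): const = stop_gradient(V - R) is frozen, so only the
# leading V is differentiated
const = v - r
grad_2 = ((v + eps) * const - (v - eps) * const) / (2 * eps)

print(grad_1, grad_2)  # both ≈ V - R = 1.5
```

Both gradients come out to V - R, so optimizing form 2) moves V exactly as the squared loss would.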
The clipping bounds the magnitude of the gradient: since the gradient of form 2) is the stop_gradient factor itself, clipping that factor caps the gradient for large value errors, which is equivalent to using a form of Huber loss.
The use of the Huber loss is not described in the paper for brevity.
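To illustrate (a sketch; the clip range [-5, 0] is taken from the snippet quoted in this thread, not from the paper): the gradient w.r.t. V of V * stop_gradient(clip(V - R, -c, 0)) is the clipped, stopped factor itself, so its magnitude never exceeds c, the same bounded slope a Huber loss gives in its linear region.

```python
import numpy as np

def sil_value_grad(v, r, c=5.0):
    # gradient w.r.t. V of V * stop_gradient(clip(V - R, -c, 0))
    return float(np.clip(v - r, -c, 0.0))

print(sil_value_grad(1.0, 2.0))    # small error: gradient = V - R = -1.0
print(sil_value_grad(1.0, 100.0))  # huge error: gradient capped at -5.0
print(sil_value_grad(3.0, 2.0))    # V > R: clipped to 0, matching max(0, R - V)
```

The one-sided clip at 0 also reproduces the max(0, R - V) term from the paper: returns below the current value estimate contribute no gradient.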
Does this make sense?
Thanks for your reply; it makes a lot of sense.
In the paper, the SIL value loss is defined as 0.5 * max(0, (R - V))^2. However, in the code the value loss is computed as `self.vf_loss = tf.reduce_sum(self.W * v_estimate * tf.stop_gradient(delta)) / self.num_samples`, which means the value loss is 0.5 * V * clip((V - R), -5, 0). What's the advantage of this implementation? Thanks