junhyukoh / self-imitation-learning

ICML 2018 Self-Imitation Learning
MIT License

SIL Value update #3

Closed boscotsang closed 5 years ago

boscotsang commented 6 years ago

In the paper, the SIL value loss is defined as 0.5 * max(0, (R - V))^2. However, in the code the value loss is computed as `self.vf_loss = tf.reduce_sum(self.W * v_estimate * tf.stop_gradient(delta)) / self.num_samples`, which means the value loss is 0.5 * V * clip((V - R), -5, 0). What is the advantage of this implementation? Thanks.

junhyukoh commented 5 years ago

I agree the code is a bit hard to follow. First, the derivative of 1) 0.5 * (V - R)^2 with respect to V is identical to the derivative of 2) V * stop_gradient(V - R): in both cases the gradient is (V - R), because stop_gradient treats (V - R) as a constant. The clipping is applied to (V - R) to avoid overly large gradients, which is equivalent to using a form of Huber loss.
The use of the Huber loss is not described in the paper for brevity.
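The gradient equivalence above can be checked by hand. A minimal NumPy sketch (function names are illustrative, not from the repo; the clip value is an arbitrary example):

```python
import numpy as np

def squared_loss_grad(v, r):
    # d/dV [0.5 * (V - R)^2] = (V - R)
    return v - r

def surrogate_grad(v, r, clip=None):
    # d/dV [V * stop_gradient(delta)] = delta, since stop_gradient
    # makes delta a constant during differentiation
    delta = v - r
    if clip is not None:
        # Clipping delta bounds the gradient magnitude (a Huber-style
        # effect); the upper bound of 0 means only transitions where
        # the return exceeds the value estimate contribute, as in SIL.
        delta = float(np.clip(delta, -clip, 0.0))
    return delta

v, r = 2.0, 5.0
print(squared_loss_grad(v, r))          # -3.0
print(surrogate_grad(v, r))             # -3.0 (identical without clipping)
print(surrogate_grad(v, r, clip=1.0))   # -1.0 (gradient bounded by the clip)
```

Without clipping the two forms give the same gradient everywhere, so training is unchanged; the surrogate form just makes it convenient to clip the gradient directly.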

Does this make sense?

boscotsang commented 5 years ago

Thanks for your reply; it makes a lot of sense.