navneet-nmk closed this issue 5 years ago
It's correct. X_r_hat (the predictor network) is supposed to be optimized to approximate X_r (the random target network) during training.
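For context, here is a minimal sketch of that target/predictor pair in TF2 eager style (this is not the repository's TF1 code; the names `target_net`, `predictor_net`, `feat_dim` and the layer sizes are illustrative assumptions):

```python
import tensorflow as tf

feat_dim = 128  # size of the random feature embedding (illustrative)

# Frozen, randomly initialized target network: produces X_r and is never trained.
target_net = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(feat_dim),
])
target_net.trainable = False

# Trainable predictor network: produces X_r_hat and is trained to match X_r.
predictor_net = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(feat_dim),
])

obs = tf.random.normal([32, 84])   # a batch of (normalized) observations
X_r = target_net(obs)              # random target features
X_r_hat = predictor_net(obs)       # predictor features
```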
Yes, but it is optimized separately through self.aux_loss, right? Why would we want the policy gradients (via self.int_rew) to flow through the predictor network?
It doesn't matter, because int_rew is not included in the gradients (it never enters an optimized loss). For aux_loss, it does have to be tf.stop_gradient(X_r) - X_r_hat.
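Continuing the sketch above (same assumed names, TF2 eager style rather than the repo's TF1 graph code), this is the distinction being made: only aux_loss is differentiated, while int_rew is computed as a plain value that is later fed back as the intrinsic reward, so no policy gradient can flow through X_r_hat via it:

```python
# Only aux_loss is differentiated; the target stays frozen via stop_gradient.
with tf.GradientTape() as tape:
    X_r = tf.stop_gradient(target_net(obs))
    X_r_hat = predictor_net(obs)
    aux_loss = tf.reduce_mean(tf.square(X_r - X_r_hat))

# Gradients exist only for the predictor's variables.
grads = tape.gradient(aux_loss, predictor_net.trainable_variables)

# Intrinsic reward: per-example prediction error, used as data, not as a loss term.
int_rew = tf.reduce_mean(tf.square(X_r - X_r_hat), axis=-1, keepdims=True)
```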
I am not sure whether this is an error per se, but in the policy implementation, where the RND bonus and the reward-prediction bonus are calculated, shouldn't we use tf.stop_gradient on the variable X_r_hat when computing self.int_rew?
If we calculate the reward this way, won't the policy gradients also flow through X_r_hat, which would not be ideal?
self.int_rew = tf.reduce_mean(tf.square(tf.stop_gradient(X_r) - X_r_hat), axis=-1, keep_dims=True)
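For reference, the change being asked about would look roughly like this (a hypothetical variant, not what the repository does); as noted above, it makes no practical difference, because int_rew is only used as a reward signal and never enters an optimized loss:

```python
# Hypothetical variant with stop_gradient on X_r_hat as well; a no-op here,
# since int_rew is not part of any loss that gets differentiated.
self.int_rew = tf.reduce_mean(
    tf.square(tf.stop_gradient(X_r) - tf.stop_gradient(X_r_hat)),
    axis=-1, keep_dims=True)
```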