openai / random-network-distillation

Code for the paper "Exploration by Random Network Distillation"
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/

tf.stop_gradient missing? #11

Closed navneet-nmk closed 5 years ago

navneet-nmk commented 6 years ago

I am not sure whether this is an error per se, but in the policy implementation where the RND bonus and the reward prediction are calculated, shouldn't we also use tf.stop_gradient on X_r_hat when computing self.int_rew?

If we calculate the reward this way, won't the policy gradients also flow through X_r_hat, which seems undesirable?

self.int_rew = tf.reduce_mean(tf.square(tf.stop_gradient(X_r) - X_r_hat), axis=-1, keep_dims=True)

siyuhuang commented 6 years ago

It's correct. X_r_hat (the output of the predictor network) is expected to be optimized to approximate X_r (the output of the fixed random target network) during training.

navneet-nmk commented 5 years ago

Yes, but it is optimized separately via self.aux_loss, right? Why would we want the policy gradients (through the use of self.int_rew) to flow through the predictor network?

siyuhuang commented 5 years ago

It doesn't matter, because int_rew is not included in the gradients; it is only read out as a reward signal, so no policy gradient flows through the predictor network. For aux_loss, it does have to be tf.stop_gradient(X_r) - X_r_hat, so that only the predictor is updated and the random target network stays fixed.
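
For illustration, here is a minimal sketch of that point in TensorFlow 2 eager style (not the repo's TF1 graph code; target_net, predictor_net, and the shapes are made-up stand-ins for the actual CNNs). With tf.stop_gradient(X_r), the aux_loss gradient reaches only the predictor's variables:

```python
import tensorflow as tf

# Toy stand-ins for the repo's target/predictor networks (hypothetical layers/shapes).
target_net = tf.keras.Sequential([tf.keras.layers.Dense(16)])     # fixed random target
predictor_net = tf.keras.Sequential([tf.keras.layers.Dense(16)])  # trained to match it

obs = tf.random.normal([8, 4])  # dummy batch of observations

with tf.GradientTape(persistent=True) as tape:
    X_r = target_net(obs)         # random target features
    X_r_hat = predictor_net(obs)  # predicted features
    # Per-sample prediction error; stop_gradient keeps the target network frozen.
    int_rew = tf.reduce_mean(tf.square(tf.stop_gradient(X_r) - X_r_hat), axis=-1)
    # Auxiliary (predictor) loss is the same error, averaged over the batch.
    aux_loss = tf.reduce_mean(int_rew)

# Gradients reach the predictor only; the target network's gradients come back as None.
pred_grads = tape.gradient(aux_loss, predictor_net.trainable_variables)
targ_grads = tape.gradient(aux_loss, target_net.trainable_variables)
print([g is not None for g in pred_grads])  # [True, True]
print([g is not None for g in targ_grads])  # [False, False]
```

This matches the point above: int_rew is only read out as a value to be used as the intrinsic reward, so an extra stop_gradient on X_r_hat is not needed for the policy update.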