openai / random-network-distillation

Code for the paper "Exploration by Random Network Distillation"
https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/

RewardForwardFilter to compute intrinsic returns for normalizing intrinsic rewards #16

Open boscotsang opened 5 years ago

boscotsang commented 5 years ago

In ppo_agent.py, the running estimate of intrinsic returns is computed with `rff_int`: `rffs_int = np.array([self.I.rff_int.update(rew) for rew in self.I.buf_rews_int.T])`. In reinforcement learning, returns are computed as $\sum_t \gamma^t r_t$, discounting each reward by how far it lies in the future. However, `rff_int` seems to compute the returns as $\sum_t \gamma^{T-t} r_t$, which discounts the rewards forward in time. What is the reason for computing the intrinsic returns forward? Thanks!
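
(For context, the filter under discussion looks essentially like the following; this is a sketch reconstructed from the repository, so minor details may differ.)

```python
class RewardForwardFilter(object):
    """Running discounted sum of rewards, accumulated *forward* in time."""

    def __init__(self, gamma):
        self.gamma = gamma
        self.rewems = None  # running "reward exponential moving sum"

    def update(self, rews):
        if self.rewems is None:
            self.rewems = rews
        else:
            # ret_t = gamma * ret_{t-1} + r_t, i.e. sum_{s<=t} gamma^(t-s) r_s:
            # past rewards are discounted, not future ones.
            self.rewems = self.rewems * self.gamma + rews
        return self.rewems
```

Each `update` discounts the accumulated past rather than the future; per the issue title, the result only feeds a running-std estimate used to normalize the intrinsic reward, not a learning target.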

4kasha commented 5 years ago

Hi,

According to this comment, it seems to be just for convenience. Modifying it to `self.I.buf_rews_int.T[::-1]` would not change its std significantly, I think.
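
A quick numerical check of that claim might look like the sketch below (hypothetical shapes and values; it assumes the reward buffer is laid out as `(nenvs, nsteps)`, so that `.T` iterates over time steps as in ppo_agent.py):

```python
import numpy as np

def filtered_std(rews_over_time, gamma=0.99):
    """Std of the running discounted sums a RewardForwardFilter-style update produces."""
    rewems, sums = None, []
    for rew in rews_over_time:  # one (nenvs,) reward vector per time step
        rewems = rew if rewems is None else rewems * gamma + rew
        sums.append(rewems)
    return np.array(sums).std()

rews = np.random.randn(32, 128)    # hypothetical (nenvs, nsteps) intrinsic rewards
print(filtered_std(rews.T))        # forward-discounted, as in the code
print(filtered_std(rews.T[::-1]))  # time reversed, i.e. proper backward returns
```

For a long, roughly stationary reward stream the two standard deviations come out close, which fits the "just for convenience" reading: the forward sum can be maintained online with a single running value, whereas true backward returns would require knowing future rewards.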

alirezakazemipour commented 4 years ago

Exactly. :+1: I think they made a mistake! It should have been `self.I.buf_rews_int.T[::-1]`, as 4kasha mentioned.
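
(Concretely, the change suggested in this thread would be a one-liner in ppo_agent.py; untested sketch.)

```python
# current: the filter discounts past rewards (forward filtering)
rffs_int = np.array([self.I.rff_int.update(rew) for rew in self.I.buf_rews_int.T])

# suggested: reverse the time axis so the filter accumulates standard backward returns
rffs_int = np.array([self.I.rff_int.update(rew) for rew in self.I.buf_rews_int.T[::-1]])
```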