Potential bug in PPO+RND?

vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)

http://docs.cleanrl.dev

Potential bug in PPO+RND? #416

Closed roger-creus closed 4 months ago

roger-creus commented 10 months ago

In ppo_rnd_envpool.py, why is line 368:

predict_next_feature = rnd_model.predictor(rnd_next_obs)

not under the torch.no_grad() context?

Like the policy, the RND network should compute no gradients during rollout collection. I could be wrong though, I just wanted to make sure.

Thanks!

roger-creus commented 10 months ago

@yooceii

yooceii commented 10 months ago

Hi Roger, predict_next_feature itself is indeed being tracked, but it is only used to compute curiosity_rewards, which stores just the .data of the result. As far as gradients are concerned, that has the same effect as requires_grad=False or torch.no_grad(), so we do not collect gradients during the rollout. Does that answer your question?
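
To make the point above concrete, here is a minimal sketch comparing the two approaches. It is not the code from ppo_rnd_envpool.py: the two nn.Linear layers are hypothetical stand-ins for the RND predictor and target networks, the tensor sizes are arbitrary, and the reward formula is only assumed to resemble the one in the script.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the RND predictor and target networks.
predictor = nn.Linear(8, 4)
target = nn.Linear(8, 4)
for p in target.parameters():
    p.requires_grad_(False)  # the target network is fixed in RND

rnd_next_obs = torch.randn(16, 8)  # arbitrary batch of observations

# Variant A: forward pass outside no_grad, then keep only .data.
target_next_feature = target(rnd_next_obs)
predict_next_feature = predictor(rnd_next_obs)
tracked = predict_next_feature.requires_grad  # autograd recorded this op
rewards_a = ((target_next_feature - predict_next_feature).pow(2).sum(1) / 2).data

# Variant B: the same computation under torch.no_grad().
with torch.no_grad():
    pred_b = predictor(rnd_next_obs)
    rewards_b = (target(rnd_next_obs) - pred_b).pow(2).sum(1) / 2

print("intermediate tracked in A:", tracked)
print("rewards require grad (A, B):", rewards_a.requires_grad, rewards_b.requires_grad)
print("same values:", torch.allclose(rewards_a, rewards_b))
```

In this sketch both variants give the same reward values, and neither reward tensor requires grad, so nothing from the rollout can reach the optimizer. The remaining difference is that variant A still records an autograd graph for the intermediate predict_next_feature until that tensor is freed, whereas variant B never records one. Wrapping the call in torch.no_grad() would therefore be a small efficiency cleanup and would not change behavior.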