Closed — patricio-astudillo closed this issue 4 years ago
Hey @patricio-astudillo,
I need to update the documentation, but no, those are not the same. `experiences.log_probs` are the log probabilities calculated by the worker policies. The learner recomputes a forward pass to calculate its own policy's log probabilities, then checks the similarity of its log probabilities against the worker log probabilities. If the learner and worker policies are too different, gradients are scaled down (the V-trace algorithm). The policies differ because the worker policies lag behind the learner policy. In practice we don't see any significant policy lag that results in scaled-down gradients on our hardware, but it could be different for others.
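To make the mechanism concrete, here is a minimal sketch of the clipped importance weights that V-trace uses to scale gradients. The function name and the `clip_rho` default are assumptions for illustration, not adeptRL's actual API; the key idea is just `rho = min(clip, exp(learner_log_prob - actor_log_prob))`.

```python
import torch

def vtrace_importance_weights(learner_log_probs, actor_log_probs, clip_rho=1.0):
    """Clipped V-trace importance weights (hypothetical helper, not adeptRL code).

    rho_t = min(clip_rho, pi_learner(a_t|s_t) / pi_actor(a_t|s_t)),
    computed in log space for numerical stability.
    """
    log_ratio = learner_log_probs - actor_log_probs
    rhos = torch.exp(log_ratio)
    return torch.clamp(rhos, max=clip_rho)

# Identical policies: ratio is 1 everywhere, so nothing is scaled down.
same = vtrace_importance_weights(torch.tensor([-1.0, -2.0]),
                                 torch.tensor([-1.0, -2.0]))

# Stale worker policy: learner assigns lower probability to the taken action,
# so rho < 1 and that sample's gradient contribution is scaled down.
stale = vtrace_importance_weights(torch.tensor([-2.0]), torch.tensor([-1.0]))
```

If the two sets of log probabilities were truly identical, every weight would be exactly 1 and the scaling would be a no-op, which is what you would observe when policy lag is negligible.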
https://github.com/heronsystems/adeptRL/blob/759e42094dad745837ba5da9d92843a216bc0108/adept/learner/impala.py#L67
Should `r_log_probs_learner` be filled with actions or with log probabilities from the learner?
Aren't `r_log_probs_learner = torch.stack(r_log_probs)` and `r_log_probs_actor = torch.stack(experiences.log_probs)` currently the same?