heronsystems / adeptRL

Reinforcement learning framework to accelerate research
GNU General Public License v3.0

Learner log probs are the same as actor log probs #70

Closed by patricio-astudillo 4 years ago

patricio-astudillo commented 4 years ago

https://github.com/heronsystems/adeptRL/blob/759e42094dad745837ba5da9d92843a216bc0108/adept/learner/impala.py#L67

Should r_log_probs_learner be filled with actions or log_probs from the learner?

Aren't r_log_probs_learner = torch.stack(r_log_probs) and r_log_probs_actor = torch.stack(experiences.log_probs) currently the same?

jtatusko commented 4 years ago

Hey @patricio-astudillo,

I need to update the documentation, but no, those are not the same. The experiences.log_probs are the log probabilities calculated by the worker policies. The learner runs its own forward pass to compute its own policy's log probabilities. The learner then compares its log probabilities against the worker log probabilities. If the learner and worker policies are too different, gradients are scaled down (the V-trace algorithm). The policies differ because the worker policies lag behind the learner policy. In practice we don't see any significant policy lag that results in scaling down gradients on our hardware, but it could be different for others.
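For illustration, here is a minimal sketch (not adeptRL's actual code) of how the learner's recomputed log probabilities and the workers' stored log probabilities feed into the clipped V-trace importance ratios; the function name vtrace_importance_weights and the rho_bar parameter are hypothetical.

```python
# Minimal sketch of the V-trace importance-weight idea: the ratio between the
# learner's recomputed log probs and the workers' stored log probs determines
# how much each stale transition contributes to the gradient.
import torch

def vtrace_importance_weights(learner_log_probs, actor_log_probs, rho_bar=1.0):
    """Clipped importance ratios rho_t = min(rho_bar, pi_learner / pi_actor).

    learner_log_probs: log pi_learner(a_t | s_t), recomputed by the learner's forward pass
    actor_log_probs:   log pi_actor(a_t | s_t), stored in experiences by the workers
    """
    log_ratio = learner_log_probs - actor_log_probs
    rhos = torch.exp(log_ratio)
    # When the worker policy lags far behind the learner, the ratios shrink
    # (or are clipped), scaling down the gradient from those transitions.
    return torch.clamp(rhos, max=rho_bar)

# Example: identical policies give ratios of 1 (no scaling); lagging workers give < 1.
learner_lp = torch.tensor([-0.5, -1.2, -2.0])
actor_lp = torch.tensor([-0.5, -1.0, -1.5])
print(vtrace_importance_weights(learner_lp, actor_lp))
```

If the two tensors really were the same, every ratio would be exactly 1 and the clipping would never do anything; it is only because the learner recomputes its own forward pass that the ratios can deviate from 1.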