Closed — patricio-astudillo closed this issue 4 years ago
Hey @patricio-astudillo,
I need to update the documentation, but no, those are not the same. `experiences.log_probs` are the log probabilities calculated by the worker policies. The learner recomputes a forward pass to calculate its own policy's log probabilities, then checks the similarity of its log probabilities against the worker log probabilities. If the learner and worker policies are too different, gradients are scaled down (the V-trace algorithm). The policies differ because the worker policies lag behind the learner policy. In practice we don't see any significant policy lag that results in scaled-down gradients on our hardware, but it could be different for others.
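To make the mechanism concrete, here is a minimal sketch of the clipped importance weights that V-trace uses to scale gradients. The function name and the `clip_rho` default are assumptions for illustration, not adeptRL's actual API; the key idea is just `rho = min(clip, exp(learner_log_prob - actor_log_prob))`.

```python
import torch

def vtrace_importance_weights(learner_log_probs, actor_log_probs, clip_rho=1.0):
    """Clipped V-trace importance weights (hypothetical helper, not adeptRL code).

    rho_t = min(clip_rho, pi_learner(a_t|s_t) / pi_actor(a_t|s_t)),
    computed in log space for numerical stability.
    """
    log_ratio = learner_log_probs - actor_log_probs
    rhos = torch.exp(log_ratio)
    return torch.clamp(rhos, max=clip_rho)

# Identical policies: ratio is 1 everywhere, so nothing is scaled down.
same = vtrace_importance_weights(torch.tensor([-1.0, -2.0]),
                                 torch.tensor([-1.0, -2.0]))

# Stale worker policy: learner assigns lower probability to the taken action,
# so rho < 1 and that sample's gradient contribution is scaled down.
stale = vtrace_importance_weights(torch.tensor([-2.0]), torch.tensor([-1.0]))
```

If the two sets of log probabilities were truly identical, every weight would be exactly 1 and the scaling would be a no-op, which is what you would observe when policy lag is negligible.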
https://github.com/heronsystems/adeptRL/blob/759e42094dad745837ba5da9d92843a216bc0108/adept/learner/impala.py#L67
Should `r_log_probs_learner` be filled with actions or with log probabilities from the learner?
Aren't `r_log_probs_learner = torch.stack(r_log_probs)` and `r_log_probs_actor = torch.stack(experiences.log_probs)` currently the same?