philtabor / Youtube-Code-Repository

Repository for most of the code from my YouTube channel

PPO pytorch implementation question #61

Closed amjass12 closed 1 year ago

amjass12 commented 1 year ago

Hi, thank you so much for this guide - it is extremely clear and easy to follow! This isn't a bug report, but I have a few questions. First: why is the 'values' tensor sent to the actor device (referring to this line)?

values = T.tensor(values).to(self.actor.device)

The values tensor is not used solely by the actor; it is shared by both the actor and the critic, since it enters the MSE term for the critic loss, which is then added to the actor loss to form the total loss.
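To illustrate what I mean, here is a rough sketch of how `values` seems to feed into both loss terms in a single PPO update step. The variable names (`critic_value`, `returns`, `advantage`, etc.) are my own shorthand and may not match the repository exactly:

```python
import torch as T

def ppo_batch_loss(dist, critic_value, actions, old_probs, advantage, values,
                   policy_clip=0.2):
    # Probability ratio between the new policy and the old (behavior) policy.
    new_probs = dist.log_prob(actions)
    prob_ratio = (new_probs - old_probs).exp()

    # Clipped surrogate objective for the actor.
    weighted_probs = advantage * prob_ratio
    weighted_clipped_probs = T.clamp(prob_ratio, 1 - policy_clip,
                                     1 + policy_clip) * advantage
    actor_loss = -T.min(weighted_probs, weighted_clipped_probs).mean()

    # The stored `values` (old value estimates) are constants here: they build
    # the critic's regression target and, upstream, the advantages the actor uses.
    returns = advantage + values
    critic_loss = ((returns - critic_value) ** 2).mean()

    # Both terms are combined into a single loss for one backward pass.
    total_loss = actor_loss + 0.5 * critic_loss
    return total_loss
```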

And second: why did you not include KL divergence in your implementation? Was there a specific reason?

thank you so much again!

philtabor commented 1 year ago

Good questions.

1. The actor device and the critic device refer to the same thing: they are different variable names for the same physical device on your machine.
2. I left out the KL divergence because, from my initial reading of the paper, performance with the clipped loss was generally better.
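For reference, the two objectives from the paper look roughly like this. This is only a minimal sketch for comparison: the names and the fixed `beta` coefficient are illustrative, and the repository uses only the clipped form. The first lines also illustrate point 1, since both networks typically resolve their device from the same CUDA availability check:

```python
import torch as T

# Point 1: when the actor and critic each run this same check, self.actor.device
# and self.critic.device end up naming the same physical device.
device = T.device('cuda:0' if T.cuda.is_available() else 'cpu')

def clipped_objective(prob_ratio, advantage, clip=0.2):
    # Clipped surrogate (L^CLIP): minimum of the unclipped and clipped terms.
    unclipped = prob_ratio * advantage
    clipped = T.clamp(prob_ratio, 1 - clip, 1 + clip) * advantage
    return -T.min(unclipped, clipped).mean()

def kl_penalized_objective(prob_ratio, advantage, kl_div, beta=1.0):
    # KL-penalized surrogate (L^KLPEN): surrogate minus a beta-weighted KL
    # penalty between the old and new policies (beta is adapted in the paper).
    return -(prob_ratio * advantage - beta * kl_div).mean()
```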

amjass12 commented 1 year ago

Thank you so much for the quick response and clear explanations!