Closed: qlan3 closed this issue 1 year ago.
I notice that the critic loss implemented in PPO (https://github.com/luchris429/purejaxrl/blob/main/purejaxrl/ppo.py#L179) is quite different from the traditional TD error; it looks more like the clipped form of PPO's actor loss. Could you please point me to a reference? If there is no such reference, is there a reason for doing it this way?

Hello! Good question. The code is inspired by CleanRL's implementation, which itself comes from OpenAI's original implementation.

Costa Huang (author of CleanRL) did an amazing write-up about PPO implementation details here; Point 9 of the first section covers value function loss clipping. Notably, the works investigating it find that it does not help performance and can sometimes even harm it. However, I include it for the same reasons that Costa does.
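For concreteness, here is a minimal sketch of that clipped value loss in JAX. The function and argument names (`clipped_value_loss`, `old_value`, `targets`, `clip_eps`) are my own illustration rather than the exact identifiers in `ppo.py`, but the structure follows the actor-style clipping described in that write-up.

```python
import jax.numpy as jnp

def clipped_value_loss(value, old_value, targets, clip_eps=0.2):
    """PPO-style clipped value loss (illustrative sketch, not the repo's exact code).

    value     : critic predictions under the current parameters
    old_value : critic predictions recorded when the rollout was collected
    targets   : bootstrapped return targets (e.g. GAE advantages + old values)
    clip_eps  : the same epsilon used to clip the policy ratio in the actor loss
    """
    # Keep the new value prediction within clip_eps of the old one,
    # mirroring how the policy ratio is clipped in the actor loss.
    value_clipped = old_value + jnp.clip(value - old_value, -clip_eps, clip_eps)
    loss_unclipped = jnp.square(value - targets)
    loss_clipped = jnp.square(value_clipped - targets)
    # Pessimistic element-wise maximum of the two squared errors, then average.
    return 0.5 * jnp.maximum(loss_unclipped, loss_clipped).mean()
```

The pessimistic maximum over the clipped and unclipped squared errors is what gives the value loss its actor-like flavor; dropping the clipping term recovers a plain squared-error regression of the critic onto the return targets.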
Thank you for your quick and helpful reply!