google / brax

Massively parallel rigidbody physics simulation on accelerator hardware.
Apache License 2.0

I think PPO deterministic is implemented incorrectly #339

Closed by gijskoning 1 year ago

gijskoning commented 1 year ago

https://github.com/google/brax/blob/ff3ff641097699703087e1dc0a7b6e8305d78270/brax/training/agents/ppo/networks.py#L46

I noticed something weird when comparing the mean of many sampled actions with the deterministic action. For a distribution, I would expect the deterministic value to be almost equal to the mean of the samples.

I found that the action created at line 46 should also be postprocessed if I am not mistaken.

However, this sadly still doesn't give the expected result. I created a test plot to show the difference: Tanh(x) is the postprocessing step, and it still doesn't end up being the same as the mean of the samples. (image: test plot)

So I think something is incorrect but I haven't found the fix yet.
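
One possible contributor to a gap like this, offered as a sketch and not as a diagnosis of the plot: tanh is nonlinear, so the mean of tanh-squashed samples generally differs from tanh applied to the Gaussian mean, and the gap grows with the standard deviation. The function name `compare_mean_and_mode` and the values of `loc` and `scale` are made up for illustration; nothing here is taken from Brax.

```python
import math
import random


def compare_mean_and_mode(loc, scale, num_samples=100_000, seed=0):
    """Return (mean of tanh(samples), tanh(loc)) for x ~ Normal(loc, scale)."""
    rng = random.Random(seed)
    # Stochastic path: sample the Gaussian, then squash each sample with tanh.
    squashed = [math.tanh(rng.gauss(loc, scale)) for _ in range(num_samples)]
    sample_mean = sum(squashed) / num_samples
    # Deterministic path: squash the Gaussian mode (which equals its mean).
    deterministic = math.tanh(loc)
    return sample_mean, deterministic


# Hypothetical parameters: same mean, one wide and one narrow distribution.
wide_mean, wide_det = compare_mean_and_mode(loc=1.0, scale=1.0)
narrow_mean, narrow_det = compare_mean_and_mode(loc=1.0, scale=0.01)
print(f"scale=1.00: sample mean {wide_mean:.3f}, deterministic {wide_det:.3f}")
print(f"scale=0.01: sample mean {narrow_mean:.3f}, deterministic {narrow_det:.3f}")
```

With a narrow distribution the two values nearly coincide. With a wide one the sample mean is pulled toward zero, because tanh saturates for large inputs.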

gijskoning commented 1 year ago

Never mind, it only had to do with my policy's model. The code in Brax is correct: I noticed that the postprocessing is already done in the method that outputs the deterministic action.
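
To illustrate that resolution, here is a minimal toy sketch of a distribution whose mode method already applies the tanh postprocessing. `ToyTanhNormal` and its methods are hypothetical stand-ins written for this example and are not Brax's actual classes. It shows that squashing the deterministic action a second time, as first suggested in this issue, would change the result.

```python
import math


class ToyTanhNormal:
    """Hypothetical stand-in for a tanh-squashed Gaussian policy head."""

    def postprocess(self, raw_action):
        # Squash an unbounded Gaussian value into (-1, 1).
        return math.tanh(raw_action)

    def mode(self, loc):
        # The Gaussian mode is loc; the squashing is applied here, once.
        return self.postprocess(loc)


dist = ToyTanhNormal()
deterministic_action = dist.mode(1.0)
# Squashing again shrinks the action further, which is not what we want.
double_squashed = dist.postprocess(deterministic_action)
print(deterministic_action, double_squashed)
```

In this toy setup the caller should use the value returned by `mode` directly and not postprocess it again.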