I think PPO deterministic is implemented incorrectly

https://github.com/google/brax/blob/ff3ff641097699703087e1dc0a7b6e8305d78270/brax/training/agents/ppo/networks.py#L46

I noticed something odd when comparing the mean of many sampled actions with the deterministic action. I would expect the deterministic action to be almost equal to the mean of the samples drawn from the distribution.

If I am not mistaken, the action created at line 46 should also be passed through the postprocessing step.

Sadly, this still doesn't give the expected result. I created a test plot to show the difference: Tanh(x) is the postprocessing step, and its output still doesn't match the mean of the samples.
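A minimal sketch of this comparison, assuming the policy output is a Gaussian whose samples are squashed with tanh. The function name `compare_mode_and_sample_mean` and the values of `loc` and `scale` are hypothetical and not taken from the Brax code. Because tanh is nonlinear, the mean of the squashed samples generally differs from tanh applied to the Gaussian mean, so some gap of this kind may be expected even when the postprocessing is applied.

```python
import jax
import jax.numpy as jnp


def compare_mode_and_sample_mean(loc, scale, num_samples=100_000, seed=0):
    """Compare tanh(loc) with the empirical mean of tanh-squashed Gaussian samples.

    Hypothetical helper for illustration only; it does not call any Brax API.
    """
    key = jax.random.PRNGKey(seed)
    # Draw raw (pre-squash) samples from N(loc, scale^2).
    raw = loc + scale * jax.random.normal(key, (num_samples,))
    # Mean of the postprocessed samples, i.e. an estimate of E[tanh(x)].
    sample_mean = jnp.mean(jnp.tanh(raw))
    # Postprocessed "deterministic" value, i.e. tanh(E[x]).
    squashed_mode = jnp.tanh(jnp.asarray(loc))
    return float(squashed_mode), float(sample_mean)


# Assumed example values: a mean away from zero and a fairly wide distribution.
mode, mean = compare_mode_and_sample_mean(loc=1.0, scale=1.0)
print("tanh(loc):", mode)
print("mean of tanh(samples):", mean)
```

With a very small `scale`, or with `loc` at zero, the two values should nearly coincide. The gap grows as the distribution widens and its mean moves away from zero.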

So I think something is incorrect, but I haven't found the fix yet.

I think PPO deterministic is implemented incorrectly #339