Closed — acohen13 closed this issue 6 years ago
I believe `scale_reward=10` should work for the multi-direction ant, but you could also try a slightly smaller value (e.g., 3). The NaNs are probably caused by the number of gradient steps: the Ant environment in particular seemed to produce NaNs when the gradient step count and reward scale were both too high. One gradient step and a reward scale of 10 should be fine. Feel free to re-open this if you see any issues with those settings.
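For concreteness, a rough sketch of the suggested settings is below. Only `scale_reward` is named explicitly above; the other key name is a placeholder for illustration and may not match the example scripts in this repo exactly:

```python
# Hypothetical variant dict -- key names other than scale_reward are
# placeholders, not necessarily this repo's actual configuration keys.
variant = {
    'scale_reward': 10,    # suggested reward scale; try ~3 if training is unstable
    'n_train_repeat': 1,   # one gradient step per environment step to avoid NaNs
}
print(variant)
```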
By the way, the code base hasn't yet been updated to the latest version of the SAC paper, which means it is still missing, e.g., the double Q-learning trick. This will be fixed soon.
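For reference, here is a minimal sketch of what the double Q-learning trick computes. This is not this repo's code: the function and argument names are made up for illustration, and the explicit temperature `alpha` follows the later SAC formulation rather than the reward-scale version used here.

```python
import numpy as np

def clipped_double_q_target(rewards, dones, q1_next, q2_next, log_pi_next,
                            alpha=0.2, discount=0.99):
    """Soft Bellman target using the minimum of two target-Q estimates.

    Taking min(Q1, Q2) curbs the overestimation bias that a single
    Q-function target tends to accumulate.
    """
    min_q_next = np.minimum(q1_next, q2_next)       # clipped double-Q value
    soft_value = min_q_next - alpha * log_pi_next   # entropy-regularized value
    return rewards + discount * (1.0 - dones) * soft_value

# Toy usage with dummy batch values
targets = clipped_double_q_target(
    rewards=np.array([1.0, 0.5]),
    dones=np.array([0.0, 1.0]),
    q1_next=np.array([5.0, 4.0]),
    q2_next=np.array([4.5, 4.2]),
    log_pi_next=np.array([-1.0, -0.8]),
)
print(targets)
```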
When running with the default parameter settings (except for 16 components in the GMM and 4 gradient steps per iteration) on the Ant domain with the multi-direction task, the network weights become NaN. Is there a learning rate or reward scale I should be using instead of the defaults?