Closed — acohen13 closed this issue 6 years ago
I believe `scale_reward=10` should work for the multi-direction ant, but you could also try a slightly smaller value (e.g., 3). The NaNs are probably caused by the number of gradient steps: the Ant environment in particular seemed to produce NaNs when the gradient step count and reward scale were both too high. One gradient step and a reward scale of 10 should be fine. Feel free to re-open this if you see any issues with those settings.
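For concreteness, a rough sketch of the suggested settings is below. Only `scale_reward` is named explicitly above; the other key name is a placeholder for illustration and may not match the example scripts in this repo exactly:

```python
# Hypothetical variant dict -- key names other than scale_reward are
# placeholders, not necessarily this repo's actual configuration keys.
variant = {
    'scale_reward': 10,    # suggested reward scale; try ~3 if training is unstable
    'n_train_repeat': 1,   # one gradient step per environment step to avoid NaNs
}
print(variant)
```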
By the way, the code base hasn't yet been updated to the latest version of the SAC paper, which means it is still missing, e.g., the double Q-learning trick. This will be fixed soon.
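For reference, here is a minimal sketch of what the double Q-learning trick computes. This is not this repo's code: the function and argument names are made up for illustration, and the explicit temperature `alpha` follows the later SAC formulation rather than the reward-scale version used here.

```python
import numpy as np

def clipped_double_q_target(rewards, dones, q1_next, q2_next, log_pi_next,
                            alpha=0.2, discount=0.99):
    """Soft Bellman target using the minimum of two target-Q estimates.

    Taking min(Q1, Q2) curbs the overestimation bias that a single
    Q-function target tends to accumulate.
    """
    min_q_next = np.minimum(q1_next, q2_next)       # clipped double-Q value
    soft_value = min_q_next - alpha * log_pi_next   # entropy-regularized value
    return rewards + discount * (1.0 - dones) * soft_value

# Toy usage with dummy batch values
targets = clipped_double_q_target(
    rewards=np.array([1.0, 0.5]),
    dones=np.array([0.0, 1.0]),
    q1_next=np.array([5.0, 4.0]),
    q2_next=np.array([4.5, 4.2]),
    log_pi_next=np.array([-1.0, -0.8]),
)
print(targets)
```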
When running with the default parameter settings (except for 16 components in the GMM and 4 gradient steps per iteration) on the Ant domain with the multi-direction task, the network weights become NaN. Is there a learning rate or reward scale I should be using instead of the defaults?