Open smorad opened 1 year ago
Yes, I also observed overestimation and exploding gradients when training LSTM TD3 in some hard environments like Walker-V. A simple remedy may be to add gradient clipping to avoid the explosion, although I don't expect this to fix the issue.
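For reference, here is a minimal sketch of what clipping by global norm does, written in plain Python so it runs without any dependencies. The helper name `clip_by_global_norm` and the example values are assumptions for illustration only, not code from this repository. In PyTorch the equivalent step is usually a call to `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Hypothetical helper for illustration. Returns the (possibly rescaled)
    gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within the budget: return an unchanged copy.
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm


# Two parameter groups whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

Logging the pre-clip norm (`norm_before` here) over training can help show when the explosion starts. Clipping only bounds the size of each update; it does not change the Q-value targets, so any overestimation would still need to be addressed separately.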
I'm rerunning the velocity baselines in the POMDP directory and I'm observing exploding Q values fairly often. I was wondering if this is something you experienced during training. TD3 seems to avoid overestimation bias, but the returns seem low. Any tips for getting more stable returns across trials without massive batch sizes?