I was trying to use the Adam optimizer for a DQN problem, but the Q values kept diverging regardless of hyperparameters. So I looked into the code, and it seems the optimizer doesn't correctly update its stateful parameters: it calculates new values for m and v, but in doing so it operates on copies of the NdArrays. The actual state variables referenced by xs[2] and xs[3] are never updated, so on each iteration the optimizer uses zero-initialized values for m and v.
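To illustrate the failure mode, here is a minimal Python sketch of a scalar Adam step (not the library's actual code; the function and state names are illustrative). The `write_back=False` path mimics the bug: the new moments are computed on local copies and never persisted, so every step restarts from zero-initialized m and v.

```python
import math

def adam_step(param, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              write_back=True):
    """One Adam update on a scalar parameter; `state` carries m, v, t."""
    state["t"] += 1
    m = b1 * state["m"] + (1 - b1) * grad         # new first-moment estimate
    v = b2 * state["v"] + (1 - b2) * grad * grad  # new second-moment estimate
    if write_back:
        state["m"], state["v"] = m, v             # correct: persist the moments
    # With write_back=False the moments in `state` are never updated, so every
    # step recomputes m and v from zero -- the failure mode described above.
    m_hat = m / (1 - b1 ** state["t"])            # bias-corrected moments
    v_hat = v / (1 - b2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)
```

With `write_back=True` the moments accumulate across steps as Adam intends; with `write_back=False`, `state["m"]` and `state["v"]` stay at zero forever, which matches the behavior seen with xs[2] and xs[3].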