stevenpjg / ddpg-aigym

Continuous control with deep reinforcement learning - Deep Deterministic Policy Gradient (DDPG) algorithm implemented in OpenAI Gym environments
MIT License

Question on Loss function of Critic Network training #7

Closed · RuofanKong closed this issue 8 years ago

RuofanKong commented 8 years ago

Hello,

I just read through your code on the DDPG implementation, and it looks awesome :) I have a question for you: what does the curve of the critic's Q loss look like over training time when you train Inverted Pendulum with DDPG? I also implemented DDPG myself, and I noticed that the Inverted Pendulum agent did learn something, but the Q loss diverged. I wonder if you have the same issue with your implementation.

Thank you so much!

stevenpjg commented 8 years ago

I have never checked the plot of the Q loss over time. My interpretation is that we usually expect the Q loss to decrease with time. That would hold if we had a perfect supervisor giving us the true target value (the expected return). Since we approximate that target using only the Q value at the next time step (and also approximate the value function with a neural network), we cannot expect a steadily decreasing Q loss. It may fluctuate (sometimes diverging a bit and recovering), but eventually the loss decreases.
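
For reference, here is a minimal sketch of that bootstrapped target and the resulting critic loss. The names (`target_actor`, `target_critic`, `critic_targets`) are illustrative placeholders, not functions from this repository, and the discount factor is an assumed value:

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed value for this sketch)

def critic_targets(r, s2, done, target_actor, target_critic):
    """y = r + gamma * Q'(s2, mu'(s2)) for non-terminal transitions."""
    a2 = target_actor(s2)                      # action from the target actor
    q_next = target_critic(s2, a2)             # Q value at the next instant
    return r + GAMMA * (1.0 - done) * q_next   # bootstrapped target

def critic_loss(q_pred, y):
    # Mean squared error between Q(s, a) and the moving target y; because
    # y itself shifts as the networks change, this loss need not decrease
    # monotonically.
    return np.mean((q_pred - y) ** 2)
```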

This implementation does not diverge. In particular, I found a good improvement in convergence speed after adding batch normalization.
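
As an illustration only, batch normalization can be added on the state path of the critic, roughly in the spirit of the DDPG paper. This sketch uses tf.keras for brevity (this repository uses lower-level TensorFlow), and the layer sizes are placeholders, not the values used here:

```python
import tensorflow as tf

def make_critic(state_dim, action_dim):
    state_in = tf.keras.Input(shape=(state_dim,))
    action_in = tf.keras.Input(shape=(action_dim,))
    x = tf.keras.layers.Dense(400)(state_in)
    x = tf.keras.layers.BatchNormalization()(x)         # normalize state features
    x = tf.keras.layers.Activation("relu")(x)
    x = tf.keras.layers.Concatenate()([x, action_in])   # actions join after the first layer
    x = tf.keras.layers.Dense(300, activation="relu")(x)
    q_out = tf.keras.layers.Dense(1)(x)                  # scalar Q value
    return tf.keras.Model(inputs=[state_in, action_in], outputs=q_out)
```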

From my experience, I used the following checks to debug divergence issues.

  1. Set the learning rates to zero (for both the actor and critic networks) and check whether the loss still diverges. If it does, there is likely a division by zero somewhere. (Also make sure you have not initialized the weights to zero.)
  2. If it no longer diverges with zero learning rates, the divergence is probably caused by exploding gradients; in that case, try clipping the gradients (see the sketch after this list).
  3. Alternatively, use the gradient inverter to keep the parameters bounded. Check the implementation here: https://github.com/stevenpjg/ddpg-aigym/blob/master/tensorflow_grad_inverter.py
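
A rough sketch of check 2 (gradient clipping), not this repository's training code; `critic_loss_fn` and `critic_vars` are placeholder names for your own loss closure and trainable variables:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # learning rate is an assumed value

def train_critic_step(critic_loss_fn, critic_vars, clip_norm=1.0):
    # Compute the critic loss and its gradients, clip the gradients by
    # global norm to bound the update size, then apply them.
    with tf.GradientTape() as tape:
        loss = critic_loss_fn()
    grads = tape.gradient(loss, critic_vars)
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm)
    optimizer.apply_gradients(zip(clipped, critic_vars))
    return loss
```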