Perhaps this is because the loss calculation is a single graph flowing from states through action values to targets, so fiddling with the action-value weights changes too much at once: the targets shift along with the predictions.
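As a minimal sketch of what that coupled graph might look like (assuming PyTorch and a Q-learning-style update; `q_net`, `states`, `actions`, `rewards`, and `next_states` are hypothetical names, not from this codebase):

```python
import torch
import torch.nn.functional as F

def coupled_loss(q_net, states, actions, rewards, next_states, gamma=0.99):
    # Predicted values for the actions actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # The target is built from the SAME network, inside the same graph:
    # every weight update moves the target as well as the prediction.
    q_next = q_net(next_states).max(dim=1).values
    target = rewards + gamma * q_next
    return F.mse_loss(q_pred, target)
```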
The tests in value_function_tests show that we can reproduce the desired target function with all sorts of nnet configurations, so capacity isn't the problem; there is no reason we can't train the net successfully.
Perhaps we could do the training in 2 steps again (see the sketch after the list)...
1. Create the targets
2. Train the NNet towards the target
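A sketch of the two-step version under the same assumptions (the `fit_steps` count is arbitrary): the targets are created once, outside the graph, so they stay fixed while the net is trained towards them.

```python
import torch
import torch.nn.functional as F

def two_step_update(q_net, optimizer, states, actions, rewards, next_states,
                    gamma=0.99, fit_steps=5):
    # Step 1: create the targets. no_grad() detaches them from the weights,
    # so they don't move while we train.
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * q_next

    # Step 2: train the NNet towards the (now constant) targets.
    for _ in range(fit_steps):
        q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q_pred, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```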
The current method just trains the nnet towards the target in a single pass, so while different weights are being evaluated to calculate the gradient, the target itself is changing underneath them.
Convergence is too precarious.