mobeets / sarsa-rnn


well, somehow it appears to be working #1

Closed mobeets closed 2 years ago

mobeets commented 2 years ago

I remember now that training an RNN with a bootstrapping method like SARSA is potentially a little tricky: technically you're supposed to take a gradient step every iteration, but the gradient depends on recent time steps (because the RNN carries a hidden state), which means successive gradient steps are correlated over time.
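For reference, the one-step SARSA target being bootstrapped here looks like this (the discount value is just a stand-in):

```python
def sarsa_td_error(q_sa, reward, q_next_sa, gamma=0.9):
    """One-step on-policy SARSA TD error:
    delta_t = r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t),
    where a_{t+1} is the action the policy actually took."""
    return reward + gamma * q_next_sa - q_sa

# e.g. Q(s,a)=0.5, r=1.0, Q(s',a')=0.0 -> delta = 0.5
print(sarsa_td_error(0.5, 1.0, 0.0, gamma=0.9))
```

Because `q_next_sa` comes from the network's own (hidden-state-dependent) estimates, consecutive TD errors within a trial are not independent, which is where the correlated-gradients issue comes from.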

Nevertheless, it appears to be...learning?

This is where I take a gradient step after every episode, where each episode is 20 concatenated trials.
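A minimal sketch of that per-episode update, assuming a GRU Q-network trained with PyTorch on fake data (the dimensions, network, and optimizer settings here are all illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

class RNNQNet(nn.Module):
    """Recurrent Q-network: GRU over observations, linear readout to Q-values."""
    def __init__(self, obs_dim=4, n_actions=2, hidden_dim=16):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, h0=None):
        out, h = self.rnn(obs_seq, h0)   # out: (batch, T, hidden_dim)
        return self.head(out), h         # Q-values: (batch, T, n_actions)

def episode_loss(qnet, obs, actions, rewards, gamma=0.9):
    """Mean squared SARSA TD error over one episode, with BPTT through the GRU."""
    q_all, _ = qnet(obs)                            # (1, T, n_actions)
    T = actions.shape[0]
    q_taken = q_all[0, torch.arange(T), actions]    # Q(s_t, a_t)
    # SARSA bootstraps from the *taken* next action; the target is treated
    # as fixed (detached), and the final step bootstraps from 0.
    q_next = torch.cat([q_taken.detach()[1:], torch.zeros(1)])
    targets = rewards + gamma * q_next
    return ((q_taken - targets) ** 2).mean()

# One gradient step per episode of 20 concatenated trials (fake data here):
qnet = RNNQNet()
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)
T = 20
obs = torch.randn(1, T, 4)
actions = torch.randint(0, 2, (T,))
rewards = torch.randn(T)
loss = episode_loss(qnet, obs, actions, rewards)
opt.zero_grad()
loss.backward()
opt.step()
```

Batching the whole episode into one step at least averages the per-step TD errors before updating, though consecutive episodes' gradients can still be correlated if the hidden state or data distribution drifts slowly.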