Update generate_experience

Currently, this selects the next action randomly, but in Q-learning it should use the Q-value function.
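As a rough illustration of choosing an action from Q-values instead of uniformly at random, here is a minimal epsilon-greedy sketch. The function name `epsilon_greedy_action` and its parameters are hypothetical and are not taken from this repository. It assumes the Q-values for the current state are already available as a 1-D NumPy array.

```python
import numpy as np


def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick an action index from a 1-D array of Q-values (hypothetical helper).

    With probability epsilon a uniformly random action is returned,
    otherwise the action with the highest Q-value is returned.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])

# epsilon=0.0 never explores, so this is the purely greedy choice.
greedy_action = epsilon_greedy_action(q_values, epsilon=0.0, rng=rng)

# epsilon=1.0 always explores, which matches the current random behaviour.
random_action = epsilon_greedy_action(q_values, epsilon=1.0, rng=rng)
```

Setting `epsilon=1.0` reproduces the current random selection, so the existing behaviour could be kept as a special case.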

However, we probably don't want to call the Q-value function too often: it is a neural network call, so we can't optimise it using numba.

We need to be able to specify what policy to use for generate_experience.
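One possible shape for this, sketched under assumptions: the policy is passed in as a callable, so the caller decides whether actions are random or driven by Q-values. The signature shown for `generate_experience`, the `step` function, and the helpers `random_policy`, `make_greedy_policy`, `toy_step` and `toy_q` are all hypothetical and do not reflect the existing code in this repository.

```python
import numpy as np


def random_policy(state, num_actions, rng):
    """Hypothetical policy: ignore the state and pick a uniform random action."""
    return int(rng.integers(num_actions))


def make_greedy_policy(q_function):
    """Wrap a Q-value function (state -> array of Q-values) as a policy."""
    def policy(state, num_actions, rng):
        return int(np.argmax(q_function(state)))
    return policy


def generate_experience(step, initial_state, num_actions, policy, num_steps, rng):
    """Collect (state, action, reward, next_state, done) tuples.

    `step(state, action) -> (next_state, reward, done)` is an assumed
    environment interface, used here only for illustration.
    """
    experience = []
    state = initial_state
    for _ in range(num_steps):
        action = policy(state, num_actions, rng)
        next_state, reward, done = step(state, action)
        experience.append((state, action, reward, next_state, done))
        if done:
            break
        state = next_state
    return experience


def toy_step(state, action):
    """Toy line world: action 1 moves right, action 0 moves left, goal at 3."""
    next_state = state + (1 if action == 1 else -1)
    done = next_state == 3
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def toy_q(state):
    """Stand-in for the neural network: always prefers moving right."""
    return np.array([0.0, 1.0])


rng = np.random.default_rng(0)
greedy_run = generate_experience(
    toy_step, 0, 2, make_greedy_policy(toy_q), num_steps=10, rng=rng
)
random_run = generate_experience(
    toy_step, 0, 2, random_policy, num_steps=10, rng=rng
)
```

With this arrangement the random policy never calls the Q-value function, so the number of neural network calls depends entirely on which policy is passed in.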

jsphon / reinforcement_learning