Closed jsphon closed 7 years ago
Currently, generate_experience selects the next action randomly, but in Q-learning it should use the Q-value function.
However, we probably don't want to call the Q-value function too often: it is a neural network call, so we can't optimise it with numba.
We need to be able to specify which policy generate_experience uses.
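
One possible shape for this, as a rough sketch only: pass the policy in as a callable. The signature of `generate_experience` shown here, the `Policy` type, the helper names and the toy `chain_step` environment are all assumptions made for illustration, not the project's actual API. With an epsilon-greedy policy the Q-value function is only called on non-exploratory steps, which may help keep the number of network calls down.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a policy maps a state to an action index,
# and a transition is (state, action, reward, next_state, done).
Policy = Callable[[int], int]
Transition = Tuple[int, int, float, int, bool]


def random_policy(num_actions: int, rng: random.Random) -> Policy:
    """Pick every action uniformly at random (the current behaviour)."""
    def policy(state: int) -> int:
        return rng.randrange(num_actions)
    return policy


def epsilon_greedy_policy(
    q_values: Callable[[int], Sequence[float]],
    num_actions: int,
    epsilon: float,
    rng: random.Random,
) -> Policy:
    """Explore with probability epsilon, otherwise take the best-valued action.

    q_values stands in for the neural network call; it is only invoked
    on greedy steps.
    """
    def policy(state: int) -> int:
        if rng.random() < epsilon:
            return rng.randrange(num_actions)
        values = q_values(state)
        return max(range(num_actions), key=lambda a: values[a])
    return policy


def chain_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Toy environment (assumption): states 0..4, action 1 moves right,
    any other action moves left, reward 1.0 on reaching state 4."""
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    done = next_state == 4
    return next_state, (1.0 if done else 0.0), done


def generate_experience(
    step: Callable[[int, int], Tuple[int, float, bool]],
    policy: Policy,
    start_state: int,
    max_steps: int,
) -> List[Transition]:
    """Roll out one episode, choosing each action with the given policy."""
    experience: List[Transition] = []
    state = start_state
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = step(state, action)
        experience.append((state, action, reward, next_state, done))
        if done:
            break
        state = next_state
    return experience
```

Usage would then be a matter of swapping the policy argument, for example `generate_experience(chain_step, random_policy(2, random.Random(0)), 0, 10)` for the existing random behaviour, or an `epsilon_greedy_policy` built around the Q-value function for Q-learning.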