riccardodv opened this issue 2 years ago
Yes, I think you are right!
I changed it, and things work at least as well as before, but I got the following warning, which might be unrelated though:
rl_tools.py:33: UserWarning: reached horizon, and terminal is suspiciously True. Check if _max_episode_steps has the expected behavior in your environment
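The warning usually points at a time-limit cutoff being reported as a terminal state. A minimal sketch of the distinction (the function and argument names here are illustrative, not from the repo): a true terminal drops the bootstrap term, while a horizon truncation should keep it.

```python
def td_target(reward, q_next, terminal, truncated, gamma=0.99):
    """One-step TD target.

    terminal:  the environment really ended (absorbing state).
    truncated: the episode was cut off at _max_episode_steps; the
               state is not absorbing, so we still bootstrap.
    """
    done = terminal and not truncated
    return reward + (0.0 if done else gamma * q_next)


# A horizon cutoff keeps the bootstrap term...
print(td_target(1.0, 2.0, terminal=True, truncated=True))   # 1 + 0.99 * 2
# ...while a genuine terminal state drops it.
print(td_target(1.0, 2.0, terminal=True, truncated=False))  # 1.0
```

If `_max_episode_steps` triggers `terminal=True` without a truncation flag, the target wrongly treats the horizon state as absorbing, which is what the warning is guarding against.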
Yes thanks!
Shouldn't it be `ap` in the `index` argument of `gather`? If I understand correctly, it should correspond to evaluating the Q-value function at the next state and next action. @akrouriad
`newqs` is the residual of Q at s and uses a, and `newqsp` is the residual at s', which uses a'.
I was not using a' at all before; instead I evaluated V(s') by weighting every Q(s', .) with the policy probability pi(. | s'). But after Riccardo fixed the bug, the last commit uses Q(s', a'), which I think is more realistic: close to convergence we cannot really count on the Q function being accurate for all actions, since we explore less and less.
https://github.com/riccardodv/MirrorRL/blob/086a7361d15ad5412c102278a257c38030198e3c/rl_tools.py#L56-L68
I think it is actually `nact = self.policy(nobs)`. Do you agree?