riccardodv / MirrorRL


bug #1

Open riccardodv opened 2 years ago

riccardodv commented 2 years ago

https://github.com/riccardodv/MirrorRL/blob/086a7361d15ad5412c102278a257c38030198e3c/rl_tools.py#L56-L68

I think it should actually be nact = self.policy(nobs). Do you agree?
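For reference, a minimal sketch of the kind of SARSA-style rollout step being discussed, where the next action is sampled from the next observation. Names like `collect_episode`, `policy`, `obs`/`nobs` are assumptions based on the snippet linked above, not the actual rl_tools.py code, and the older gym 4-tuple step API is assumed:

```python
import gym

def collect_episode(env, policy, horizon=500):
    """Roll out one episode, storing (s, a, r, s', a', done) transitions.
    The proposed fix is that a' comes from the *next* observation:
    nact = policy(nobs)."""
    transitions = []
    obs = env.reset()
    act = policy(obs)
    for _ in range(horizon):
        nobs, rwd, terminal, _ = env.step(act)
        nact = policy(nobs)  # proposed fix: next action from next observation
        transitions.append((obs, act, rwd, nobs, nact, terminal))
        obs, act = nobs, nact
        if terminal:
            break
    return transitions

# Example usage with a random policy on CartPole:
# env = gym.make("CartPole-v1")
# transitions = collect_episode(env, lambda o: env.action_space.sample())
```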

AleShi94 commented 2 years ago

Yes, I think you are right!

riccardodv commented 2 years ago

I changed it and things work at least as well as before, but I got the following warning, which might be unrelated: rl_tools.py:33: UserWarning: reached horizon, and terminal is suspiciously True. Check if _max_episode_steps has the expected behavior in your environment
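A hedged sketch of the kind of check that likely produces this warning: when an episode ends exactly at the horizon, terminal=True may only reflect the gym TimeLimit wrapper truncating the episode rather than a genuinely terminal state. Variable names below are assumptions, not the actual rl_tools.py code:

```python
import warnings
import gym

env = gym.make("CartPole-v1")
# _max_episode_steps is set by gym's TimeLimit wrapper (older gym API assumed)
max_steps = getattr(env, "_max_episode_steps", None)

# inside the rollout loop, with t the current step index:
# if terminal and max_steps is not None and (t + 1) >= max_steps:
#     warnings.warn("reached horizon, and terminal is suspiciously True. "
#                   "Check if _max_episode_steps has the expected behavior "
#                   "in your environment")
#     # a truncated episode is usually treated as non-terminal when
#     # bootstrapping, since the state itself is not absorbing.
```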

akrouriad commented 2 years ago

Yes thanks!

AleShi94 commented 2 years ago

https://github.com/riccardodv/MirrorRL/blob/b7830390561630ca33fc8c4563d4ec45895a28a2/cascade_mirror_rl_brm.py#L97-L98

Shouldn't it be ap in the index argument of gather? If I understand correctly, it should correspond to evaluating the Q-value function at the next state and next action. @akrouriad
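A minimal, self-contained illustration of the gather call in question, selecting Q(s', a') per row along the action dimension. Shapes and tensor names (qsp, ap) are assumptions, not the actual cascade_mirror_rl_brm.py code:

```python
import torch

batch, n_actions = 4, 3
qsp = torch.randn(batch, n_actions)        # Q(s', .) for a batch of next states
ap = torch.randint(n_actions, (batch, 1))  # next actions a', shape (batch, 1)

# gather along dim=1 picks Q(s', a') for each sample in the batch,
# which is the quantity needed for the residual at s' if we bootstrap with a'.
q_next = qsp.gather(dim=1, index=ap)       # shape (batch, 1)
```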

akrouriad commented 2 years ago

newqs is the residual of Q at s and uses a, while newqsp is the residual at s', which uses a'.

I was not using a' at all before (I was instead evaluating V(s') by weighting every Q(s', .) with the policy probability pi(. | s')). But after Riccardo fixed the bug, the last commit uses Q(s', a'), which I think is more realistic: close to convergence we cannot really count on the Q function to be accurate for all actions, since we explore less and less.
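A short sketch contrasting the two bootstrapping targets described above: the previous expected-SARSA-style target V(s') = sum_a pi(a | s') Q(s', a), and the current SARSA-style target Q(s', a') using the sampled next action. Tensor names are illustrative, not the repo's actual code:

```python
import torch

batch, n_actions, gamma = 4, 3, 0.99
qsp = torch.randn(batch, n_actions)                          # Q(s', .)
pi_sp = torch.softmax(torch.randn(batch, n_actions), dim=1)  # pi(. | s')
ap = torch.randint(n_actions, (batch, 1))                    # sampled next actions a'
rwd = torch.randn(batch, 1)
not_done = torch.ones(batch, 1)                              # 0 where s' is terminal

# previous approach: weight every Q(s', .) by the policy probabilities
v_sp = (pi_sp * qsp).sum(dim=1, keepdim=True)
target_expected = rwd + gamma * not_done * v_sp

# current approach after the fix: use only the sampled next action a'
q_sp_ap = qsp.gather(1, ap)
target_sampled = rwd + gamma * not_done * q_sp_ap
```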