Closed Roberto09 closed 2 years ago
Hey, nice work! I think this library and the work on AWAC are very cool. I had a small question, though, about this line:
https://github.com/anair13/rlkit/blob/028885a6528b9d871d1946671f84ad93d90eded1/rlkit/torch/sac/awac_trainer.py#L412
Is it expected to use the rewards obtained from the batch rather than those obtained by advancing the current policy with the action it takes at that moment? If so, is there any reference to this in the paper?
Thanks!

Hmm, I'm not sure I understand the question, but this is basically the standard Bellman backup. For a transition (s, a, r, s') collected in the environment, the Q target [for Q(s, a)] = r + gamma * Q(s', a').
So the reward in the batch is whatever reward the environment returned after being in state s and taking action a.
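A minimal sketch of the Bellman-backup target computation being discussed (the function and dictionary keys here are illustrative, not the repo's actual API): the reward comes straight from the stored transition in the replay batch, while only the bootstrap term Q(s', a') involves the current policy.

```python
import numpy as np

def q_target(batch, q_next, gamma=0.99):
    """Standard Bellman backup target for Q(s, a).

    `batch` holds transitions (s, a, r, s') sampled from the replay
    buffer; the reward is the one the environment returned when the
    transition was originally collected, not something recomputed
    from the current policy.
    """
    rewards = batch["rewards"]      # r from the stored transition
    terminals = batch["terminals"]  # 1.0 if s' was terminal
    # q_next approximates Q(s', a'), with a' sampled from the current policy
    return rewards + gamma * (1.0 - terminals) * q_next

batch = {"rewards": np.array([1.0, 0.0]),
         "terminals": np.array([0.0, 1.0])}
q_next = np.array([2.0, 5.0])
print(q_target(batch, q_next, gamma=0.9))  # [2.8, 0.0]
```

Note that the terminal flag zeroes out the bootstrap term, so a transition ending the episode contributes only its stored reward.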