Closed Roberto09 closed 2 years ago
Hey, nice work! I think this library and the work on AWAC are very cool. I had a small question, though, about this line:
https://github.com/anair13/rlkit/blob/028885a6528b9d871d1946671f84ad93d90eded1/rlkit/torch/sac/awac_trainer.py#L412
Is it expected to use the rewards obtained from the batch rather than those obtained by advancing the current policy with the action it takes at that moment? If so, is there any reference to this in the paper?
Thanks!

Hmm, I'm not sure I understand the question, but this is basically the standard Bellman backup. For a transition (s, a, r, s') collected in the environment, the Q target [for Q(s, a)] = r + gamma * Q(s', a').
So the reward in the batch is whatever reward the environment returned after being in state s and taking action a.
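A minimal sketch of the Bellman-backup target computation being discussed (the function and dictionary keys here are illustrative, not the repo's actual API): the reward comes straight from the stored transition in the replay batch, while only the bootstrap term Q(s', a') involves the current policy.

```python
import numpy as np

def q_target(batch, q_next, gamma=0.99):
    """Standard Bellman backup target for Q(s, a).

    `batch` holds transitions (s, a, r, s') sampled from the replay
    buffer; the reward is the one the environment returned when the
    transition was originally collected, not something recomputed
    from the current policy.
    """
    rewards = batch["rewards"]      # r from the stored transition
    terminals = batch["terminals"]  # 1.0 if s' was terminal
    # q_next approximates Q(s', a'), with a' sampled from the current policy
    return rewards + gamma * (1.0 - terminals) * q_next

batch = {"rewards": np.array([1.0, 0.0]),
         "terminals": np.array([0.0, 1.0])}
q_next = np.array([2.0, 5.0])
print(q_target(batch, q_next, gamma=0.9))  # [2.8, 0.0]
```

Note that the terminal flag zeroes out the bootstrap term, so a transition ending the episode contributes only its stored reward.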