raisimTech / raisimLib


Definition of the advantage function in the runner.py #410

Closed ClaudioChiariello closed 1 year ago

ClaudioChiariello commented 1 year ago

Hello everyone, I have been trying to understand this portion of code for days. In storage.py, inside the raisimGymTorch folder, there is a function that computes the return. I think the return it refers to is $$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$$

delta = self.rewards[step] + next_is_not_terminal * gamma * next_values - self.values[step] 
advantage = delta + next_is_not_terminal * gamma * lam * advantage
self.returns[step] = advantage + self.values[step] 
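For context, I believe these three lines sit inside a backward pass over the stored trajectory. Below is a minimal self-contained sketch of how I read that loop (the function name, the `dones` handling, and `last_value` are my assumptions, not the actual storage.py code):

```python
import numpy as np

def compute_returns(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Sketch of a backward GAE pass; hypothetical reconstruction, not raisimGymTorch's code."""
    num_steps = len(rewards)
    returns = np.zeros(num_steps)
    advantage = 0.0
    for step in reversed(range(num_steps)):
        next_is_not_terminal = 1.0 - dones[step]
        next_values = last_value if step == num_steps - 1 else values[step + 1]
        # one-step TD error of the value function: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[step] + next_is_not_terminal * gamma * next_values - values[step]
        # recursive accumulation of discounted, lambda-weighted TD errors (reset after terminal states)
        advantage = delta + next_is_not_terminal * gamma * lam * advantage
        # adding V(s_t) back turns the advantage estimate into a return estimate
        returns[step] = advantage + values[step]
    return returns
```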

self.values is the output of the critic, which is approximated by a neural network. So I think it is a Q-value: to evaluate whether the actor's action is good, we need a Q-value. But I suspect that in this code the output of the critic is treated as a value function, because of the third line of code.

I find the second line a bit odd. Where does that update rule come from? My knowledge is limited to the fact that $$A(s,a) = Q(s,a) - V(s)$$ and that we can update the Q-value using the temporal-difference error: $$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r_{t+1} + \gamma Q(s',a') - Q(s,a) \right)$$ My questions are: how is this related to the second line of the code, and once I have the advantage function, why is the Q-value (or value function) added to the advantage to obtain the return? Thank you for the help =)

jhwangbo commented 1 year ago

The advantage you wrote is the definition. We can estimate it in various ways. delta is simply the difference between the value estimated from this sample and the estimate from the network. The second line is generalized advantage estimation; you should read about it here: https://arxiv.org/abs/1506.02438. In the third line, the subtracted self.values is added back, and the resulting expression is the definition of the return.
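For reference, my reading of the three lines in terms of the quantities from the GAE paper (with lam playing the role of $\lambda$) is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l} = \delta_t + \gamma\lambda \, \hat{A}_{t+1}$$

$$\text{returns}_t = \hat{A}_t + V(s_t)$$

With $\lambda = 1$ the last line reduces to the discounted Monte-Carlo return $G_t$, and with $\lambda = 0$ it is the one-step TD target $r_t + \gamma V(s_{t+1})$.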