Closed ClaudioChiariello closed 1 year ago
The advantage you wrote is the definition; we can estimate it in various ways.
`delta` is simply the difference between the estimated value from this sample and the estimate from the network. The second line is generalized advantage estimation (GAE); you can read about it here:
https://arxiv.org/abs/1506.02438
In the third line, the previously subtracted `self.values` is added back; the resulting expression is the definition of the return.
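The three lines described above (TD error, GAE recursion, return) can be sketched as follows. This is a minimal illustration, not the actual raisimGymTorch code: the function name, argument layout, and NumPy usage are my own assumptions.

```python
import numpy as np

def compute_gae_returns(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Sketch of Generalized Advantage Estimation (Schulman et al., 2015).

    rewards:    shape (T,), rewards r_t from a rollout
    values:     shape (T,), critic estimates V(s_t)
    last_value: scalar bootstrap value V(s_T) for the final state
    """
    T = len(rewards)
    values_ext = np.append(values, last_value)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error ("delta"): difference between the one-sample estimate
        # r_t + gamma * V(s_{t+1}) and the network's estimate V(s_t)
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Adding the subtracted V(s_t) back turns the advantage into a
    # return estimate: R_t = A_t + V(s_t)
    returns = advantages + values
    return advantages, returns
```

With `lam=1` and `gamma=1` the advantage collapses to the Monte Carlo return minus the value, and with `lam=0` it collapses to the one-step TD error, which shows how GAE interpolates between the two.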
Hello everyone, I have been trying to understand this portion of code for days. In `storage.py` in the raisimGymTorch folder there is a function in which we compute the return. I think the return it means is: $$G_t = r_t + \gamma r_{t+1} + \dots$$
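That definition of the return can be written out as a short sketch (plain Python; the function name is mine, not from the repository):

```python
def discounted_return(rewards, gamma=0.99):
    """Monte Carlo return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

    Computed for every timestep by accumulating the rewards backwards.
    """
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]
```

With `gamma=1.0` this is just the sum of remaining rewards, e.g. `discounted_return([1.0, 2.0, 3.0], gamma=1.0)` gives `[6.0, 5.0, 3.0]`.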
`self.values` is the output of the critic, approximated by a neural network, so I think it is a Q-value. I mean, to evaluate whether the actor's action is good we need a Q-value, but I suspect that in this code the critic's output is treated as a value function, because of the third row of code.
I find the second row a bit odd. Where does that update rule come from? My knowledge is limited to the fact that $$A(s,a) = Q(s,a) - V(s)$$ and that we can update the Q-value using the temporal-difference error: $$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r_{t+1} + \gamma Q(s',a') - Q(s,a) \right)$$ My question is: how is this related to the second row in the code? And once I have the advantage function, why does it add the Q-value (or value function) to the advantage in order to get the return? Thank you for the help =)
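The connection the answer above describes can be checked numerically. All numbers here are made up purely for illustration: the point is that $\delta = r + \gamma V(s') - V(s)$ is a one-sample estimate of $A(s,a) = Q(s,a) - V(s)$, so adding $V(s)$ back recovers a one-sample estimate of $Q(s,a)$, i.e. the return target.

```python
# Hypothetical one-step numbers, chosen only to illustrate the identity.
gamma = 0.9
r = 1.0          # reward observed after taking action a in state s
v_s = 2.0        # critic's estimate V(s)
v_s_next = 3.0   # critic's estimate V(s')

q_sample = r + gamma * v_s_next  # one-sample estimate of Q(s, a)
delta = q_sample - v_s           # advantage estimate: A ~ Q - V
ret = delta + v_s                # adding V(s) back recovers the return target

assert ret == q_sample           # advantage + value == return estimate
```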