raisimTech / raisimLib


Definition of the advantage function in the runner.py #410

Closed ClaudioChiariello closed 1 year ago

ClaudioChiariello commented 1 year ago

Hello everyone, I have been trying to understand this portion of code for days. In storage.py, inside the raisimGymTorch folder, there is a function that computes the return. I think the return it refers to is $$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$$

delta = self.rewards[step] + next_is_not_terminal * gamma * next_values - self.values[step] 
advantage = delta + next_is_not_terminal * gamma * lam * advantage
self.returns[step] = advantage + self.values[step] 
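For context, I believe these three lines sit inside a backward pass over the stored trajectory. Below is a minimal self-contained sketch of how I read that loop (the function name, the `dones` handling, and `last_value` are my assumptions, not the actual storage.py code):

```python
import numpy as np

def compute_returns(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Sketch of a backward GAE pass; hypothetical reconstruction, not raisimGymTorch's code."""
    num_steps = len(rewards)
    returns = np.zeros(num_steps)
    advantage = 0.0
    for step in reversed(range(num_steps)):
        next_is_not_terminal = 1.0 - dones[step]
        next_values = last_value if step == num_steps - 1 else values[step + 1]
        # one-step TD error of the value function: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[step] + next_is_not_terminal * gamma * next_values - values[step]
        # recursive accumulation of discounted, lambda-weighted TD errors (reset after terminal states)
        advantage = delta + next_is_not_terminal * gamma * lam * advantage
        # adding V(s_t) back turns the advantage estimate into a return estimate
        returns[step] = advantage + values[step]
    return returns
```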

self.values is the output of the critic, which is approximated by a neural network. So I think it is a Q-value: to evaluate whether the actor's action is good, we need a Q-value. But I suspect that in this code the output of the critic is treated as a value function, because of the third line of code.

I find the second line a bit odd. Where does that update rule come from? My knowledge is limited to the fact that $$A(s,a) = Q(s,a) - V(s)$$ and that we can update the Q-value using the temporal-difference error: $$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r_{t+1} + \gamma Q(s',a') - Q(s,a) \right)$$ My questions are: how is this related to the second line of the code, and once I have the advantage function, why is the Q-value (or value function) added to the advantage to obtain the return? Thank you for the help =)

jhwangbo commented 1 year ago

The advantage you wrote is the definition. We can estimate it in various ways. delta is simply the difference between the value estimated from this sample and the estimate from the network. The second line is generalized advantage estimation; you should read about it here: https://arxiv.org/abs/1506.02438. In the third line, the subtracted self.values is added back, and the resulting expression is the definition of the return.
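For reference, my reading of the three lines in terms of the quantities from the GAE paper (with lam playing the role of $\lambda$) is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l} = \delta_t + \gamma\lambda \, \hat{A}_{t+1}$$

$$\text{returns}_t = \hat{A}_t + V(s_t)$$

With $\lambda = 1$ the last line reduces to the discounted Monte-Carlo return $G_t$, and with $\lambda = 0$ it is the one-step TD target $r_t + \gamma V(s_{t+1})$.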