TianQi-777 opened this issue 3 years ago (status: Open)
Hi there @TianQi-777. I had the same question when I read the implementation code for the AWAC paper. I'm not sure, but my guess is that the code approximates the V-function at state $s$ with a Monte Carlo estimate: it samples an action $a'$ from the current policy, uses $Q(s, a')$ as the estimate of $V(s)$, and computes the advantage from that estimate. Happy to discuss more too :D.
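To make that guess concrete, here is a minimal sketch of the idea in plain Python. It is not the AWAC repository's code: `q_value`, `sample_policy_action`, and `estimate_advantage` are hypothetical names, and the critic and policy are toy stand-ins chosen so that the true value $V(s) = \mathbb{E}_{a' \sim \pi}[Q(s, a')] = -0.25$ is known in closed form.

```python
import random


def q_value(state, action):
    # Hypothetical toy critic (assumption): Q(s, a) = -(a - s)^2.
    return -(action - state) ** 2


def sample_policy_action(state, rng):
    # Hypothetical toy policy (assumption): Gaussian centred on the state, std 0.5.
    return rng.gauss(state, 0.5)


def estimate_advantage(state, action, rng, num_samples=1):
    # Monte Carlo estimate of V(s): average Q(s, a') over actions a' ~ pi(. | s).
    # With num_samples=1 this is the single-sample estimate V(s) ~= Q(s, a').
    q_samples = [
        q_value(state, sample_policy_action(state, rng))
        for _ in range(num_samples)
    ]
    v_estimate = sum(q_samples) / num_samples
    # Advantage: A(s, a) = Q(s, a) - V(s), with V(s) replaced by its estimate.
    advantage = q_value(state, action) - v_estimate
    return advantage, v_estimate


rng = random.Random(0)
single_adv, single_v = estimate_advantage(1.0, 1.0, rng, num_samples=1)
many_adv, many_v = estimate_advantage(1.0, 1.0, rng, num_samples=20000)
print("single-sample V estimate:", single_v)
print("20000-sample V estimate:", many_v)
```

A single sample gives an unbiased but noisy estimate of $V(s)$. Averaging more samples reduces the variance at the cost of extra critic evaluations.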
The definition of the advantage function is $A(s,a) = Q(s,a) - V(s)$. It seems that $V(s)$ is not explicitly calculated in the code (here), as
Looking forward to your reply.