hari-sikchi / AWAC

Advantage weighted Actor Critic for Offline RL

The calculation details of the value function in the advantage function #4

Open · TianQi-777 opened this issue 3 years ago

TianQi-777 commented 3 years ago

The definition of the advantage function is A(s,a) = Q(s,a) - V(s). It seems that V(s) is not explicitly calculated in the code (here), as

# V(s)? Why? Isn't this the Q-value at an action from the current policy?
pi, logp_pi = self.ac.pi(o)
q1_pi = self.ac.q1(o, pi)
q2_pi = self.ac.q2(o, pi)
v_pi = torch.min(q1_pi, q2_pi)

# Q(s,a)
q1_old_actions = self.ac.q1(o, data['act'])
q2_old_actions = self.ac.q2(o, data['act'])
q_old_actions = torch.min(q1_old_actions, q2_old_actions)

# A(s,a)
adv_pi = q_old_actions - v_pi

Looking forward to your reply

linhlpv commented 10 months ago

Hi there @TianQi-777. I had the same question when I read the implementation code for the AWAC paper. I'm not sure, but I think what they do here is approximate the value function $V(s)$ with a single-sample Monte Carlo estimate: sample an action $a'$ from the current policy, use $\min(Q_1(s, a'), Q_2(s, a'))$ as the estimate of $V(s)$, and compute the advantage from that. Hope to discuss more too :D.