hari-sikchi / AWAC

Advantage weighted Actor Critic for Offline RL

The calculation details of the value function in the advantage function #4

Open · TianQi-777 opened this issue 3 years ago

TianQi-777 commented 3 years ago

The definition of the advantage function is A(s,a) = Q(s,a) - V(s). It seems that V(s) is not explicitly calculated in the code (here), as

# V(s)? Why? Isn't this the Q-value at an action from the current policy?
pi, logp_pi = self.ac.pi(o)
q1_pi = self.ac.q1(o, pi)
q2_pi = self.ac.q2(o, pi)
v_pi = torch.min(q1_pi, q2_pi)

# Q(s,a)
q1_old_actions = self.ac.q1(o, data['act'])
q2_old_actions = self.ac.q2(o, data['act'])
q_old_actions = torch.min(q1_old_actions, q2_old_actions)

# A(s,a)
adv_pi = q_old_actions - v_pi

Looking forward to your reply

linhlpv commented 10 months ago

Hi there @TianQi-777. I had the same question when I read the implementation code for the AWAC paper. I'm not sure, but I think what they do here is approximate the value function $V(s)$ with a single-sample Monte Carlo estimate: sample an action $a'$ from the current policy, use $\min(Q_1(s, a'), Q_2(s, a'))$ as the estimate of $V(s)$, and compute the advantage from that. Hope to discuss more too :D.