The reward-to-go is actually an approximation of Q(s, a), not of V(s). By definition, V(s) is the expectation of Q(s, a) over all actions drawn from the policy. However, since estimating V(s) more accurately through sampling would be both time-consuming and high-variance, we have no choice but to accept this less accurate approach.
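For reference, this is the relationship I have in mind (standard definitions, so please correct me if I am misreading it):

$$
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\!\left[ Q^{\pi}(s_t, a_t) \right],
\qquad
Q^{\pi}(s_t, a_t) \approx \hat{Q}_t = \sum_{t'=t}^{T} r(s_{t'}, a_{t'}),
$$

where the single-rollout reward-to-go $\hat{Q}_t$ is a one-sample Monte Carlo estimate of $Q^{\pi}(s_t, a_t)$, not of $V^{\pi}(s_t)$.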
Or perhaps there is another explanation: if multiple label values correspond to the same state s, then the neural network handles it by averaging over the different label values during training. Is that correct?
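To illustrate that second explanation, here is a minimal sketch (the states, labels, and learning rate are made up for illustration, and a tabular value table stands in for the neural network) showing that minimizing squared error against several different reward-to-go labels for the same state pulls the fitted value toward their mean:

```python
import numpy as np

# State 0 is visited three times, state 1 twice, each with a different
# sampled reward-to-go label.
states = np.array([0, 0, 0, 1, 1])
labels = np.array([10.0, 4.0, 7.0, 2.0, 6.0])

V = np.zeros(2)   # one value estimate per state (stand-in for a value network)
lr = 0.1
for _ in range(2000):
    pred = V[states]
    grad = 2 * (pred - labels)                 # gradient of (pred - label)^2
    np.add.at(V, states, -lr * grad / len(states))

print(V)                                        # approx [7.0, 4.0]
print([labels[states == s].mean() for s in range(2)])  # per-state label means
```

The same averaging happens with a neural network trained with MSE, since the minimizer of squared error for a given input is the mean of the targets associated with that input.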