The reward-to-go is actually an approximation of Q(s, a), not of V(s). By definition, V(s) is the expectation of Q(s, a) over all actions drawn from the policy. However, since estimating V(s) more accurately through sampling would be both time-consuming and high-variance, we have no choice but to accept this less accurate approach.
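For reference, this is the relationship I have in mind (standard definitions, so please correct me if I am misreading it):

$$
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\!\left[ Q^{\pi}(s_t, a_t) \right],
\qquad
Q^{\pi}(s_t, a_t) \approx \hat{Q}_t = \sum_{t'=t}^{T} r(s_{t'}, a_{t'}),
$$

where the single-rollout reward-to-go $\hat{Q}_t$ is a one-sample Monte Carlo estimate of $Q^{\pi}(s_t, a_t)$, not of $V^{\pi}(s_t)$.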
Or perhaps there is another explanation: if multiple label values correspond to the same state s, then the neural network handles it by averaging over the different label values during training. Is that correct?
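To illustrate that second explanation, here is a minimal sketch (the states, labels, and learning rate are made up for illustration, and a tabular value table stands in for the neural network) showing that minimizing squared error against several different reward-to-go labels for the same state pulls the fitted value toward their mean:

```python
import numpy as np

# State 0 is visited three times, state 1 twice, each with a different
# sampled reward-to-go label.
states = np.array([0, 0, 0, 1, 1])
labels = np.array([10.0, 4.0, 7.0, 2.0, 6.0])

V = np.zeros(2)   # one value estimate per state (stand-in for a value network)
lr = 0.1
for _ in range(2000):
    pred = V[states]
    grad = 2 * (pred - labels)                 # gradient of (pred - label)^2
    np.add.at(V, states, -lr * grad / len(states))

print(V)                                        # approx [7.0, 4.0]
print([labels[states == s].mean() for s in range(2)])  # per-state label means
```

The same averaging happens with a neural network trained with MSE, since the minimizer of squared error for a given input is the mean of the targets associated with that input.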