openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

VPG implementation: the value function fitting seems inaccurate by using reward-to-go as target values #334

Open robbine opened 3 years ago

robbine commented 3 years ago

The reward-to-go is actually an approximation of Q(s, a), not of V(s). By definition, V(s) is the expectation of Q(s, a) over all actions. However, since estimating V(s) more accurately through sampling would be both time-consuming and high-variance, we are left with no choice but to use this inaccurate target. Or perhaps there is another explanation: if multiple target values correspond to the same state s, the neural network will average over them during training, so the fit tends toward V(s). Is that correct?
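To illustrate the averaging argument, here is a minimal sketch (not Spinning Up's actual code; the toy state, rewards, and network sizes are made up for illustration). Two trajectories pass through the same state but take different actions, so their reward-to-go targets are noisy samples of Q(s, a1) and Q(s, a2). Regressing a value network on both with an MSE loss pulls the prediction toward the mean of the targets, i.e. toward an estimate of E_a[Q(s, a)] = V(s) under the behavior policy.

```python
import numpy as np
import torch
import torch.nn as nn

def reward_to_go(rews, gamma=0.99):
    """Discounted reward-to-go: a single-sample Monte Carlo estimate of Q(s_t, a_t)."""
    rtg = np.zeros(len(rews), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rews))):
        running = rews[t] + gamma * running
        rtg[t] = running
    return rtg

# Hypothetical toy setup (values chosen only for illustration).
obs_dim = 3
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

# Two trajectories through the same state s with different actions and returns:
# their reward-to-go values are conflicting regression targets for the same input.
s = np.ones(obs_dim, dtype=np.float32)
targets = np.array([reward_to_go([1.0, 0.0, 0.0])[0],   # sample of Q(s, a1)
                    reward_to_go([0.0, 0.0, 1.0])[0]])  # sample of Q(s, a2)
obs = torch.as_tensor(np.stack([s, s]))
rtg = torch.as_tensor(targets, dtype=torch.float32)

# MSE regression: with conflicting targets for the same input, the minimizer is
# their mean, so the fitted value moves toward E_a[Q(s, a)] = V(s).
for _ in range(200):
    optimizer.zero_grad()
    loss = ((value_net(obs).squeeze(-1) - rtg) ** 2).mean()
    loss.backward()
    optimizer.step()

print(value_net(torch.as_tensor(s)).item(), rtg.mean().item())  # prediction ~ mean of targets
```

Under this reading, the "inaccuracy" is per-sample noise rather than bias: each reward-to-go target is a sample of Q(s, a) for the action actually taken, and the regression averages over the policy's action distribution.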