openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

GAE and Critic Loss (PPO2) #1190

Open rohey opened 3 years ago

rohey commented 3 years ago

Hello. Can you please explain why you are using mb_returns = mb_advs (GAE) + mb_values as the returns to compute the critic loss? Shouldn't the value function approximately represent the discounted sum of rewards? E.g., R = gamma * R + rewards[i]; value_loss = value_loss + 0.5 * (R - values[i]).pow(2).
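For concreteness, here is a minimal sketch (not the actual baselines runner code; gae_returns / discounted_returns are just illustrative helpers) of the two critic targets I am comparing: the GAE-based mb_returns = mb_advs + mb_values versus the plain discounted sum of rewards.

```python
import numpy as np

def gae_returns(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """GAE(gamma, lam) advantages and the returns used as the critic target.

    dones[t] = 1.0 means the episode terminated after step t, so no
    bootstrapping across that boundary.
    """
    T = len(rewards)
    advs = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        lastgaelam = delta + gamma * lam * nonterminal * lastgaelam
        advs[t] = lastgaelam
    returns = advs + values  # i.e. mb_returns = mb_advs + mb_values
    return advs, returns

def discounted_returns(rewards, last_value, dones, gamma=0.99):
    """Plain discounted return R[t] = r[t] + gamma * R[t+1], bootstrapped with last_value."""
    T = len(rewards)
    returns = np.zeros(T)
    R = last_value
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R * (1.0 - dones[t])
        returns[t] = R
    return returns

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.normal(size=8)
    values = rng.normal(size=8)   # stand-in for the critic's current predictions
    dones = np.zeros(8)
    last_value = 0.3

    # With lam = 1 the GAE-based target telescopes to the plain discounted return ...
    _, target_lam1 = gae_returns(rewards, values, last_value, dones, lam=1.0)
    assert np.allclose(target_lam1, discounted_returns(rewards, last_value, dones))

    # ... but with lam < 1 the critic target depends on lambda and on the current value estimates.
    _, target = gae_returns(rewards, values, last_value, dones, lam=0.95)
    print(target - discounted_returns(rewards, last_value, dones))
```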

If I understand correctly, the value function depends on the parameter γ and not on the parameter λ, based on the paper https://arxiv.org/pdf/1506.02438.pdf. However, if I use the GAE(γ, λ) advantages to compute the returns and use those returns to train the critic, wouldn't the value function become V(γ, λ) instead of V(γ)? And if that is true, would the TD residual still be correctly computed with V(γ, λ)?
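If I read the GAE paper correctly, the target in question is what you get by adding V(s_t) back onto the GAE advantage, i.e. the (truncated) λ-return:

```latex
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} + V(s_t)
  = \sum_{l \ge 0} (\gamma\lambda)^{l}\,\delta_{t+l} + V(s_t),
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t).
```

For λ = 0 this is the one-step TD target r_t + γ V(s_{t+1}), and only for λ = 1 does it telescope to the plain discounted return, which is why I am asking whether training the critic on it gives V(γ, λ) rather than V(γ).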