intro-llm / intro-llm.github.io


On p. 175, PPO is on-policy, not off-policy #57

Open Mizar77 opened 1 month ago

Mizar77 commented 1 month ago

On p. 175, PPO is described as off-policy, and the PPO algorithm is derived via the importance-sampling technique used in off-policy methods. However, OpenAI describes PPO as on-policy, and its derivation is a first-order approximate solution of TRPO. (For details, see: https://spinningup.openai.com/en/latest/algorithms/ppo.html)
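For reference, a sketch of the PPO-Clip surrogate objective, following the notation of the Spinning Up page linked above: the ratio $r_t(\theta)$ has the form of an importance weight, which is why the off-policy/importance-sampling framing is tempting, but the data are collected with the current policy $\pi_{\theta_{\mathrm{old}}}$ and discarded after each update, so the algorithm is usually classified as on-policy.

```latex
% PPO clipped surrogate objective; r_t(\theta) is the probability ratio
% between the policy being optimized and the policy that collected the data.
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[
    \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
  \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```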

igeng commented 1 month ago

https://www.yejiefeng.com/articles/2024/03/10/1710048235803.html