intro-llm / intro-llm.github.io


On p. 175, PPO is on-policy, not off-policy #57

Open Mizar77 opened 1 month ago

Mizar77 commented 1 month ago

On p. 175, PPO is described as off-policy, and the PPO algorithm is derived via the importance-sampling technique used in off-policy methods. However, OpenAI describes PPO as on-policy, and its derivation is a first-order approximate solution of TRPO. (For details, see: https://spinningup.openai.com/en/latest/algorithms/ppo.html)
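For reference, a sketch of the PPO-Clip surrogate objective, following the notation of the Spinning Up page linked above: the ratio $r_t(\theta)$ has the form of an importance weight, which is why the off-policy/importance-sampling framing is tempting, but the data are collected with the current policy $\pi_{\theta_{\mathrm{old}}}$ and discarded after each update, so the algorithm is usually classified as on-policy.

```latex
% PPO clipped surrogate objective; r_t(\theta) is the probability ratio
% between the policy being optimized and the policy that collected the data.
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[
    \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
  \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```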

igeng commented 1 month ago

https://www.yejiefeng.com/articles/2024/03/10/1710048235803.html