nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

(Solved) No env.reset() at the end of each training epoch. #67

Open slDeng1003 opened 4 months ago

slDeng1003 commented 4 months ago

【Existing code】 The environment is only reset at the beginning of the training loop, i.e. env.reset() is only called at the first epoch. 【Right (maybe) training paradigm】 I checked OpenAI Spinning Up's implementation of PPO (https://github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ppo/ppo.py); they reset the env at the end of each epoch (which is the same as resetting it at the beginning of each epoch).

Correct me if I'm wrong :)

P.S.: It's still nice code!

ZheruiHuang commented 3 months ago

Hello! I think the training code is logically the same as OpenAI's.

Maybe you are misled by these two similar loops: https://github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/ppo/ppo.py#L299 and https://github.com/nikhilbarhate99/PPO-PyTorch/blob/728cce83d7ab628fe2634eabcdf3239997eb81dd/train.py#L173 In the former (OpenAI's) implementation, the loop runs for more than one episode, and it calls reset whenever an episode is done (but does not jump out of the loop). In the latter (this repo's) implementation, the loop runs only one episode: when the episode is done, it breaks out of the loop, and the env is reset before the next episode begins.
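
For concreteness, here is a minimal sketch of the two loop structures (not the actual code of either repo; it assumes a classic Gym-style `env` and a hypothetical `agent.select_action()`):

```python
# Minimal sketch of the two loop structures discussed above (not the
# actual code of either repo). Assumes a classic Gym-style env and a
# hypothetical agent with a select_action() method.

# Spinning Up style: one loop collects a fixed number of steps per epoch,
# possibly spanning several episodes; env.reset() is called *inside* the
# loop whenever an episode ends.
def collect_fixed_steps(env, agent, steps_per_epoch):
    state = env.reset()
    for t in range(steps_per_epoch):
        action = agent.select_action(state)
        state, reward, done, _ = env.step(action)
        if done:
            state = env.reset()  # reset and keep collecting in the same loop

# This repo's style: the inner loop runs a single episode and breaks on
# done; env.reset() is called at the top of the outer loop, i.e. before
# the next episode starts. The env is reset at the same points either way.
def collect_one_episode_per_iteration(env, agent, max_ep_len, num_episodes):
    for ep in range(num_episodes):
        state = env.reset()  # reset before every episode
        for t in range(max_ep_len):
            action = agent.select_action(state)
            state, reward, done, _ = env.step(action)
            if done:
                break  # leave the inner loop; the next iteration resets above
```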

Hope it makes sense to you!

slDeng1003 commented 3 months ago

Dear Huang, I appreciate your reply. I have checked the code and found that you are right. Thank you again for your help!👍 @ZheruiHuang