nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

Question about GAE #13

Closed CatIIIIIIII closed 4 years ago

CatIIIIIIII commented 4 years ago

Dear nik: I noticed that your code stores training data from different episodes into one buffer, but uses GAE to calculate the accumulated reward. I am a little confused here, because shouldn't GAE be applied within a single episode? Regards.

nikhilbarhate99 commented 4 years ago

Hey, this repo does not use GAE. The returns are simply the Monte Carlo estimate.

Either way, we store all the data in one buffer and also store the masks, i.e., is_terminals (dones). These masks are used to determine whether an episode has ended, so the returns are still calculated per episode even though the buffer spans multiple episodes.
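For illustration, here is a minimal sketch of how masked Monte Carlo returns can be computed over a buffer that mixes several episodes. The function name and variable names are illustrative assumptions, not necessarily the repo's exact code:

```python
# Sketch: discounted Monte Carlo returns over a buffer containing several episodes.
# Iterating in reverse and resetting at terminal steps keeps episodes independent.
def compute_returns(rewards, is_terminals, gamma=0.99):
    returns = []
    discounted_reward = 0.0
    for reward, is_terminal in zip(reversed(rewards), reversed(is_terminals)):
        if is_terminal:
            # Episode boundary (scanning backwards), so reset the running return.
            discounted_reward = 0.0
        discounted_reward = reward + gamma * discounted_reward
        returns.insert(0, discounted_reward)
    return returns

# Example: two episodes stored back-to-back in the same buffer.
rewards      = [1.0, 1.0, 1.0, 0.5, 0.5]
is_terminals = [False, False, True, False, True]
print(compute_returns(rewards, is_terminals))
```

Because the running return is zeroed at each terminal flag, rewards from one episode never bleed into the returns of another, which is why a single mixed buffer works here.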