Closed · xunzhang closed this issue 4 years ago
PPO Algorithm (paper):
for iteration=1, 2, . . . do
for actor=1, 2, . . . , N do
Run policy πθold in environment for T timesteps
Compute advantage estimates Â_1, . . . , Â_T
end for
Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
θold ← θ
end for
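The loop above can be sketched in Python. This is a structural sketch only: the rollout, the advantage estimator, and the surrogate update are placeholders (and the serial actor loop stands in for parallel workers), not this repo's code.

```python
import numpy as np

rng = np.random.default_rng(0)

N, T, K, M = 4, 128, 3, 64           # actors, horizon, epochs, minibatch size
assert M <= N * T                    # the paper requires M <= NT

def run_policy(theta, T):
    """Placeholder rollout: (states, actions, rewards) of length T."""
    return rng.normal(size=(T, 2)), rng.integers(0, 2, size=T), rng.normal(size=T)

def advantages(rewards, gamma=0.99):
    """Placeholder advantage estimate: plain discounted returns, no baseline."""
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        adv[t] = running
    return adv

theta = np.zeros(2)                  # dummy policy parameters
for iteration in range(2):
    batch_states, batch_adv = [], []
    for actor in range(N):           # serial stand-in for N parallel actors
        s, a, r = run_policy(theta, T)
        batch_states.append(s)
        batch_adv.append(advantages(r))
    S = np.concatenate(batch_states)          # NT samples in total
    A = np.concatenate(batch_adv)
    idx = np.arange(N * T)
    for epoch in range(K):                    # K epochs of minibatch SGD
        rng.shuffle(idx)
        for start in range(0, N * T, M):
            mb = idx[start:start + M]
            pass  # optimize surrogate L wrt theta on minibatch mb (omitted)
    # theta_old <- theta would happen here before the next iteration
```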
In this repo, N = 1 (one actor) and batch size M = T, i.e., the sample is the entire batch.
Given that the performance of the algorithm depends on the environment, I am not sure how this will affect its overall efficiency. It is a hyperparameter and needs to be tuned for the environment.
But using parallel workers (N > 1) is generally more useful, since the expectations are approximated with experience generated under different random seeds.
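A minimal sketch of that idea (the ToyEnv class and the seeding scheme here are illustrative assumptions, not the repo's code): each worker owns an RNG with a distinct seed, so the pooled batch mixes decorrelated trajectories.

```python
import numpy as np

class ToyEnv:
    """Stand-in environment; each worker gets its own seeded RNG."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)

    def rollout(self, T):
        return self.rng.normal(size=T)   # placeholder trajectory of length T

N, T = 4, 8
workers = [ToyEnv(seed=1000 + i) for i in range(N)]  # distinct seed per actor
trajs = [w.rollout(T) for w in workers]

# Different seeds -> decorrelated experience, so the concatenated batch of NT
# samples approximates the expectation better than one worker's T samples.
batch = np.concatenate(trajs)
```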
In PPO.py, T = 300 (max_timesteps=300) and M = 2000 (update_timestep=2000), so why did you say M = T? I'm a little confused here. Do you mean to simulate multiple actors (N) by setting M > T? So in the PPO.py example, 300 (T) × 6.67 (N) ≈ 2000 (M). Correct me if I am wrong.
Update timestep (T) = 2000, mini-batch size (M) = 2000.
max_timesteps is the maximum number of timesteps in ONE episode. One update may include experience from multiple episodes.
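The interaction between the two settings can be sketched like this (episode lengths are random placeholders; the buffer stands in for the stored (state, action, reward, ...) tuples and may span episode boundaries before an update fires):

```python
import numpy as np

rng = np.random.default_rng(0)
max_timesteps = 300      # cap on the length of ONE episode
update_timestep = 2000   # update after this many total environment steps

memory = []              # experience buffer; may span several episodes
total_steps = 0
updates = 0

for episode in range(100):
    ep_len = int(rng.integers(50, max_timesteps + 1))  # placeholder episode
    for t in range(ep_len):
        memory.append(t)           # stand-in for one (s, a, r, ...) tuple
        total_steps += 1
        if total_steps % update_timestep == 0:
            updates += 1           # run the PPO update here, then clear buffer
            memory.clear()
```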
for iteration=1, 2, . . . do
for actor=1, 2, . . . , N do
Run policy πθold in environment for T timesteps
Compute advantage estimates Â_1, . . . , Â_T
end for
Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
θold ← θ
end for
Using multiple actors (N) means running multiple instances of actors (parallel / multithreaded), all collecting experience of length T. For updating, the mini-batch size (M) can NOT be greater than the total batch size (NT).
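The M ≤ NT constraint can be sketched like this (numbers and the buffer contents are illustrative): the NT pooled samples are shuffled once per epoch and sliced into minibatches of size M.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 500                 # actors, horizon per actor
M = 250                       # minibatch size; must satisfy M <= N * T
assert M <= N * T

buffer = rng.normal(size=(N * T, 3))   # pooled experience from all N actors

# One epoch of minibatch updates over the pooled NT samples:
order = rng.permutation(N * T)
minibatches = [buffer[order[i:i + M]] for i in range(0, N * T, M)]
```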
I see. I misread max_timesteps in your code as T in the paper. I think update_timestep in your code is both M and T.
One more confusion with multiple actors: it makes sense to use parallel environments, but why can't I use N*T sequential steps to simulate parallel environments?
All the instances run with different random seeds. This leads to more varied experience, thus approximating the expectation better.
Source: skip to 54:19 of (https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=6)
Hi, in the original PPO paper, it runs T timesteps (e.g., with 1 actor) and then updates for K epochs with mini-batch size M sampled from the T-step memory. But in your implementation, you do an update every M steps within the T-step memory. There are two differences in my understanding.
I wonder whether the above differences affect the performance of PPO, and how?
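The two schemes I'm comparing can be sketched by counting gradient steps (numbers are illustrative, and "repo_style" is my reading of this implementation, i.e., full-batch updates for K epochs):

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, M = 2000, 4, 500

def paper_style(memory):
    """PPO paper: after T steps, K epochs of shuffled minibatches of size M."""
    grads = 0
    for _ in range(K):
        order = rng.permutation(len(memory))
        for i in range(0, len(memory), M):
            grads += 1          # one gradient step per minibatch
    return grads

def repo_style(memory):
    """This repo (N = 1): after update_timestep steps, K epochs on the FULL batch."""
    grads = 0
    for _ in range(K):
        grads += 1              # one gradient step over the entire buffer
    return grads

memory = rng.normal(size=T)
# paper_style takes K * T / M gradient steps; repo_style takes only K.
```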
Thanks.