Closed · xunzhang closed this issue 4 years ago
PPO Algorithm (paper):
for iteration=1, 2, . . . do
for actor=1, 2, . . . , N do
Run policy πθold in environment for T timesteps
Compute advantage estimates Â_1, . . . , Â_T
end for
Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
θold ← θ
end for
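The loop above can be sketched in Python. This is a structural sketch only: the rollout, the advantage estimator, and the surrogate update are placeholders (and the serial actor loop stands in for parallel workers), not this repo's code.

```python
import numpy as np

rng = np.random.default_rng(0)

N, T, K, M = 4, 128, 3, 64           # actors, horizon, epochs, minibatch size
assert M <= N * T                    # the paper requires M <= NT

def run_policy(theta, T):
    """Placeholder rollout: (states, actions, rewards) of length T."""
    return rng.normal(size=(T, 2)), rng.integers(0, 2, size=T), rng.normal(size=T)

def advantages(rewards, gamma=0.99):
    """Placeholder advantage estimate: plain discounted returns, no baseline."""
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        adv[t] = running
    return adv

theta = np.zeros(2)                  # dummy policy parameters
for iteration in range(2):
    batch_states, batch_adv = [], []
    for actor in range(N):           # serial stand-in for N parallel actors
        s, a, r = run_policy(theta, T)
        batch_states.append(s)
        batch_adv.append(advantages(r))
    S = np.concatenate(batch_states)          # NT samples in total
    A = np.concatenate(batch_adv)
    idx = np.arange(N * T)
    for epoch in range(K):                    # K epochs of minibatch SGD
        rng.shuffle(idx)
        for start in range(0, N * T, M):
            mb = idx[start:start + M]
            pass  # optimize surrogate L wrt theta on minibatch mb (omitted)
    # theta_old <- theta would happen here before the next iteration
```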
In this repo, N = 1 (one actor) and batch size M = T, i.e., the sample is the entire batch.
Given that the performance of the algorithm depends on the environment, I am not sure how this will affect its overall efficiency. It is a hyperparameter and needs to be tuned for the environment.
But using parallel workers (N > 1) is generally more useful, since the expectations are approximated with experience generated under different random seeds.
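A minimal sketch of that idea (the ToyEnv class and the seeding scheme here are illustrative assumptions, not the repo's code): each worker owns an RNG with a distinct seed, so the pooled batch mixes decorrelated trajectories.

```python
import numpy as np

class ToyEnv:
    """Stand-in environment; each worker gets its own seeded RNG."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)

    def rollout(self, T):
        return self.rng.normal(size=T)   # placeholder trajectory of length T

N, T = 4, 8
workers = [ToyEnv(seed=1000 + i) for i in range(N)]  # distinct seed per actor
trajs = [w.rollout(T) for w in workers]

# Different seeds -> decorrelated experience, so the concatenated batch of NT
# samples approximates the expectation better than one worker's T samples.
batch = np.concatenate(trajs)
```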
In PPO.py, T = 300 (max_timesteps=300) and M = 2000 (update_timestep=2000), so why did you say M = T? I'm a little confused here. Do you mean to simulate multiple actors (N) by setting M > T? So in the PPO.py example, 300 (T) × 6.67 (N) ≈ 2000 (M). Correct me if I am wrong.
Update timestep (T) = 2000, mini-batch size (M) = 2000.
max_timesteps is the maximum number of timesteps in ONE episode. One update may include experience from multiple episodes.
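The interaction between the two settings can be sketched like this (episode lengths are random placeholders; the buffer stands in for the stored (state, action, reward, ...) tuples and may span episode boundaries before an update fires):

```python
import numpy as np

rng = np.random.default_rng(0)
max_timesteps = 300      # cap on the length of ONE episode
update_timestep = 2000   # update after this many total environment steps

memory = []              # experience buffer; may span several episodes
total_steps = 0
updates = 0

for episode in range(100):
    ep_len = int(rng.integers(50, max_timesteps + 1))  # placeholder episode
    for t in range(ep_len):
        memory.append(t)           # stand-in for one (s, a, r, ...) tuple
        total_steps += 1
        if total_steps % update_timestep == 0:
            updates += 1           # run the PPO update here, then clear buffer
            memory.clear()
```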
for iteration=1, 2, . . . do
for actor=1, 2, . . . , N do
Run policy πθold in environment for T timesteps
Compute advantage estimates Â_1, . . . , Â_T
end for
Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
θold ← θ
end for
Using multiple actors (N) means running multiple instances of actors (parallel / multithreaded), all collecting experience of length T. For updating, the mini-batch size (M) can NOT be greater than the total batch size (NT).
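The M ≤ NT constraint can be sketched like this (numbers and the buffer contents are illustrative): the NT pooled samples are shuffled once per epoch and sliced into minibatches of size M.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 500                 # actors, horizon per actor
M = 250                       # minibatch size; must satisfy M <= N * T
assert M <= N * T

buffer = rng.normal(size=(N * T, 3))   # pooled experience from all N actors

# One epoch of minibatch updates over the pooled NT samples:
order = rng.permutation(N * T)
minibatches = [buffer[order[i:i + M]] for i in range(0, N * T, M)]
```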
I see. I misread max_timesteps in your code as T in the paper. I think update_timestep in your code is both M and T.
One more confusion with multiple actors: it makes sense to use parallel environments, but why can't I use N*T sequential steps to simulate parallel environments?
All the instances run with different random seeds. This leads to more varied experience, thus approximating the expectation better.
Source: skip to 54:19 of (https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=6)
Hi, in the original PPO paper, it runs T timesteps (e.g., with 1 actor) and then updates for K epochs with mini-batch size M sampled from the T-step memory. But in your implementation, you do an update every M steps within the T-step memory. There are two differences in my understanding.
I wonder whether the above differences affect the performance of PPO, and how?
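The two schemes I'm comparing can be sketched by counting gradient steps (numbers are illustrative, and "repo_style" is my reading of this implementation, i.e., full-batch updates for K epochs):

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, M = 2000, 4, 500

def paper_style(memory):
    """PPO paper: after T steps, K epochs of shuffled minibatches of size M."""
    grads = 0
    for _ in range(K):
        order = rng.permutation(len(memory))
        for i in range(0, len(memory), M):
            grads += 1          # one gradient step per minibatch
    return grads

def repo_style(memory):
    """This repo (N = 1): after update_timestep steps, K epochs on the FULL batch."""
    grads = 0
    for _ in range(K):
        grads += 1              # one gradient step over the entire buffer
    return grads

memory = rng.normal(size=T)
# paper_style takes K * T / M gradient steps; repo_style takes only K.
```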
Thanks.