Closed · pengzhi1998 closed this issue 4 years ago
I am not sure how other repos implement parallel RL. (If you want robust, working code, I would suggest using existing implementations like RLlib or Ilya Kostrikov's repo.)
A simplistic approach would be:

1. Run the `shared_policy` (initialized with random weights) in N env instances (i.e. 1 policy instance in 1 env instance) in parallel. In each instance, collect the experience (one or many episodes per instance), calculate returns (with Monte Carlo or GAE or bootstrapping etc.), and store them in a `shared_buffer`.
2. Update the `shared_policy` weights with the experience in the `shared_buffer`.

EDIT: one of the forks of this repo has been modified into a parallel implementation.
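The two steps above can be sketched as follows. This is a minimal toy illustration, not code from this repo: the names `shared_policy`, `shared_buffer`, and `collect_rollout` are hypothetical, threads stand in for separate worker processes, and a random-walk "env" with a placeholder update replaces a real environment and PPO's gradient ascent.

```python
# Hypothetical sketch of the parallel collection scheme: N workers run the
# same shared policy in their own env instances, compute Monte Carlo returns,
# and pool experience into one shared buffer; then a single update follows.
import random
import threading
from concurrent.futures import ThreadPoolExecutor

N_ENVS = 4           # number of parallel env instances
EPISODE_LEN = 5      # steps collected per env instance
GAMMA = 0.99

shared_policy = {"weights": [0.0]}   # one policy shared by all workers
shared_buffer = []                   # pooled (state, action, return) tuples
buffer_lock = threading.Lock()

def monte_carlo_returns(rewards, gamma=GAMMA):
    """Discounted returns, computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def collect_rollout(env_id):
    """One worker: run the shared policy in its own toy env for one episode."""
    rng = random.Random(env_id)
    states, actions, rewards = [], [], []
    state = 0.0
    for _ in range(EPISODE_LEN):
        action = 1 if rng.random() > 0.5 else -1   # toy stochastic policy
        reward = -abs(state)                       # toy reward: stay near 0
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state += action
    rets = monte_carlo_returns(rewards)
    with buffer_lock:                              # store in the shared buffer
        shared_buffer.extend(zip(states, actions, rets))

# Step 1: collect experience from N env instances in parallel.
with ThreadPoolExecutor(max_workers=N_ENVS) as pool:
    list(pool.map(collect_rollout, range(N_ENVS)))

# Step 2: one update of shared_policy from the pooled experience
# (a placeholder step where PPO's clipped gradient ascent would go).
mean_return = sum(g for _, _, g in shared_buffer) / len(shared_buffer)
shared_policy["weights"][0] += 0.01 * mean_return
```

Note that the gradient update itself is unchanged from the single-actor case: only data collection is parallelized, and the optimizer sees one pooled batch.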
Thank you very much! I'll take a look.
Hi @nikhilbarhate99, thank you for your great work! However, I found this in your README:

> Number of actors for collecting experience = 1. This could be easily changed by creating multiple instances of ActorCritic networks in the PPO class and using them to collect experience (like A3C and standard PPO).
But how do I change it? What should I modify in the gradient ascent part if there are multiple actors running in parallel? Thank you for your help!