nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

Question on multiple actors #34

Closed · pengzhi1998 closed this 4 years ago

pengzhi1998 commented 4 years ago

Hi @nikhilbarhate99, thank you for your great work! I found this in your README: "Number of actors for collecting experience = 1. This could be easily changed by creating multiple instances of ActorCritic networks in the PPO class and using them to collect experience (like A3C and standard PPO)." But how exactly would I change it? What would I need to modify in the gradient ascent part if there are multiple actors running in parallel? Thank you for your help!

nikhilbarhate99 commented 4 years ago

I am not sure how other repos implement parallel RL (if you want robust, working code, I would suggest using existing implementations like RLlib or Ilya Kostrikov's repo).

A simplistic approach would be (see the sketch after this list):

  1. Initialize a shared_policy with random weights
  2. Create and run N new instances of the policy (each initialized with the shared_policy's weights) in N environment instances, one policy instance per environment, in parallel. In each instance, collect experience (one or more episodes), compute returns (Monte Carlo, GAE, bootstrapping, etc.), and store everything in a shared_buffer.
  3. Update the shared_policy weights using the experience in shared_buffer.
  4. Jump to step 2
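
For concreteness, here is a minimal, hypothetical sketch of that loop in PyTorch. A few caveats: `PolicyNet`, `DummyEnv`, and `collect_rollout` are stand-ins invented for illustration (they are not from this repo), the collectors below run sequentially (a real implementation would move that loop into `torch.multiprocessing` workers), and the update is a plain actor-critic step rather than PPO's clipped objective.

```python
import copy

import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Tiny stand-in for the repo's ActorCritic network (illustrative only)."""

    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh())
        self.pi = nn.Linear(32, n_actions)  # actor head
        self.v = nn.Linear(32, 1)           # critic head

    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h)


class DummyEnv:
    """Placeholder environment so the sketch runs without gym (illustrative only)."""

    def reset(self):
        self.t = 0
        return torch.randn(4)

    def step(self, action):
        self.t += 1
        return torch.randn(4), float(torch.randn(())), self.t >= 20


def collect_rollout(policy, env, gamma=0.99):
    """Run one episode and return (observations, actions, Monte Carlo returns)."""
    obs_list, act_list, rew_list = [], [], []
    obs, done = env.reset(), False
    while not done:
        with torch.no_grad():
            dist, _ = policy(obs)
        action = dist.sample()
        obs_list.append(obs)
        act_list.append(action)
        obs, reward, done = env.step(action.item())
        rew_list.append(reward)
    returns, g = [], 0.0
    for r in reversed(rew_list):             # discounted returns, computed backwards
        g = r + gamma * g
        returns.insert(0, g)
    return torch.stack(obs_list), torch.stack(act_list), torch.tensor(returns)


if __name__ == "__main__":
    N = 4                                     # number of actors / env instances
    shared_policy = PolicyNet()               # step 1: shared weights
    optimizer = torch.optim.Adam(shared_policy.parameters(), lr=1e-3)
    envs = [DummyEnv() for _ in range(N)]

    for update in range(3):
        # step 2: give each collector a copy of the shared weights and gather experience.
        # The collectors run sequentially here; swap this loop for torch.multiprocessing
        # workers to make the collection truly parallel.
        shared_buffer = []
        for env in envs:
            local_policy = copy.deepcopy(shared_policy)
            shared_buffer.append(collect_rollout(local_policy, env))

        # step 3: update the shared policy on the pooled experience.
        # A single plain actor-critic step stands in for PPO's clipped update.
        obs = torch.cat([b[0] for b in shared_buffer])
        acts = torch.cat([b[1] for b in shared_buffer])
        rets = torch.cat([b[2] for b in shared_buffer])
        dist, values = shared_policy(obs)
        values = values.squeeze(-1)
        advantages = (rets - values).detach()
        policy_loss = -(dist.log_prob(acts) * advantages).mean()
        value_loss = (rets - values).pow(2).mean()
        optimizer.zero_grad()
        (policy_loss + 0.5 * value_loss).backward()
        optimizer.step()
        # step 4: loop back to step 2 with the updated shared weights
```

The key point is the separation between steps 2 and 3: each collector only ever reads a frozen copy of the shared weights, so the gradient update touches nothing but shared_policy.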

EDIT: one of the forks of this repo has been modified into a parallel implementation.

pengzhi1998 commented 4 years ago

Thank you very much! I'll take a look.