tambetm / gymexperiments


A2C or A3C? #8

Closed slerman12 closed 5 years ago

slerman12 commented 5 years ago

Hi, I am learning how to implement RL algorithms and I found your code very intuitive. My only confusion is with the multiprocessing of your A2C implementation. I am not very familiar with Python multiprocessing, but is this synchronous or asynchronous? I can't find the logic in the code that causes the runners to synchronize before updating the trainer.

Also, if you would not mind helping me understand the implementation differences better: is this approach of running multiple processes, each with its own copy of the agent model, better in terms of performance or training time than the alternative of parallelizing multiple copies of the environment and running only one model (as in the OpenAI Baselines)?

tambetm commented 5 years ago

It is synchronous in the sense that the trainer waits for one batch from each runner before proceeding with training. It is not synchronous in the sense that the runners do not wait for training to finish; they keep stepping the environment. Each channel carries information in one direction only: experiences are sent from the runners to the trainer via FIFOs, and the model weights are regularly synced back to the runners through shared memory.
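For illustration, here is a minimal sketch of that pattern in plain Python multiprocessing; it is not the repository's code, and the queue sizes, weight layout, and dummy batches are placeholders. The trainer blocks until it has one batch from every runner (the synchronous part), while the runners keep producing batches and simply read the latest weights from shared memory whenever they start a new rollout (the asynchronous part).

```python
import multiprocessing as mp
import numpy as np

NUM_RUNNERS = 4
WEIGHT_SIZE = 10  # stand-in for the flattened model weights


def runner(queue, shared_weights, steps=100):
    for step in range(steps):
        # one-way sync trainer -> runner: grab the latest weights before a rollout
        weights = np.frombuffer(shared_weights.get_obj()).copy()
        # ... load `weights` into the local policy and do CPU forward passes ...
        batch = {"step": step, "obs": np.random.randn(5, 4)}  # dummy experience
        queue.put(batch)  # one-way FIFO runner -> trainer


def trainer(queues, shared_weights, updates=10):
    for _ in range(updates):
        # synchronous part: wait for exactly one batch from each runner
        batches = [q.get() for q in queues]
        # ... train on the combined batches (GPU) and flatten the new weights ...
        new_weights = np.random.randn(WEIGHT_SIZE)  # dummy updated weights
        with shared_weights.get_lock():
            shared_weights[:] = new_weights.tolist()  # publish to all runners


if __name__ == "__main__":
    shared_weights = mp.Array("d", WEIGHT_SIZE)
    queues = [mp.Queue(maxsize=2) for _ in range(NUM_RUNNERS)]
    runners = [mp.Process(target=runner, args=(q, shared_weights)) for q in queues]
    for p in runners:
        p.start()
    trainer(queues, shared_weights)
    for p in runners:
        p.terminate()
```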

Regarding multiple models vs one model: the idea was that the runners use the CPU for forward passes and the GPU is used only for training. My original use case was Minecraft, which is a real-time environment. In a real-time environment you cannot wait until all environments have finished their step; you have to act immediately, hence the independent runners.

Of course this introduces policy lag: the samples are not strictly on-policy any more. To be mathematically correct, one should use importance sampling as in PPO or IMPALA. I have a more elaborate implementation with PPO as well: https://github.com/tambetm/TSCL/tree/master/minecraft. Don't be scared by Minecraft, the implementation learns classic control and Atari games as well. But admittedly not as well as OpenAI Baselines.
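To make the importance-sampling remark concrete, here is a small sketch of PPO's clipped surrogate objective (not code from either repository; the names `logp_new`, `logp_old`, `adv` and the clip value are illustrative). The ratio exp(logp_new - logp_old) is the importance weight that corrects for the lag between the runner's behaviour policy and the trainer's current policy.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, clip_eps=0.2):
    # importance-sampling ratio pi_new(a|s) / pi_old(a|s) for the taken actions
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # PPO maximizes the clipped surrogate, so return its negative as a loss
    return -np.mean(np.minimum(unclipped, clipped))
```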

By now I have realized that the A2C implementation can be simplified by passing the advantages to the model as Keras sample_weight. And the PPO implementation has some bugs that might cause it to learn worse than OpenAI Baselines.
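As a rough sketch of that simplification (my reading of the comment, not the repository's code, and using tf.keras here rather than the older standalone Keras the repo targets): with a softmax policy head and sparse categorical cross-entropy, passing the advantages as `sample_weight` turns the per-sample loss into -advantage * log pi(a|s), which is exactly the policy-gradient term. The network sizes and the dummy batch below are placeholders.

```python
import numpy as np
from tensorflow import keras

num_actions = 4
policy = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(8,)),
    keras.layers.Dense(num_actions, activation="softmax"),
])
# cross-entropy of the taken action, scaled per sample by the advantage below
policy.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

obs = np.random.randn(32, 8).astype("float32")      # dummy observations
actions = np.random.randint(num_actions, size=32)   # actions actually taken
advantages = np.random.randn(32).astype("float32")  # estimated advantages

# sample_weight multiplies each sample's loss, giving -advantage * log pi(a|s)
policy.train_on_batch(obs, actions, sample_weight=advantages)
```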