hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

How does creation of multiple runners with A2C work? #1095

Closed: Pit-Storm closed this issue 3 years ago

Pit-Storm commented 3 years ago

Hi @hill-a ,

I am very confused about how runners are created when training an A2C model.

I consulted the code in a2c.py (especially the referenced line and its dependencies), but I am not able to find out how this code actually creates multiple synchronous runners (or workers, as they are called in the docs) in multiple copies of the environment.

What I read in the OpenAI blog post on A2C is that they are using multiple synchronous workers.

What I understand from the code is:

Is the multiple-worker setup done in 'def setup_model()' (line 118)? (I guess so, because self.n_batch is calculated there.) Or does the number of workers depend on the number of envs?
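For reference, the part of setup_model() I mean looks roughly like this (my paraphrase, not a verbatim copy of a2c.py):

```python
# paraphrased from setup_model() in a2c.py (stable-baselines, TF1 version)
# n_envs comes from the VecEnv passed to the model, n_steps is an A2C hyperparameter
self.n_batch = self.n_envs * self.n_steps  # size of one rollout/update batch
```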

(A side question to this: OpenAI says that they implemented a deterministic variant of A2C. So why is there a switch for deterministic actions? I really don't get it...)

Hopefully someone can understand my confusion about these things...

Thanks for helping!

araffin commented 3 years ago

Hello,

I consulted the code in a2c.py (especially the referenced line and its dependencies), but I am not able to find out how this code actually creates multiple synchronous runners (or workers, as they are called in the docs) in multiple copies of the environment.

have you read the documentation? https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html

Anyway, I recommend using SB3 now: https://github.com/DLR-RM/stable-baselines3 (easier to read and same API)
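For example, in stable-baselines (the TF1 version discussed here), a minimal sketch would look like this (CartPole-v1 and the number of copies are arbitrary choices):

```python
import gym

from stable_baselines import A2C
from stable_baselines.common.vec_env import DummyVecEnv

n_envs = 4  # number of parallel workers == number of environment copies
env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(n_envs)])

# A2C collects n_steps transitions from each of the n_envs environments per update
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=25000)
```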

that they implemented a deterministic variant of A2C. So why is there a switch for deterministic actions? I really don't get it...

I think you are confusing "deterministic" in the sense of a "deterministic program" (when processes run asynchronously, execution is not deterministic; you can read more about multi-threading/multi-processing for that) with a stochastic vs. deterministic policy (and for that, please read more about RL, see the resources in the doc ;)).
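As an illustration of that switch (a minimal sketch; CartPole-v1 and the timestep budget are arbitrary choices):

```python
import gym

from stable_baselines import A2C

env = gym.make("CartPole-v1")
model = A2C("MlpPolicy", env).learn(total_timesteps=5000)

obs = env.reset()
# sample an action from the stochastic policy (default behavior)
action, _states = model.predict(obs, deterministic=False)
# take the most likely action instead of sampling
action, _states = model.predict(obs, deterministic=True)
```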

Pit-Storm commented 3 years ago

Thanks for the fast reply.

I read the docs before, yes. But either I am overlooking something or it is not explicitly pointed out what this has to do with A2C's workers. Even in the A2C section of the docs nothing is said about this point, and I cannot find a suitably named parameter to set the number of workers...

The usage of VecEnvs is to parallelize computation for heavy environments (SubprocVecEnv), and DummyVecEnv is just a wrapper for when non-heavy envs are used. So I don't see how this helps me with A2C workers...

Sadly I am tied to a specific TF version on the university's machines, so I am not able to use SB3.

Thank you for the hint about the deterministic program. That clarifies it!

araffin commented 3 years ago

The usage of VecEnvs is to parallelize computation for heavy environments (SubprocVecEnv), and DummyVecEnv is just a wrapper for when non-heavy envs are used. So I don't see how this helps me with A2C workers...

You should probably take a look at the "Multiprocessing" notebook from our tutorial: https://github.com/araffin/rl-tutorial-jnrr19

A2C workers are the environments of the VecEnv
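For example, following the pattern from the notebook (a sketch assuming CartPole-v1 and 4 workers):

```python
import gym

from stable_baselines import A2C
from stable_baselines.common.vec_env import SubprocVecEnv


def make_env(env_id, rank, seed=0):
    """Return a thunk that creates and seeds one environment instance."""
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    return _init


if __name__ == "__main__":
    n_envs = 4  # each environment in the VecEnv is one A2C worker
    env = SubprocVecEnv([make_env("CartPole-v1", i) for i in range(n_envs)])
    model = A2C("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=25000)
```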

Pit-Storm commented 3 years ago

I am aware of how multiprocessing works in stable_baselines. But your replies showed me that I have to sharpen the question.

A2C workers are the environments of the VecEnv

If multiple workers with A2C are only used when one passes multiple environments to the model, this should be made clearer in the A2C part of the documentation. The point is that the multiple workers are a key element of the algorithm: they are the reason why no experience replay is needed, because the workers decorrelate the mini-batches used to train the network.

My point is about concise language: in the A2C case, multiple workers are explicitly not used for performance reasons. They are used to avoid experience replay.

This should be the case in all other algorithms that do not use another technique to decorrelate mini-batches.

I think this should be taken into account when writing about vectorized environments in the docs.

(Tell me if I am wrong with the decorrelation argument, or if I missed a paper showing that we don't need it anymore. I am open to suggestions :-) )
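To illustrate the decorrelation point in toy form (the shapes below are hypothetical and this is not the exact buffer layout in a2c.py): each update batch stacks n_steps transitions from n_envs independent environments, so consecutive samples in the batch come from different trajectories.

```python
import numpy as np

n_envs, n_steps, obs_dim = 4, 5, 8  # hypothetical sizes, not library defaults

# one rollout: each of the n_envs workers contributes n_steps consecutive transitions
rollout = np.zeros((n_envs, n_steps, obs_dim))
# ... filled by stepping the VecEnv n_steps times ...

# flattening mixes n_envs independent trajectories into one update batch,
# which is what decorrelates the samples compared to a single sequential env
batch = rollout.reshape(n_envs * n_steps, obs_dim)
```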

araffin commented 3 years ago

I am aware of how multiprocessing works in stable_baselines. But your replies showed me that I have to sharpen the question.

The notebook is not only about how to use them but also about the meaning/tradeoffs that come with them.

from the notebook: "Vectorized Environments are a method for stacking multiple independent environments into a single environment. Instead of training an RL agent on 1 environment per step, it allows us to train it on n environments per step. This provides two benefits:

- Agent experience can be collected more quickly
- The experience will contain a more diverse range of states, which usually improves exploration

Stable-Baselines provides two types of Vectorized Environment:

- SubprocVecEnv, which runs each environment in a separate process
- DummyVecEnv, which runs all the environments in the same process

In practice, DummyVecEnv is usually faster than SubprocVecEnv because of the communication delays that subprocesses have."

In the A2C case multiple workers are explicitly not used for performance reasons. They are used to avoid experience replay.

do you mean performance in terms of wall-clock time or episodic reward?

The A3C paper sells it with two arguments:

- running several agents in parallel on different copies of the environment decorrelates the data, which stabilizes training without an experience replay buffer
- experience is collected in parallel, so training is faster in wall-clock time

in practice, the number of workers also affects the final performance (but sometimes you don't need more than one worker, see PPO on continuous control tasks)

Pit-Storm commented 3 years ago

I think we are on the same page, but viewing the issue from different angles.

I understand the connections completely now.

Maybe you could consider making this clearer for distributed algorithms: if one wants the distribution across workers, one has to use multiple environments.

Thank you for the discussion. This was very exciting!