hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

SubprocVecEnv produces identical outputs for all sub-processes #1120

Closed: acertainKnight closed this issue 3 years ago

acertainKnight commented 3 years ago

I am running an A2C algorithm and using the SubprocVecEnv function to distribute the computing across multiple CPUs. As part of my environment I have checkpoints that print the agent's stats from the environment at that time step. For some reason the printed outputs are identical for all 48 of my sub-processes.

I produced my environment using:

self.train_env = SubprocVecEnv([self._make_env(i, arguments) for i in range(n_procs)], start_method='fork')

def _make_env(self, rank, arguments, seed=0):
    def _init():
        env = Env(arguments)
        env.seed(seed + rank + random.randint(0, 10**50))
        return env

    set_random_seed(seed)
    return _init
Miffyli commented 3 years ago

Things get a bit hairy with references in this vecenv creation part. I think set_random_seed(seed) sets the global seed that random.randint uses, which results in all envs getting the same seed. On top of this, the rank variable is the same for all envs upon creation, because it is a reference to the i variable in the loop.

I have used this kind of setup, which works. Hope it is of some help.
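
Roughly, a minimal sketch of that kind of factory (not the exact code I used; it assumes SB3 imports, and Env, arguments and n_procs stand in for the ones from your snippet), where each worker gets a deterministic, distinct seed instead of relying on the global random state:

from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.utils import set_random_seed

def make_env(rank, arguments, seed=0):
    def _init():
        env = Env(arguments)       # your environment class
        env.seed(seed + rank)      # distinct, reproducible seed per worker
        return env
    set_random_seed(seed)
    return _init

train_env = SubprocVecEnv([make_env(i, arguments) for i in range(n_procs)],
                          start_method="fork")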

PS: For future reference, you can format multi-line code blocks by wrapping them in triple backticks (```) :)

araffin commented 3 years ago

Please use make_vec_env (cf. the docs), and we also recommend switching to SB3.
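
For example, a minimal sketch with SB3 (Env, arguments and n_procs are placeholders for the objects in the original post):

from stable_baselines3.common.env_util import make_vec_env

# make_vec_env seeds each sub-environment with seed + its index
train_env = make_vec_env(lambda: Env(arguments), n_envs=n_procs, seed=0)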

acertainKnight commented 3 years ago

Apologies, I am using SB3; I did not realize that I was posting in the wrong location. I had originally written the code based on the example in the documentation. I also tried the method @Miffyli suggested, but it does not seem to work either. I am quite confused as to why neither method works: both appear to initialize the environments with different seeds, yet all of them still produce the same outputs.

Miffyli commented 3 years ago

I recommend trying make_vec_env as araffin suggested. If that fails, then something is happening at the environment level, as SubprocVecEnv is known to work with other environments. I suggest you debug your environment to verify that changing the seed really changes its behavior.
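
A quick way to check this outside of the vecenv machinery (a sketch, not from the thread; Env and arguments are your own objects from the first post, and it assumes array observations):

import numpy as np

env_a, env_b = Env(arguments), Env(arguments)
env_a.seed(0)
env_b.seed(1)
# If the environment honours its seed, the first observations should differ
print("identical after different seeds:", np.array_equal(env_a.reset(), env_b.reset()))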

acertainKnight commented 3 years ago

This is helpful, thank you. I still have the same issue, but I can confirm the seeds are changing, so I will need to dive deeper to find the issue in my environment. Just to confirm: does make_vec_env distribute the environments over multiple CPUs, or does it run them all on the same CPU? Looking at my CPU utilization, the load seems to be concentrated on a single core even when multiple environments are used.

Miffyli commented 3 years ago

SubprocVecEnv puts each environment in its own Python process, which can be parallelized across CPU cores (see the parameters of make_vec_env). Do note that this only parallelizes collecting samples, not the actual training, so the gains may be small or even negative, with CPU cores left underutilized.

If you have no more questions, feel free to close the issue.

araffin commented 3 years ago

Just to confirm: does make_vec_env distribute the environments over multiple CPUs, or does it run them all on the same CPU?

SubprocVecEnv puts each environment in its own Python process, which can be parallelized across CPU cores (see the parameters of make_vec_env)

Yes, and by default make_vec_env uses DummyVecEnv (usually faster, but this can be changed).
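
To actually get one process per environment, you would pass the vectorized-env class explicitly, e.g. (a sketch with the same placeholders as above):

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

train_env = make_vec_env(
    lambda: Env(arguments),
    n_envs=n_procs,
    seed=0,
    vec_env_cls=SubprocVecEnv,                  # DummyVecEnv if omitted
    vec_env_kwargs=dict(start_method="fork"),   # forwarded to SubprocVecEnv
)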

acertainKnight commented 3 years ago

Thank you both, this has been a tremendous help. I believe I have found where the problem occurs, though I am still unsure why it causes an issue. To initialize my model I use a get_model function (reproduced below) within my DRL agent class.

The issue seems to occur as a result of my adding custom policy kwargs. Any idea what might cause that? If I run the models with the default policy and everything else the same, I then get differentiated results from each environment, as expected.

I also noticed that if I pull the model initialization out of the get_model function, I get a printed message that my GPU is being used, which I otherwise don't get. The issue with the non-differentiated results remains the same.

import numpy as np
import torch as th

from stable_baselines3 import A2C, DDPG, PPO, SAC, TD3
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise

# `config` is a project-specific module (imported elsewhere) that holds the
# *_PARAMS dicts and TENSORBOARD_LOG_DIR used below.

MODELS = {
    "a2c": A2C,
    "ddpg": DDPG,
    "td3": TD3,
    "sac": SAC,
    "ppo": PPO}

MODEL_KWARGS = {x: config.__dict__[f"{x.upper()}_PARAMS"] for x in MODELS.keys()}

NOISE = {
    "normal": NormalActionNoise,
    "ornstein_uhlenbeck": OrnsteinUhlenbeckActionNoise,
}

class DRLAgent:
    @staticmethod
    def get_model(model_name,
                  env,
                  policy="MlpPolicy",
                  policy_kwargs=dict(activation_fn=th.nn.ReLU,
                             net_arch=[dict(pi=[100, 50, 25], vf=[100, 50, 25])]),
                  model_kwargs=None,
                  verbose=0):

        if model_name not in MODELS:
            raise NotImplementedError(f"Model '{model_name}' is not implemented")

        if model_kwargs is None:
            temp_model_kwargs = MODEL_KWARGS[model_name]
        else:
            temp_model_kwargs = model_kwargs.copy()

        if "action_noise" in temp_model_kwargs:
            n_actions = env.action_space.shape[-1]
            temp_model_kwargs["action_noise"] = NOISE[temp_model_kwargs["action_noise"]](
                mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
            )

        model = MODELS[model_name](
            policy=policy,
            env=env,
            tensorboard_log=f"{config.TENSORBOARD_LOG_DIR}/{model_name}",
            verbose=verbose,
            policy_kwargs=policy_kwargs,
            **temp_model_kwargs,
        )
        return model
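
For reference, a hypothetical call of the helper above (train_env being the vectorized environment from the earlier snippets):

model = DRLAgent.get_model("a2c", env=train_env, verbose=1)
model.learn(total_timesteps=100_000)
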
acertainKnight commented 3 years ago

Actually, it seems to be caused by my use of the ReLU activation function specifically. I will need to investigate this further.
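
One way to narrow this down (a sketch, not from the thread; train_env is the vectorized environment from earlier and the policy_kwargs are the custom ones above) is to check whether the freshly initialized policy already maps different observations to the same action, which would make every sub-environment evolve identically:

import numpy as np
import torch as th
from stable_baselines3 import A2C

policy_kwargs = dict(activation_fn=th.nn.ReLU,
                     net_arch=[dict(pi=[100, 50, 25], vf=[100, 50, 25])])
model = A2C("MlpPolicy", train_env, seed=0, policy_kwargs=policy_kwargs)

obs = train_env.reset()                          # one observation per sub-env
actions, _ = model.predict(obs, deterministic=True)
print("observations identical:", np.allclose(obs, obs[0]))
print("actions identical:", np.allclose(actions, actions[0]))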

araffin commented 3 years ago

The issue seems to occur as a result of my adding custom policy kwargs. Any idea what might cause that? If I run the models with the default policy and everything else the same, I then get differentiated results from each environment, as expected.

Please use the RL Zoo; it is made for training agents and includes best practices.

Closing this as the original issue was solved.