ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

Determinism Bugfix #151

Closed FlorianKlemt closed 5 years ago

FlorianKlemt commented 5 years ago

Hi everyone,

I tried to get reproducible results (meaning the same sequence of actions, observations, and rewards) in different Atari environments.

When running on the CPU the results are reproducible; however, when using CUDA, neither the actions chosen by the agent nor the observations/rewards are deterministic.

Interestingly, when using only 1 worker, the results are reproducible even on CUDA.

When using:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

the actions chosen by the agent become reproducible on CUDA. (Be careful when using these flags: they can and will greatly impact runtime.)
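
For reference, these flags are typically combined with the usual seeding calls; the following is only a minimal sketch (the helper name set_deterministic is illustrative, and the full test script below does the same thing inline):

import random
import numpy as np
import torch

def set_deterministic(seed):
    # Seed every RNG that influences a run: Python, NumPy, and PyTorch (CPU + CUDA).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # The cuDNN flags discussed above; expect a noticeable runtime penalty.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False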

What is weird is that the observations returned by the environment are still non-deterministic, even though the actions are deterministic.

After a long search, it seems the issue is in envs.py, in the reset method of the class VecPyTorchFrameStack.

def reset(self):
    obs = self.venv.reset()
    self.stacked_obs.zero_()  # in-place zeroing; non-deterministic with multiple workers on CUDA
    self.stacked_obs[:, -self.shape_dim0:] = obs
    return self.stacked_obs

The in-place zero seems to make results non-deterministic when using multiple workers on CUDA. I am not sure how this is possible, but when replacing the line with:

self.stacked_obs = torch.zeros(self.stacked_obs.shape)

all results (actions, observations, rewards) become completely reproducible on CUDA with multiple workers.
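
As a side note, torch.zeros_like would allocate the fresh buffer with the same dtype and device as the old one; whether that variant stays deterministic is untested here, since the replacement above allocates on the CPU:

def reset(self):
    obs = self.venv.reset()
    # Untested variant: allocate a fresh zero buffer instead of zeroing in place,
    # keeping the dtype and device of the previous stacked_obs tensor.
    self.stacked_obs = torch.zeros_like(self.stacked_obs)
    self.stacked_obs[:, -self.shape_dim0:] = obs
    return self.stacked_obs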

It would be great if you could check whether this fix makes results reproducible for you too. If it does, I would propose changing the in-place-zero line. Without the cudnn determinism flags (which you could perhaps expose via an optional determinism argument, sketched below) the fix alone won't give full determinism, but it might save the next programmer a lot of time.
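
Such a determinism argument could be wired up roughly as follows; this is only a sketch, and the flag name --deterministic and its placement are assumptions rather than existing code in this repository:

import argparse
import torch

parser = argparse.ArgumentParser()
# Hypothetical flag; the real argument parser in this repo may differ.
parser.add_argument('--deterministic', action='store_true',
                    help='enable deterministic cuDNN kernels (slower, but reproducible)')
args = parser.parse_args()

if args.deterministic:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False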

Best regards, Florian Klemt





Here is my test file, which highlights the problem. To switch between the deterministic and non-deterministic behavior, swap the indicated lines in the reset method of VecPyTorchFrameStack.

import torch
import gym
import numpy as np
import random

seed = 1

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

torch.set_num_threads(1)

from baselines.common.vec_env import VecEnvWrapper
class VecPyTorchFrameStack(VecEnvWrapper):
    def __init__(self, venv, nstack, device=None):
        self.venv = venv
        self.nstack = nstack

        wos = venv.observation_space  # wrapped ob space
        self.shape_dim0 = wos.shape[0]

        low = np.repeat(wos.low, self.nstack, axis=0)
        high = np.repeat(wos.high, self.nstack, axis=0)

        if device is None:
            device = torch.device('cpu')
        self.stacked_obs = torch.zeros((venv.num_envs,) + low.shape).to(device)

        observation_space = gym.spaces.Box(
            low=low, high=high, dtype=venv.observation_space.dtype)
        VecEnvWrapper.__init__(self, venv, observation_space=observation_space)

    def step_wait(self):
        obs, rews, news, infos = self.venv.step_wait()
        self.stacked_obs[:, :-self.shape_dim0] = \
            self.stacked_obs[:, self.shape_dim0:]
        for (i, new) in enumerate(news):
            if new:
                self.stacked_obs[i] = 0
        self.stacked_obs[:, -self.shape_dim0:] = obs
        return self.stacked_obs, rews, news, infos

    def reset(self):
        obs = self.venv.reset()
        #self.stacked_obs.zero_() #this line is NON-DETERMINISTIC
        self.stacked_obs = torch.zeros(self.stacked_obs.shape) #DETERMINISTIC replacement
        self.stacked_obs[:, -self.shape_dim0:] = obs
        return self.stacked_obs

    def close(self):
        self.venv.close()

def make_vec_envs(env_name, seed, num_processes, gamma, log_dir, add_timestep,
                  device, allow_early_resets):
    from envs import make_env, VecPyTorch
    from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
    from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

    envs = [make_env(env_name, seed, i, log_dir, add_timestep, allow_early_resets)
            for i in range(num_processes)]

    if len(envs) > 1:
        envs = SubprocVecEnv(envs)
    else:
        envs = DummyVecEnv(envs)

    envs = VecPyTorch(envs, device)

    envs = VecPyTorchFrameStack(envs, 4, device)

    return envs

def set_seeds(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    random.seed(seed)
    np.random.seed(seed)
    # older gym versions: seed the module-level RNG used by action_space.sample()
    from gym.spaces import prng
    prng.seed(seed)

def generate_test_data(env, num_processes, nr_games, nr_actions_per_game):
    all_obs = []
    all_rewards = []
    all_actions = []
    obs = env.reset()
    all_obs.append(obs.cpu().numpy())
    for i in range(nr_games):
        for _ in range(nr_actions_per_game):
            action = [[env.action_space.sample()] for _ in range(num_processes)]
            torch_action = torch.from_numpy(np.array(action))
            obs, reward, done, info = env.step(torch_action)
            all_obs.append(obs.cpu().numpy())
            all_rewards.append(reward.cpu().numpy())
            all_actions.append(torch_action.cpu().numpy())

    return all_obs, all_rewards, all_actions

def test_wrapped_env(env_name, seed, processes, device, nr_games = 10, nr_actions_per_game = 50):
    set_seeds(seed)

    env = make_vec_envs(env_name=env_name, seed=seed, num_processes=processes, gamma=0.99, log_dir='/tmp/gym/', add_timestep=False, device=device, allow_early_resets=False)

    return generate_test_data(env, processes, nr_games, nr_actions_per_game)

####start of main
device = torch.device("cuda:0")
env_name = "PongNoFrameskip-v4"
processes = 8
nr_games = 50
nr_actions_per_game = 50
all_obs, all_rewards, all_actions = test_wrapped_env(env_name=env_name, seed=seed, processes=processes, device=device, nr_games=nr_games, nr_actions_per_game=nr_actions_per_game)
all_obs2, all_rewards2, all_actions2 = test_wrapped_env(env_name=env_name, seed=seed, processes=processes, device=device, nr_games=nr_games, nr_actions_per_game=nr_actions_per_game)

assert len(all_rewards) == len(all_rewards2)
all_obs_equal, all_rewards_equal, all_actions_equal = True, True, True
for i in range(len(all_rewards)):
    obs_equal = np.equal(all_obs2[i],all_obs[i]).all()
    rewards_equal = np.equal(all_rewards2[i],all_rewards[i]).all()
    actions_equal = np.equal(all_actions2[i],all_actions[i]).all()
    all_obs_equal &= obs_equal
    all_rewards_equal &= rewards_equal
    all_actions_equal &= actions_equal

print("Obs equal: ", all_obs_equal)
print("Reward equal: ", all_rewards_equal)
print("Actions equal: ", all_actions_equal)
ikostrikov commented 5 years ago

Hi Florian!

That's awesome! We have been struggling with non-deterministic behavior from CUDA for so long... I'm very glad that you have fixed the issue.

I think you should also create an issue in the PyTorch repository, since it seems to be a more general problem.

Best regards, Ilya

ikostrikov commented 5 years ago

Fixed in https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/pull/152