RLLib issue with making the program deterministic #27292

utkarshp closed 2 years ago

utkarshp commented 2 years ago

What happened + What you expected to happen

I am trying to make my code deterministic. I have tried fixing every seed I could think of, but I can't get it to work. I have narrowed the issue down to this: I think Ray/RLLib is using the global random state in a non-deterministic way somewhere during every call to train(). I was able to reproduce the issue in the attached script. Note that this is not my code; it is copied from test_external_env.py, with edits to generate random numbers.

This has been particularly frustrating to me when trying to debug my own code which seems to crash or give unexpected outputs in some runs, but runs fine when I try to debug.

Here is the output I see from the attached code:

❯ python src/test_random.py
Starting tests
2022-07-29 18:25:04,690 INFO services.py:1338 -- View the Ray dashboard at
Starting tests
2022-07-29 18:25:28,200 INFO trainer.py:722 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also want to then set `eager_tracing=True` in order to reach similar execution speed as with static-graph mode.
2022-07-29 18:25:28,200 INFO dqn.py:141 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2022-07-29 18:25:28,200 INFO trainer.py:743 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2022-07-29 18:25:45,027 WARNING deprecation.py:45 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
2022-07-29 18:25:45,463 INFO trainable.py:124 -- Trainable.setup took 17.264 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
Iteration 0, reward 17.50877192982456, timesteps 1000, rnum 2191835438
Iteration 1, reward 13.05, timesteps 2000, rnum 3479080518
Iteration 2, reward 15.61, timesteps 3000, rnum 2426226194
Iteration 3, reward 21.28, timesteps 4000, rnum 2342326304
Iteration 4, reward 27.15, timesteps 5000, rnum 3122304301
Iteration 5, reward 36.52, timesteps 6000, rnum 910541280
Iteration 6, reward 44.63, timesteps 7000, rnum 910541280
Iteration 7, reward 53.59, timesteps 8000, rnum 3327464432
Iteration 8, reward 61.15, timesteps 9000, rnum 1644020759
Iteration 9, reward 68.07, timesteps 10000, rnum 910541280
Iteration 10, reward 75.88, timesteps 11000, rnum 910541280
Iteration 11, reward 84.43, timesteps 12000, rnum 880162936
2022-07-29 18:26:25,133 WARNING deprecation.py:45 -- DeprecationWarning: `convert_to_non_torch_type` has been deprecated. Use `ray/rllib/utils/numpy.py::convert_to_numpy` instead. This will raise an error in the future!
Iteration 0, reward 10.826086956521738, timesteps 1000, rnum 1581442016
Iteration 1, reward 11.99, timesteps 2000, rnum 1280829708
Iteration 2, reward 15.25, timesteps 3000, rnum 1012006804
Iteration 3, reward 22.11, timesteps 4000, rnum 3430153998
Iteration 4, reward 31.44, timesteps 5000, rnum 1664651095
Iteration 5, reward 39.48, timesteps 6000, rnum 1286226248
Iteration 6, reward 48.45, timesteps 7000, rnum 910541280
Iteration 7, reward 57.75, timesteps 8000, rnum 3327464432
Iteration 8, reward 66.63, timesteps 9000, rnum 2750021556
Iteration 9, reward 75.1, timesteps 10000, rnum 910541280
Iteration 10, reward 84.41, timesteps 11000, rnum 3327464432

As you can see, the random numbers do not match between runs. If the code were deterministic, each iteration would produce the same random number in every run. Note that even if train() made a constant number of calls to random.random(), the rnums would still match. This behavior indicates a non-deterministic number of calls to random.random().

Versions / Dependencies

Output of conda list:

CUDA Version: 11.6 OS: Linux 4.19.0-18-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux

Reproduction script

import random
import unittest

import gym
import numpy as np
import ray
import torch
from ray.rllib.agents.dqn import DQNTrainer as Dqn
from ray.rllib.env.external_env import ExternalEnv
from ray.rllib.utils.test_utils import framework_iterator
from ray.tune.registry import register_env

def make_simple_serving(multiagent, superclass):
    class SimpleServe(superclass):
        def __init__(self, env):
            superclass.__init__(self, env.action_space, env.observation_space)
            self.env = env

        def run(self):
            eid = self.start_episode()
            obs = self.env.reset()
            while True:
                action = self.get_action(eid, obs)
                obs, reward, done, info = self.env.step(action)
                if multiagent:
                    self.log_returns(eid, reward)
                else:
                    self.log_returns(eid, reward, info=info)
                if done:
                    print("Ended episode", eid)
                    self.end_episode(eid, obs)
                    obs = self.env.reset()
                    eid = self.start_episode()

    return SimpleServe

# generate & register SimpleServing class
SimpleServing = make_simple_serving(False, ExternalEnv)

class PartOffPolicyServing(ExternalEnv):
    def __init__(self, env, off_pol_frac):
        ExternalEnv.__init__(self, env.action_space, env.observation_space)
        self.env = env
        self.off_pol_frac = off_pol_frac
        self.rs = np.random.RandomState(seed=1)

    def run(self):
        eid = self.start_episode()
        obs = self.env.reset()
        while True:
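            if self.rs.random() < self.off_pol_frac:
                # Off-policy step: sample an action locally and log it.
                action = self.env.action_space.sample()
                self.log_action(eid, obs, action)
            else:
                # On-policy step: ask the policy for an action.
                action = self.get_action(eid, obs)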
            obs, reward, done, info = self.env.step(action)
            self.log_returns(eid, reward, info=info)
            if done:
                self.end_episode(eid, obs)
                obs = self.env.reset()
                eid = self.start_episode()

class SimpleOffPolicyServing(ExternalEnv):
    def __init__(self, env, fixed_action):
        ExternalEnv.__init__(self, env.action_space, env.observation_space)
        self.env = env
        self.fixed_action = fixed_action

    def run(self):
        eid = self.start_episode()
        obs = self.env.reset()
        while True:
            action = self.fixed_action
            self.log_action(eid, obs, action)
            obs, reward, done, info = self.env.step(action)
            self.log_returns(eid, reward, info=info)
            if done:
                self.end_episode(eid, obs)
                obs = self.env.reset()
                eid = self.start_episode()

class TestExternalEnv(unittest.TestCase):
    # NOTE: the bodies of setUpClass/tearDownClass are missing from this post;
    # ray.init()/ray.shutdown() are assumed here.
    @classmethod
    def setUpClass(cls) -> None:
        ray.init(ignore_reinit_error=True)

    @classmethod
    def tearDownClass(cls) -> None:
        ray.shutdown()

    def test_train_cartpole_off_policy(self):
        print("Starting tests")
        register_env(
            "test3",
            lambda _: PartOffPolicyServing(gym.make("CartPole-v0"), off_pol_frac=0.2),
        )
        config = {
            "num_workers": 0,
            "exploration_config": {"epsilon_timesteps": 100},
            "seed": 1,
        }
        for _ in framework_iterator(config, frameworks=("tf", "torch")):
            dqn = Dqn(env="test3", config=config)
            reached = False
            for i in range(50):
                result = dqn.train()
                # I would expect this to be the same integer in every run if everything is deterministic
                r_int = random.randint(0, 2**32)
                print(
                    "Iteration {}, reward {}, timesteps {}, rnum {}".format(
                        i, result["episode_reward_mean"], result["timesteps_total"], r_int
                    )
                )
                if result["episode_reward_mean"] >= 80:
                    reached = True
                    break
            if not reached:
                raise Exception("failed to improve reward")

if __name__ == "__main__":
    import pytest
    import sys

    t = TestExternalEnv()
    # Assumed: the call that actually runs the test is missing from this post.
    t.test_train_cartpole_off_policy()

Issue Severity

High: It blocks me from completing my task.

kouroshHakha commented 2 years ago

We are aware of the non-determinism issue with RLlib, and it is on our to-do list to figure out. Thanks for pointing it out.

utkarshp commented 2 years ago

Thanks! Do you know if it's just a matter of using the global random state throughout the code vs. using a separate random state in every class/file? I can try this out on my end (and I'm also happy to contribute if allowed), but somehow I think there might be something more going on.
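To make the distinction in the question concrete, here is a minimal sketch of a component that owns a private random state instead of drawing from the global one. `Sampler` and `noisy_library_call` are hypothetical names used only for illustration; this is not RLlib code, and it does not show whether such a change would be sufficient for RLlib.

```python
import random


class Sampler:
    """Hypothetical component that owns a private RNG instead of using the global one."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def sample(self):
        return self._rng.random()


def noisy_library_call(n_draws):
    """Hypothetical third-party code that draws from the global RNG."""
    for _ in range(n_draws):
        random.random()


def run(n_global_draws):
    """Interleave global-RNG draws with samples taken from the private RNG."""
    sampler = Sampler(seed=1)
    out = []
    for _ in range(3):
        noisy_library_call(n_global_draws)
        out.append(sampler.sample())
    return out


# The private stream is unaffected by how often the global RNG is used.
isolated_a = run(n_global_draws=0)
isolated_b = run(n_global_draws=7)
print(isolated_a == isolated_b)  # True
```

A private `random.Random` (or `np.random.RandomState`, as `PartOffPolicyServing` already uses) protects only the code that draws from it; any code that still reads the global state stays sensitive to the number of draws made elsewhere.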

kouroshHakha commented 2 years ago

Hi @utkarshp, I looked into the problem a bit, and it turns out reproducibility should no longer be an issue as long as you use tune (across all resource specifications, e.g. GPU, CPU, num_workers > 0, etc.). RLlib should support reproducible experimentation as long as the environment is deterministic. You can check out https://github.com/ray-project/ray/blob/master/rllib/examples/deterministic_training.py for an example.

I have also tested DQN on a small deterministic cartpole example and it works fine. The code is shared below:

import unittest

import gym
import ray
from ray.tune.registry import register_env
from ray import tune

from ray.rllib.algorithms.dqn import DQNConfig, DQN

class DeterministicCartPole(gym.Env):

    def __init__(self, seed=0):
        self.env = gym.make("CartPole-v0")
        self.action_space = self.env.action_space
        self.observation_space = self.env.observation_space

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)

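seed = 0
print(f"Starting tests with seed = {seed}")
# NOTE: parts of this snippet are missing from the post. The registration name
# and the builder calls marked "assumed" are placeholders, not the original code.
register_env(
    "DeterministicCartPole",  # assumed name
    lambda _: DeterministicCartPole(seed=seed),
)
config = (
    DQNConfig()
    .environment("DeterministicCartPole")  # assumed
    .debugging(seed=seed)  # assumed
)

tune.run(
    DQN,
    config=config.to_dict(),
    stop={"timesteps_total": 1e4},
)

utkarshp commented 2 years ago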

This is amazing! I just tried using tune in my example and things seem to be deterministic! Just out of curiosity, what is it about using tune that makes things deterministic like this? Does it somehow force RLLib to use some other random state? Thanks a lot for your help @kouroshHakha. I am not sure if I should close this issue or not, so leaving it open for now.

For anyone that finds this issue later, and like me, hasn't used tune before, I made the following changes after the definition of the SimpleOffPolicyServing class in my example to get a similar run:

from typing import Dict, List

from ray import tune
from ray.tune import Callback

class MyCallback(Callback):
    def on_trial_result(self, iteration: int, trials: List["Trial"],
                        trial: "Trial", result: Dict, **info):
        r_int = random.randint(0, 2 ** 32)
        print(
            "Iteration {}, reward {}, timesteps {}, rnum {}".format(
                iteration, result["episode_reward_mean"], result["timesteps_total"], r_int
            )
        )

def stopper(_, result):
    return result["episode_reward_mean"] >= 80 or result["training_iteration"] >= 50

class TestExternalEnv(unittest.TestCase):
    # NOTE: the bodies of setUpClass/tearDownClass are missing from this post;
    # ray.init()/ray.shutdown() are assumed here.
    @classmethod
    def setUpClass(cls) -> None:
        ray.init(ignore_reinit_error=True)

    @classmethod
    def tearDownClass(cls) -> None:
        ray.shutdown()

    def test_train_cartpole_off_policy(self):
        print("Starting tests")
        register_env(
            "test3",
            lambda _: PartOffPolicyServing(gym.make("CartPole-v0"), off_pol_frac=0.2),
        )
        config = {
            "num_workers": 0,
            "exploration_config": {"epsilon_timesteps": 100},
            "env": "test3",
            "seed": 1,
        }
        for _ in framework_iterator(config, frameworks=("tf", "torch")):
            # dqn = Dqn(env="test3", config=config)
            tune.run("DQN", config=config, callbacks=[MyCallback()], stop=stopper)
            # if result["episode_reward_mean"] < 80:
            #     raise Exception("failed to improve reward")

The code after this is unchanged. I see that the generated random integers are always the same. There is some difference in the generated output, namely the output is printed every 3 iterations, and I see a lot more metrics printed. I suppose I need to tune (lol) the arguments a bit more to get an output that is exactly the same as before.

kouroshHakha commented 2 years ago

I have tried this with .train() as well, and it is still reproducible. Here is the exact code:

import unittest

import gym
import ray
from ray.tune.registry import register_env
from ray import tune

from ray.rllib.algorithms.dqn import DQNConfig, DQN

class DeterministicCartPole(gym.Env):

    def __init__(self, seed=0):
        self.env = gym.make("CartPole-v0")
        self.action_space = self.env.action_space
        self.observation_space = self.env.observation_space

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)

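seed = 0
print(f"Starting tests with seed = {seed}")
# NOTE: parts of this snippet are missing from the post. The registration name
# and the builder calls marked "assumed" are placeholders, not the original code.
register_env(
    "DeterministicCartPole",  # assumed name
    lambda _: DeterministicCartPole(seed=seed),
)
config = (
    DQNConfig()
    .environment("DeterministicCartPole")  # assumed
    .debugging(seed=seed)  # assumed
    .rollouts(num_rollout_workers=8)  # this for me has caused repro issues
    .reporting(min_time_s_per_iteration=0)  # This line is very important
)

# train() call
algo = config.build()
for i in range(3):
    print(f"//// iteration {i}")
    results = algo.train()

# tune.run call
tune.run(
    DQN,
    config=config.to_dict(),
    stop={"training_iteration": 3},
)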

In the above code both methods should produce the same episode_reward_mean after three iterations. Your questions actually brought up some good points that I want to clarify here:

If you care about reproducibility, you have to make sure that no stopping condition is based on wall-clock time. For example, the above code snippet would not have worked if min_time_s_per_iteration had been left at its default value of 1 (the default is set in SimpleQ's config object, which DQN inherits from). With that default, the algorithm waits at least 1 second per iteration before moving on to the next one, so an iteration keeps running even after min_train_timesteps_per_iteration or min_sample_timesteps_per_iteration has been reached. This perturbs the random state at some point during training, so you may still see differences between runs.
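The effect of a wall-clock minimum can be sketched without RLlib. The loop below is a hypothetical simplification, not RLlib's actual iteration logic: it keeps stepping until both a step minimum and a time minimum are met, and each step consumes one value from a seeded RNG. Time is simulated with an integer millisecond counter, so "fast" and "slow" machines differ only in `step_cost_ms`; all names here are assumptions made for illustration.

```python
import random


def run_iteration(rng, min_steps, min_time_ms, step_cost_ms):
    """Step until both the step minimum and the (simulated) time minimum are met."""
    elapsed_ms = 0
    steps = 0
    while steps < min_steps or elapsed_ms < min_time_ms:
        rng.random()                # each step consumes one value from the RNG
        elapsed_ms += step_cost_ms  # simulated wall-clock cost of one step
        steps += 1
    return steps


def run(min_time_ms, step_cost_ms, seed=0):
    """Run one iteration, then return the step count and the next RNG value."""
    rng = random.Random(seed)
    steps = run_iteration(rng, min_steps=10, min_time_ms=min_time_ms,
                          step_cost_ms=step_cost_ms)
    return steps, rng.random()


# With a 1-second minimum, the step count depends on how fast each step is.
fast_timed = run(min_time_ms=1000, step_cost_ms=10)
slow_timed = run(min_time_ms=1000, step_cost_ms=50)
print(fast_timed[0], slow_timed[0])  # 100 20

# With the time minimum set to 0, only the step minimum matters.
fast_untimed = run(min_time_ms=0, step_cost_ms=10)
slow_untimed = run(min_time_ms=0, step_cost_ms=50)
print(fast_untimed == slow_untimed)  # True
```

In the timed case the two runs consume a different number of random values, so everything drawn afterwards differs; with the time minimum at 0 the step count, and therefore the random stream, is identical regardless of speed.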