We are aware of the non-determinism issue with RLlib, and it is on our to-do list to figure out. Thanks for pointing it out.
Thanks! Do you know if it's just a matter of using the global random state throughout the code vs. using a separate random state in every class/file? I can try this out myself on my end (and I'm happy to contribute if allowed), but I suspect there might be something more going on.
Hi @utkarshp, I actually investigated the problem a little bit, and it turns out reproducibility should not be an issue anymore as long as you use tune (across all the different resource specifications, i.e. GPU, CPU, num_workers > 0, etc.). RLlib should support reproducible experimentation as long as the environment is deterministic. You can check out https://github.com/ray-project/ray/blob/master/rllib/examples/deterministic_training.py to see the example.
I have also tested DQN on a small deterministic cartpole example and it works fine. The code is shared below:
import unittest

import gym
import ray
from ray.tune.registry import register_env
from ray import tune
from ray.rllib.algorithms.dqn import DQNConfig, DQN


class DeterministicCartPole(gym.Env):
    def __init__(self, seed=0):
        self.env = gym.make("CartPole-v0")
        self.env.seed(seed)
        self.action_space = self.env.action_space
        self.observation_space = self.env.observation_space

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


seed = 0
print(f"Starting tests with seed = {seed}")

register_env(
    "deterministic_env",
    lambda _: DeterministicCartPole(seed=seed),
)

config = (
    DQNConfig()
    .environment(env="deterministic_env")
    .resources(num_gpus=1)
    .debugging(seed=seed)
    .rollouts(num_rollout_workers=0)
    .framework("torch")
)

tune.run(
    DQN,
    name="DQN_DETERMINISTIC_CARTPOLE",
    config=config.to_dict(),
    stop={"timesteps_total": 1e4},
)
This is amazing! I just tried using tune in my example and things seem to be deterministic! Just out of curiosity, what is it about using tune that makes things deterministic like this? Does it somehow force RLlib to use some other random state? Thanks a lot for your help @kouroshHakha. I am not sure whether I should close this issue, so I am leaving it open for now.
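(For my own notes, my rough mental model is that the seed passed through the config is what matters: it gets applied in every worker process, not just in the driver. The sketch below is only that mental model, not RLlib's actual code; the helper name seed_everything is mine.)

import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    # Illustrative only: seed the global RNGs a worker process might draw from.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Full GPU determinism usually also needs deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

The important part would be that this happens per worker, which a single random.seed() call in the driver script cannot achieve.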
For anyone who finds this issue later and who, like me, hasn't used tune before: I made the following changes after the definition of the SimpleOffPolicyServing class in my example to get a similar run:
class MyCallback(Callback):
    def on_trial_result(self, iteration: int, trials: List["Trial"],
                        trial: "Trial", result: Dict, **info):
        r_int = random.randint(0, 2 ** 32)
        print(
            "Iteration {}, reward {}, timesteps {}, rnum {}".format(
                iteration, result["episode_reward_mean"], result["timesteps_total"], r_int
            )
        )
        random.seed(1)


def stopper(_, result):
    return result["episode_reward_mean"] >= 80 or result["training_iteration"] >= 50


class TestExternalEnv(unittest.TestCase):
    @classmethod
    def setUpClass(cls) -> None:
        ray.init(ignore_reinit_error=True)

    @classmethod
    def tearDownClass(cls) -> None:
        ray.shutdown()

    def test_train_cartpole_off_policy(self):
        print("Starting tests")
        register_env(
            "test3",
            lambda _: PartOffPolicyServing(gym.make("CartPole-v0"), off_pol_frac=0.2),
        )
        config = {
            "num_workers": 0,
            "exploration_config": {"epsilon_timesteps": 100},
            "env": "test3",
            "seed": 1,
        }
        torch.manual_seed(1)
        np.random.seed(1)
        for _ in framework_iterator(config, frameworks=("tf", "torch")):
            # dqn = Dqn(env="test3", config=config)
            tune.run("DQN", config=config, callbacks=[MyCallback()], stop=stopper)
            # if result["episode_reward_mean"] < 80:
            #     raise Exception("failed to improve reward")
The code after this is unchanged. I see that the generated random integers are always the same. There are some differences in the generated output, namely that it is printed only every 3 iterations and that a lot more metrics are printed. I suppose I need to tune (lol) the arguments a bit more to get output that is exactly the same as before.
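For the "lot more metrics printed" part, one way to trim tune's console output (standard tune usage, nothing specific to this issue; the column list below is just an example) is to pass a progress reporter:

from ray import tune
from ray.tune import CLIReporter

reporter = CLIReporter(
    # Only show the metrics we actually compare across runs.
    metric_columns=["episode_reward_mean", "timesteps_total", "training_iteration"],
    # Print the progress table at most once every 5 seconds.
    max_report_frequency=5,
)

tune.run(
    "DQN",
    config=config,
    callbacks=[MyCallback()],
    stop=stopper,
    progress_reporter=reporter,
)

Note that this only affects tune's progress table; the per-iteration rnum line still comes from MyCallback.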
So I have tried this with .train() as well, and it is still reproducible; here is the exact code:
import unittest

import gym
import ray
from ray.tune.registry import register_env
from ray import tune
from ray.rllib.algorithms.dqn import DQNConfig, DQN


class DeterministicCartPole(gym.Env):
    def __init__(self, seed=0):
        self.env = gym.make("CartPole-v0")
        self.env.seed(seed)
        self.action_space = self.env.action_space
        self.observation_space = self.env.observation_space

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


seed = 0
print(f"Starting tests with seed = {seed}")

register_env(
    "deterministic_env",
    lambda _: DeterministicCartPole(seed=seed),
)

config = (
    DQNConfig()
    .environment(env="deterministic_env")
    .resources(num_gpus=1)
    .debugging(seed=seed)
    .rollouts(num_rollout_workers=8)  # this for me has caused repro issues
    .reporting(min_time_s_per_iteration=0)  # This line is very important
    .framework("torch")
)

# train() call
algo = config.build()
for i in range(3):
    print(f"//// iteration {i}")
    results = algo.train()
    print(results["episode_reward_mean"])

# tune.run call
tune.run(
    DQN,
    name="DQN_DET_CARTPOLE",
    config=config.to_dict(),
    stop={"training_iteration": 3},
)
In the above code both methods should produce the same episode_reward_mean after three iterations. Your questions actually brought up some good points that I want to clarify here:
If you care about reproducibility, you have to make sure that no stopping or iteration condition is based on wall-clock time. For example, the above code snippet would not have worked if min_time_s_per_iteration had been left at its default value of 1 (the default is set in SimpleQ's config object, which DQN inherits from). With that default, the algorithm has to wait at least 1 second per iteration before moving on to the next one, so it may keep sampling/training even after min_train_timesteps_per_iteration or min_sample_timesteps_per_iteration is reached. The number of extra steps then depends on wall-clock time, which perturbs the random state at some point during training, so you may still see differences between runs.
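A minimal sketch of the relevant settings, assuming you want iteration boundaries defined purely in timesteps (the concrete numbers are just examples):

from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment(env="deterministic_env")
    .debugging(seed=0)
    .reporting(
        # Do not force a minimum wall-clock time per iteration ...
        min_time_s_per_iteration=0,
        # ... and define the iteration boundary in timesteps instead.
        min_sample_timesteps_per_iteration=1000,
    )
)

# Likewise, stop on iterations/timesteps rather than on e.g. {"time_total_s": ...}.
stop = {"training_iteration": 3}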
What happened + What you expected to happen
I am trying to make my code deterministic. I have tried setting every seed I could think of to a fixed value, but I don't seem to be able to make the runs reproducible. I have narrowed the issue down to this: I think Ray/RLlib is using the global random state in a non-deterministic way somewhere during every call to train(). I was able to reproduce the issue in the attached script. Note that this is not my own code; it is copied from test_external_env.py, with edits to generate random numbers.
This has been particularly frustrating when debugging my own code, which crashes or gives unexpected outputs in some runs but runs fine whenever I try to step through it in a debugger.
Here is the output I see from the attached code:
As you can see, the random numbers are all different. If the code were deterministic, we would get the same random number in every iteration. Note that even a constant number of calls to random.random() inside the train() function would still guarantee that the rnums are all the same; this behavior therefore indicates a non-deterministic number of calls to random.random().
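To make that point concrete, here is a tiny Ray-independent illustration; fake_train below is of course just a stand-in for train(), not anything from RLlib:

import random

def fake_train(num_hidden_draws):
    # Stand-in for train(): consumes some values from the global random state.
    for _ in range(num_hidden_draws):
        random.random()

random.seed(1)
for i in range(3):
    fake_train(num_hidden_draws=5)  # constant number of draws -> same rnum every time
    print("constant draws, rnum:", random.randint(0, 2 ** 32))
    random.seed(1)  # reseed, as MyCallback does

random.seed(1)
for i in range(3):
    fake_train(num_hidden_draws=5 + i)  # varying number of draws -> different rnums
    print("varying draws, rnum:", random.randint(0, 2 ** 32))
    random.seed(1)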
Versions / Dependencies
Output of conda list:
CUDA Version: 11.6
OS: Linux 4.19.0-18-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
Reproduction script
Issue Severity
High: It blocks me from completing my task.