Open berryjk opened 1 year ago
Replicating this comparison for several of the other tuned cartpole examples shows the reward histograms match well for SimpleQ and DQN but not for A3C, A2C, PPO, SAC, APPO, and DDPPO.
@berryjk Thanks for putting in the time to create this issue! Just to make this clear before I start debugging this: The same checkpoint gives you different results when you evaluate with CLI vs with the python script for A3C, A2C, PPO, SAC, APPO, and DDPPO? (You are not comparing "training with CLI and evaluation with CLI" against "training with script and evaluation with script", right?)
@ArturNiederfahrenhorst That is correct. The discrepancy was observed by evaluating a single checkpoint using the two methods (as demonstrated in the provided code snippet). In each experiment the checkpoints were produced using the CLI. Thanks for your patience with my delayed response.
@ArturNiederfahrenhorst have you been able to reproduce the discrepancy?
@ArturNiederfahrenhorst is there any interest from the RLlib team in reproducing this issue?
Yes, there is a high interest. But we are under high load at the moment. We will get to this issue as soon as our prioritization permits it.
Thanks @ArturNiederfahrenhorst . Let me know how I can help.
Can you check what happens if you turn off preprocessing for the training? (`config.experimental(_disable_preprocessor_api=True)`)
Since you do...

```python
env_name = "CartPole-v1"
env = gym.make(env_name)
```

...the env that you create is not wrapped in anything. It would be helpful if you could play around with this and see if it changes anything.
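For concreteness, the suggested change can be expressed through the Python API. This is only a sketch of where the flag goes, assuming PPO on CartPole-v1 as discussed in this thread; it is not a verified reproduction of the reporter's setup.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch: disable RLlib's preprocessor API so observations reach the
# model without RLlib-side preprocessing. Environment choice is assumed
# from the thread (CartPole-v1 with PPO).
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .experimental(_disable_preprocessor_api=True)
)
algo = config.build()
```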
@ArturNiederfahrenhorst the training is currently being performed with a CLI call using the cartpole-ppo.yaml config file provided in the tuned_examples directory. Your comment seems to suggest that I should switch over to the Python API to configure and run training. I should then reproduce the original issue using the Python API to drive training with no changes to the config, and afterwards generate results with your proposed config change of `config.experimental(_disable_preprocessor_api=True)` to see if anything changes. Is this correct?
In case it helps, I was struggling with an issue that had a similar effect to the one described here: rewards during training were much higher than the ones I was obtaining by fetching actions from the trained model, all within the Python API. I tried `_disable_preprocessor_api=True` with no effect. In particular, I was training a TD3 model:
```python
config = (
    TD3Config()
    .framework("torch")
    .rollouts(create_env_on_local_worker=True, observation_filter="MeanStdFilter")
    .environment(env=MyEnv, env_config=my_env_config)
)
```
And evaluating with:

```python
model = Algorithm.from_checkpoint(path)
env = MyEnv(my_env_config)
obs, _ = env.reset()
state = None
for i in range(env.total_steps):
    action = model.compute_single_action(obs, state, explore=False)
    obs, _reward, _terminated, _, _ = env.step(action)
```
where `MyEnv` is a custom class inheriting from `gym.Env`.
I reproduced the issue both when running training with tune and directly with `.train()`, and both when using the raw model immediately after training and when loading it from a checkpoint file. I've seen it on 2.7.0 and 2.7.1, and on both an M2 Mac and Ubuntu 22.04.
In this process I noticed that `compute_single_action` was not calling any `MeanStdFilter` instance (verified by adding log statements in `rllib/utils/filter.py`). I worked around the issue by calling the filter manually before passing the observation to `compute_single_action`:
```python
c = model.get_policy().agent_connectors.connectors[1]
action = model.compute_single_action(c.filter(obs), state, explore=False)
```
After this the evaluation actions started matching the ones I was getting during training.
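For readers unfamiliar with what the filter does: a mean/std observation filter whitens each observation with running statistics, so skipping it at inference time feeds the policy observations on a different scale than it was trained on. Below is a minimal sketch of that normalization idea; it is NOT RLlib's `MeanStdFilter` implementation, just the same arithmetic using Welford's online algorithm.

```python
import numpy as np

class RunningMeanStd:
    """Sketch of a mean/std observation filter (not RLlib's implementation):
    track running mean/variance and whiten each observation."""

    def __init__(self, shape):
        self.count = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # sum of squared deviations (Welford)

    def update(self, x):
        # Welford's online update of mean and sum of squared deviations.
        self.count += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.count
        self.m2 = self.m2 + delta * (x - self.mean)

    def normalize(self, x):
        # Whiten using the sample standard deviation; epsilon avoids
        # division by zero early on.
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + 1e-8
        return (x - self.mean) / std
```

A policy trained on filtered observations will behave arbitrarily badly if raw observations bypass `normalize` at evaluation time, which matches the symptom described above.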
What happened + What you expected to happen
I am getting different reward distributions when I use the Python API vs the rllib CLI on a PPO checkpoint. Below is a comparison of reward histograms produced by each method. This issue has been reproduced on ray 2.4, 2.5, and 2.6 and with other environments. This issue does not occur with DQN.
Bin Edges: [ 0. 25. 50. 75. 100. 125. 150. 175. 200. 225. 250. 275. 300. 325. 350. 375. 400. 425. 450. 475. 500. 525.]
CLI Histogram: [ 0 0 0 0 0 0 0 0 1 0 3 6 1 4 3 2 2 2 2 3 71]
Python Histogram: [ 0 0 0 0 0 42 41 10 5 1 0 0 0 0 1 0 0 0 0 0 0]
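For reference, histograms with these bin edges can be computed with NumPy; the reward values below are hypothetical placeholders, since the real script scrapes episode rewards from stdout.

```python
import numpy as np

# Hypothetical episode rewards, just to demonstrate the binning.
rewards = [500.0, 500.0, 500.0, 220.0]

# 22 edges -> 21 bins, matching the report above.
edges = np.arange(0, 550, 25)
counts, _ = np.histogram(rewards, bins=edges)
```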
To train I'm using:
Where cartpole-ppo.yaml is the one provided in the tuned_examples directory:
The checkpoint is evaluated using the CLI and episode rewards are printed to stdout:
The same checkpoint is also evaluated using the Python API like this (Note: setting exploration on/off does not resolve the discrepancy and neither does using the compute_single_action method directly from the Policy object):
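The Python snippet itself is not captured above, so here is a hedged sketch of the shape such an evaluation loop typically takes. The RLlib-specific calls (e.g. `Algorithm.from_checkpoint(...).compute_single_action`) are passed in as plain callables, and `collect_episode_rewards` is a hypothetical helper name, not part of any RLlib API.

```python
def collect_episode_rewards(reset_fn, step_fn, act_fn, n_episodes=100):
    """Run n_episodes and return each episode's total reward.

    reset_fn() -> (obs, info)                       e.g. env.reset()
    step_fn(action) -> (obs, r, term, trunc, info)  e.g. env.step(action)
    act_fn(obs) -> action                           e.g. lambda o: algo.compute_single_action(o, explore=False)
    """
    totals = []
    for _ in range(n_episodes):
        obs, _info = reset_fn()
        done, total = False, 0.0
        while not done:
            action = act_fn(obs)
            obs, reward, terminated, truncated, _info = step_fn(action)
            total += reward
            done = terminated or truncated
        totals.append(total)
    return totals
```

The per-episode totals returned by such a loop are what get binned into the Python-side histogram above.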
Versions / Dependencies
ray[rllib]==2.6
Reproduction script
Here is the full script that trains, evaluates, scrapes data from stdout, and creates the histograms. Repeating this process with DQN shows matching histograms for each evaluation method.
Issue Severity
High: It blocks me from completing my task.