Open madlsj opened 1 year ago
Hi @madlsj thanks for raising this issue. We are hoping to gradually enable model saving and evaluation at #292. So far, we have omitted evaluation for a list of reasons
torch.save
at the end of the training). As a result, we have primarily put our attention on making the training code succinct.--capture-video
which saves videos to wandb. ppo_continuous_action.py
also needs to save the states in the normalize wrappers, which is tricky to implement. We also support JAX, which require a slightly different way to run evaluation loops.That said, recent advances such as reincarnate RL aims to reuse existing models, models / evaluation can become increasingly relavent, so we are gradually adopting it.
Hi @vwxyzjn, thank you for your excellent work on this awesome library. I have a couple questions closely related to the above, and thought I'd piggyback on this issue rather than making my own:
I am attempting to save/load a trained agent trained with code based on your ppo_continuous_action.py
for evaluation. The agent performed significantly worse after reloading than it did during the latter stages of training. Eventually, I figured out what you mentioned above: I need to save and load the state of the NormalizeObservation wrapper, since that is effectively a learned part of the agent.
So I am saving and load the observation RunningMeanStd.mean
, RunningMeanStd.var
, and RunningMeanStd.count
variables. Since it is parallelized across n
environments, and there seem to be n
instances of the above variables, I simply grab the one from the environment with index 0
. Q1: Is just grabbing those variables from index 0
a reasonable solution, or do I need to do something more sophisticated?
Before I run more long trials, Q2: Are there any other similar "gotchas" that could also cause worse performance when saving/loading an agent? (I know that if I want to resume training, I'd have to also save/load the reward normalization RunningMeanStd data, but that shouldn't be necessary just for evaluation, correct?) Thank you!
Hi @JamesKCS, thanks for sharing these issues.
Q1: Is just grabbing those variables from index 0 a reasonable solution, or do I need to do something more sophisticated?
I recommend applying the normalize wrappers at the vectorized environment level like
which performs better in my experience when num_envs>1
. Then you can grab the RunningMeanStd.mean
which is the running mean of all the envs. A better solution in the future is to implement the normalize wrappers in the torch
or jax
side so that the implementation could also benefit from hardware acceleration and a more straightforward model-saving procedure.
Q2: Are there any other similar "gotchas" that could also cause worse performance when saving/loading an agent?
I think ppo_continuous_action
is the main one with the environment storing some training states. Others should be fine. With #292 we should also have a more quantitative answer to this question.
@vwxyzjn Just an update: the above did seem to fix the problem. After retraining and using the learned normalized observations, the newly loaded agent had returns similar to the those from the end of the training. Thank you again!
Sorry, I have a question. Now I have a setpoint control problem and used ppo_continuous_action.py for training. in evaluation, I write the code as the following:
env = gym.vector.SyncVectorEnv(
[make_env(1, 1, False, "1", 0.99)]
)
state_dict = torch.load('./runs/Shell-v0__ppo_continuous_action__1__1687066897/agent_500.pt')
agent = Agent(env)
agent.load_state_dict(state_dict, strict=True)
s, _ = env.reset()
while True:
s_mean = env.envs[0].obs_rms.mean
s_var = env.envs[0].obs_rms.var
s_epsilon = env.envs[0].epsilon
# a, log_prob, entropy, value = agent.get_action_and_value(torch.FloatTensor(s))
a = agent.actor_mean(torch.FloatTensor(s)).detach().numpy()
s_, r, terminated, truncated, info = env.step(a)
s_re = s * np.sqrt(s_var + s_epsilon) + s_mean
y.append(s_re.reshape(9, -1)[3:6, -1])
du.append(s_re.reshape(9, -1)[-3:, -1])
if "final_info" not in info:
continue
for info in info["final_info"]:
# Skip the envs that are not done
if info is None:
continue
print(f"episodic_return={info['episode']['r']}")
if truncated:
break
s = s_
For one thing, the episodic_return printed is worse than trained, then, I found that even at the state just reset, the action given by the policy is quite different with trained. Is there any problem with my code?
Sorry, I have a question. Now I have a setpoint control problem and used ppo_continuous_action.py for training. in evaluation, I write the code as the following:
env = gym.vector.SyncVectorEnv( [make_env(1, 1, False, "1", 0.99)] ) state_dict = torch.load('./runs/Shell-v0__ppo_continuous_action__1__1687066897/agent_500.pt') agent = Agent(env) agent.load_state_dict(state_dict, strict=True) s, _ = env.reset() while True: s_mean = env.envs[0].obs_rms.mean s_var = env.envs[0].obs_rms.var s_epsilon = env.envs[0].epsilon # a, log_prob, entropy, value = agent.get_action_and_value(torch.FloatTensor(s)) a = agent.actor_mean(torch.FloatTensor(s)).detach().numpy() s_, r, terminated, truncated, info = env.step(a) s_re = s * np.sqrt(s_var + s_epsilon) + s_mean y.append(s_re.reshape(9, -1)[3:6, -1]) du.append(s_re.reshape(9, -1)[-3:, -1]) if "final_info" not in info: continue for info in info["final_info"]: # Skip the envs that are not done if info is None: continue print(f"episodic_return={info['episode']['r']}") if truncated: break s = s_
For one thing, the episodic_return printed is worse than trained, then, I found that even at the state just reset, the action given by the policy is quite different with trained. Is there any problem with my code?
Maybe @JamesKCS can share his snippet? The issue is that during training you need to save the states of the wrappers in the environment, such as obs_rms.mean along with agent.pt, which was not done in the snippet you shared
@vwxyzjn update the question above. I just remove all env.wrappers except for FlattenObservation() and RecordEpisodicStatistics() because I have already did normalization in my env.py. From the print info, reward converges still. However the test is not normal.
env_eval = gym.vector.SyncVectorEnv(
[make_env(1, 1, False, "1", 0.99)]
)
epi_r_eval = 0
s_eval, _ = env_eval.reset()
while True:
a_eval, _, _, _ = agent.get_action_and_value(torch.FloatTensor(s_eval))
a_eval = a_eval.detach().numpy()
s_eval_, r_eval, ter, trun, info_eval = env_eval.step(a_eval)
epi_r_eval += r_eval
if trun:
break
s_evl = s_eval_
I added the following code for evaluation after writer.add_scalar
but the result of evaluation is greatly different with training. Log is as the following, really strange. help me plz:
global_step=26400, episodic_return=[-45.999065]
global_step=26500, episodic_return=[-50.956413]
global_step=26600, episodic_return=[-50.758144]
SPS: 2534
----------- EVALUATION ---------------
EPISODE REWARD: [-300791.29141884]
--------------------------------
global_step=26700, episodic_return=[-102.30504]
global_step=26800, episodic_return=[-36.258324]
global_step=26900, episodic_return=[-54.978573]
Roughly speaking, you should do
eval_env.obs_rms = env.obs_rms
eval_env.return_rms = env.return_rms
Roughly speaking, you should do
eval_env.obs_rms = env.obs_rms eval_env.return_rms = env.return_rms
I have removed these wrappers but it still shows bad performance when evaluation?maybe it shouldn't be the problem of obs_rms
You should not remove these wrappers because they are what the agent was trained on.
yes.I tried to remove them in make_env function which means the agent was trained on a original env but the result was the same.
you mean, set eval_env.obs_rms=env.obs_rms
and the env is wrapped by normalizeObservation, and the agent can be evaluated in a right way?
Yes. During training, the agent sees normalized observations, but during your evaluation, the agent sees unnormalized obs, which is prob the reason its performance was bad.
ppo_continuous_action.txt
Sorry again. In order to upload the file I change the suffix to txt. At the last of the code I add the part for regular evaluation. I tried your advice to set the obs_rms of eval env same as training env. But it didn't work.
Then, I tried Pendulum-v1, and Only leave env = gym.wrappers.RecordEpisodeStatistics(env)
in make_env
function. However, it still performs just like random action or something else.
Therefore, I think it's not a problem of obs_rms, because I have totally get rid of all wrappers both in the training and evaluation envs.
PPO does not solve Pendulum-v1
if I recall correctly.
PPO does not solve
Pendulum-v1
if I recall correctly.
Pendulum-v1 is also a continuous action env and I test it. And here is the result.
PPO does not solve
Pendulum-v1
if I recall correctly.
Sorry, No matter how, My question is why the evaluation is failed. Please, I really need it.
Ok I gave it a shot at https://wandb.ai/costa-huang/cleanRL/runs/3nhnaboz. It should work.
Edit: it did work
eval_episodic_return=[-838.558] 5488 eval_episodic_return=[-693.1612] 5489 eval_episodic_return=[-122.41518] 5490 eval_episodic_return=[-126.593254] 5491 eval_episodic_return=[-1525.9708] 5492 eval_episodic_return=[-1092.5537] 5493 eval_episodic_return=[-7.024438] 5494 eval_episodic_return=[-249.01135] 5495 eval_episodic_return=[-827.8679] 5496 ----------- EVALUATION --------------- 5497 EPISODE REWARD: -587.4559 5498 --------------------------------
Long story short, there were some subtle issues with the way you use gym’s api. I’d suggest doing a file diff to identify those differences.
Maybe @JamesKCS can share his snippet?
My solution is hacky, since I just load the wrapper from one environment, rather than the more correct way which is to use
the running mean of all the envs.
However, I'm happy to share in case it helps anyone else (it shouldn't be hard to modify my solution to be more correct):
Save with:
pickle_item(envs.get_obs_norm_rms_obj(), saved_policies_dir+"trial_%s_obs_rms_obj_pickle" % trial_num)
Load with:
def load_learned_observation_normalization(envs, trial_num):
rms_obj = load_pickled_item("saved_policies/trial_%s_obs_rms_obj_pickle" % trial_num)
envs.set_obs_norm_rms_obj(rms_obj)
Helper functions:
class Environment():
def __init__(self, env_id, num_envs, seed, device, capture_video, run_name, gamma):
self.gym_sync_vec_env = gym.vector.SyncVectorEnv([self.clean_rl_make_env(env_id, seed + i, i, capture_video, run_name, gamma) for i in range(num_envs)])
self.capture_video = False
self.num_envs = num_envs
self.global_step = 0
self.device = device
def get_NormalizeObservation_wrapper(self, env_num=0):
return self.gym_sync_vec_env.envs[env_num].env.env.env
def get_obs_norm_rms_obj(self, env_num=0):
return self.get_NormalizeObservation_wrapper(env_num=env_num).obs_rms
def set_obs_norm_rms_obj(self, rms_obj, env_num=0):
self.get_NormalizeObservation_wrapper(env_num=env_num).obs_rms = rms_obj
def get_obs_norm_rms_vars(self, env_num=0):
rms_obj = self.get_obs_norm_rms_obj(env_num)
return rms_obj.mean, rms_obj.var, rms_obj.count
def pickle_item(item, filename):
with open(filename, 'wb') as file:
pickle.dump(item, file)
def load_pickled_item(filename):
with open(filename, 'rb') as file:
return pickle.load(file)
Edit: added the relevant bits of my Environment class, and some relevant helper functions. For anyone looking to imitate this, I recommend trying to understand what I did rather than blindly copying, since this was just a quick hack to get things working.
Maybe @JamesKCS can share his snippet?
My solution is hacky, since I just load the wrapper from one environment, rather than the more correct way which is to use
the running mean of all the envs.
However, I'm happy to share in case it helps anyone else (it shouldn't be hard to modify my solution to be more correct):
Save with:
pickle_item(envs.get_obs_norm_rms_obj(), saved_policies_dir+"trial_%s_obs_rms_obj_pickle" % trial_num)
Load with:
def load_learned_observation_normalization(envs, trial_num): rms_obj = load_pickled_item("saved_policies/trial_%s_obs_rms_obj_pickle" % trial_num) envs.set_obs_norm_rms_obj(rms_obj)
Helper functions:
def get_NormalizeObservation_wrapper(self, env_num=0): return self.gym_sync_vec_env.envs[env_num].env.env.env def get_obs_norm_rms_obj(self, env_num=0): return self.get_NormalizeObservation_wrapper(env_num=env_num).obs_rms def set_obs_norm_rms_obj(self, rms_obj, env_num=0): self.get_NormalizeObservation_wrapper(env_num=env_num).obs_rms = rms_obj def get_obs_norm_rms_vars(self, env_num=0): rms_obj = self.get_obs_norm_rms_obj(env_num) return rms_obj.mean, rms_obj.var, rms_obj.count
Is your evaluation part wrapped during the training process? like, you evaluate once every 10000 steps. If I save the agent for one training, and then I load it in another py or Jupyter notebook. After I create the env, wrapped it with the same NormalizeObservation. Do I need to save / load obs_rms again? The agent receive the normalized observation, so I don't need to do any process when evaluation because the obs has already been normalized?
Is your evaluation part wrapped during the training process? like, you evaluate once every 10000 steps. If I save the agent for one training, and then I load it in another py or Jupyter notebook. After I create the env, wrapped it with the same NormalizeObservation. Do I need to save / load obs_rms again? The agent receive the normalized observation, so I don't need to do any process when evaluation because the obs has already been normalized?
I'm not sure that I fully understand the question, so I'll just walk you through how it works.
pickle_item
call).load_learned_observation_normalization
call.Remember that the environment wrapper is really part of the policy/agent (even though it's not represented that way in the code). So, if you don't load the wrapper correctly when loading a saved policy/agent, you aren't loading the complete policy/agent (since you are effectively "resetting" this component of the policy/agent), and should expect bad results.
Does that clear things up? If you are still confused, it might be worth rereading the top few posts on this thread, since they explain the basic problem and solution. Best of luck with your work! :)
@qiuruiyu P.S. I edited the code snippets above to give more context.
Roughly speaking, you should do
eval_env.obs_rms = env.obs_rms eval_env.return_rms = env.return_rms
For everyone else, env.envs[0].obs_rms
will access the Reward RMS wrapper because it is added later. You should overwrite env.envs[0].env.env.env.obs_rms
instead
Problem Description
Why is there no design evaluation and save model module?