vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Why is there no evaluation and model-saving module? #310

Open madlsj opened 1 year ago

madlsj commented 1 year ago

Problem Description

Why is there no evaluation and model-saving module?

vwxyzjn commented 1 year ago

Hi @madlsj, thanks for raising this issue. We are hoping to gradually enable model saving and evaluation in #292. So far, we have omitted evaluation for a list of reasons.

That said, with recent advances such as reincarnating RL aiming to reuse existing models, model saving and evaluation are becoming increasingly relevant, so we are gradually adopting them.

JamesKCS commented 1 year ago

Hi @vwxyzjn, thank you for your excellent work on this awesome library. I have a couple questions closely related to the above, and thought I'd piggyback on this issue rather than making my own:

I am attempting to save/load, for evaluation, an agent trained with code based on your ppo_continuous_action.py. The agent performed significantly worse after reloading than it did during the latter stages of training. Eventually, I figured out what you mentioned above: I need to save and load the state of the NormalizeObservation wrapper, since that is effectively a learned part of the agent.

So I am saving and loading the observation RunningMeanStd.mean, RunningMeanStd.var, and RunningMeanStd.count variables. Since training is parallelized across n environments, and there seem to be n instances of the above variables, I simply grab the ones from the environment with index 0. Q1: Is just grabbing those variables from index 0 a reasonable solution, or do I need to do something more sophisticated?

Before I run more long trials, Q2: Are there any other similar "gotchas" that could also cause worse performance when saving/loading an agent? (I know that if I want to resume training, I'd have to also save/load the reward normalization RunningMeanStd data, but that shouldn't be necessary just for evaluation, correct?) Thank you!

vwxyzjn commented 1 year ago

Hi @JamesKCS, thanks for sharing these issues.

Q1: Is just grabbing those variables from index 0 a reasonable solution, or do I need to do something more sophisticated?

I recommend applying the normalize wrappers at the vectorized environment level like

https://github.com/vwxyzjn/envpool-cleanrl/blob/880552a168c08af334b5e5d8868bfbe5ea881445/ppo_continuous_action_envpool.py#L194-L198

which performs better in my experience when num_envs > 1. Then you can grab RunningMeanStd.mean, which is the running mean across all the envs. A better future solution would be to implement the normalize wrappers on the torch or JAX side, so the implementation could also benefit from hardware acceleration and a more straightforward model-saving procedure.
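For illustration, a minimal sketch of that vector-level layout (not CleanRL's shipped code: the env id, num_envs, and clip values are placeholders, and it assumes a gymnasium version below 1.0 whose Normalize* wrappers accept vector envs, as in the linked envpool script):

import numpy as np
import gymnasium as gym

# Build the vectorized env first, then normalize once at the vector level so a
# single RunningMeanStd is shared by all sub-envs.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("HalfCheetah-v4") for _ in range(4)])
obs_norm = gym.wrappers.NormalizeObservation(envs)           # one shared obs_rms
envs = gym.wrappers.TransformObservation(obs_norm, lambda obs: np.clip(obs, -10, 10))
rew_norm = gym.wrappers.NormalizeReward(envs, gamma=0.99)    # one shared return_rms
envs = gym.wrappers.TransformReward(rew_norm, lambda r: np.clip(r, -10, 10))

# Keeping references to obs_norm/rew_norm makes checkpointing simple, e.g.
# torch.save({"agent": agent.state_dict(), "obs_rms": obs_norm.obs_rms}, path)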

Q2: Are there any other similar "gotchas" that could also cause worse performance when saving/loading an agent?

I think ppo_continuous_action is the main one where the environment stores some training state. Others should be fine. With #292 we should also have a more quantitative answer to this question.

JamesKCS commented 1 year ago

@vwxyzjn Just an update: the above did seem to fix the problem. After retraining and using the learned normalized observations, the newly loaded agent had returns similar to those from the end of training. Thank you again!

qiuruiyu commented 1 year ago

Sorry, I have a question. I have a setpoint control problem and used ppo_continuous_action.py for training. For evaluation, I wrote the following code:

env = gym.vector.SyncVectorEnv(
        [make_env(1, 1, False, "1", 0.99)]
)
state_dict = torch.load('./runs/Shell-v0__ppo_continuous_action__1__1687066897/agent_500.pt')
agent = Agent(env)
agent.load_state_dict(state_dict, strict=True)

s, _ = env.reset() 

while True:
    s_mean = env.envs[0].obs_rms.mean 
    s_var = env.envs[0].obs_rms.var 
    s_epsilon = env.envs[0].epsilon 

    # a, log_prob, entropy, value = agent.get_action_and_value(torch.FloatTensor(s))
    a = agent.actor_mean(torch.FloatTensor(s)).detach().numpy()
    s_, r, terminated, truncated, info = env.step(a)

    s_re = s * np.sqrt(s_var + s_epsilon) + s_mean 
    y.append(s_re.reshape(9, -1)[3:6, -1])
    du.append(s_re.reshape(9, -1)[-3:, -1])
    if "final_info" not in info:
                    continue
    for info in info["final_info"]:
        # Skip the envs that are not done
        if info is None:
            continue
        print(f"episodic_return={info['episode']['r']}")
    if truncated:
        break 
    s = s_ 

For one thing, the printed episodic_return is worse than during training. I also found that even at the freshly reset state, the action given by the policy is quite different from what it was during training. Is there any problem with my code?

vwxyzjn commented 1 year ago

(Quoting qiuruiyu's comment and evaluation snippet above.)

Maybe @JamesKCS can share his snippet? The issue is that during training you need to save the states of the wrappers in the environment, such as obs_rms.mean, along with agent.pt, which was not done in the snippet you shared.
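As a hedged sketch (not an existing CleanRL feature) of what that could look like inside the training loop, assuming the default wrapper order of ppo_continuous_action.py and its run_name/global_step variables:

import torch

# In the default make_env, NormalizeObservation sits three wrappers below the
# outermost one (TransformReward -> NormalizeReward -> TransformObservation ->
# NormalizeObservation), so its running statistics are reached via:
obs_rms = envs.envs[0].env.env.env.obs_rms

torch.save(
    {
        "agent": agent.state_dict(),
        "obs_rms_mean": obs_rms.mean,
        "obs_rms_var": obs_rms.var,
        "obs_rms_count": obs_rms.count,
    },
    f"runs/{run_name}/agent_{global_step}.pt",  # hypothetical checkpoint path
)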

qiuruiyu commented 1 year ago

@vwxyzjn An update to the question above: I removed all env wrappers except FlattenObservation() and RecordEpisodeStatistics(), because I have already done normalization in my env.py. From the printed info, the reward still converges. However, the evaluation is not normal.

env_eval = gym.vector.SyncVectorEnv(
    [make_env(1, 1, False, "1", 0.99)]
)
epi_r_eval = 0
s_eval, _ = env_eval.reset()
while True:
    a_eval, _, _, _ = agent.get_action_and_value(torch.FloatTensor(s_eval))
    a_eval = a_eval.detach().numpy()
    s_eval_, r_eval, ter, trun, info_eval = env_eval.step(a_eval)
    epi_r_eval += r_eval
    if trun:
        break
    s_evl = s_eval_

I added the code above for evaluation after writer.add_scalar, but the evaluation result is greatly different from training. The log is below; it is really strange. Please help:

global_step=26400, episodic_return=[-45.999065]
global_step=26500, episodic_return=[-50.956413]
global_step=26600, episodic_return=[-50.758144]
SPS: 2534
----------- EVALUATION ---------------
EPISODE REWARD:  [-300791.29141884]
--------------------------------
global_step=26700, episodic_return=[-102.30504]
global_step=26800, episodic_return=[-36.258324]
global_step=26900, episodic_return=[-54.978573]

vwxyzjn commented 1 year ago

Roughly speaking, you should do

eval_env.obs_rms = env.obs_rms
eval_env.return_rms = env.return_rms
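
A hedged sketch of wiring that into an evaluation step (the make_env arguments mirror the snippet above, and the .env.env.env path assumes ppo_continuous_action.py's default wrapper order; adjust for your own stack):

# Build an eval env with the same wrappers as training (arguments are placeholders).
env_eval = gym.vector.SyncVectorEnv([make_env(1, 1, False, "1", 0.99)])

# Point the eval wrapper at the training wrappers' running statistics so the agent
# sees observations scaled exactly as it did during training.
env_eval.envs[0].env.env.env.obs_rms = envs.envs[0].env.env.env.obs_rms
# (return_rms on the NormalizeReward wrapper only matters if you resume training.)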

qiuruiyu commented 1 year ago

Roughly speaking, you should do

eval_env.obs_rms = env.obs_rms
eval_env.return_rms = env.return_rms

I have removed these wrappers, but it still shows bad performance during evaluation. Maybe it isn't a problem with obs_rms?

vwxyzjn commented 1 year ago

You should not remove these wrappers because they are what the agent was trained on.

qiuruiyu commented 1 year ago

Yes. I tried removing them in the make_env function, which means the agent was trained on the original env, but the result was the same. You mean: set eval_env.obs_rms = env.obs_rms while the env is wrapped with NormalizeObservation, and then the agent can be evaluated correctly?

vwxyzjn commented 1 year ago

Yes. During training, the agent sees normalized observations, but during your evaluation, the agent sees unnormalized obs, which is probably the reason its performance was bad.
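Concretely, a hedged sketch of reproducing that normalization at evaluation time with frozen statistics (saved_mean/saved_var stand for whatever you stored from the training NormalizeObservation wrapper; agent and env_eval are assumed to exist already):

import numpy as np
import torch

def normalize_obs(obs, mean, var, epsilon=1e-8):
    # Same transform NormalizeObservation applies, but without updating the stats.
    return (obs - mean) / np.sqrt(var + epsilon)

s, _ = env_eval.reset()
s = normalize_obs(s, saved_mean, saved_var)
a = agent.actor_mean(torch.as_tensor(s, dtype=torch.float32)).detach().numpy()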

qiuruiyu commented 1 year ago

ppo_continuous_action.txt Sorry again. In order to upload the file, I changed the suffix to txt. At the end of the code, I added the part for regular evaluation. I tried your advice to set the obs_rms of the eval env to the same as the training env's, but it didn't work. Then I tried Pendulum-v1 and left only env = gym.wrappers.RecordEpisodeStatistics(env) in the make_env function. However, it still performs just like random actions or something similar. Therefore, I think it's not a problem with obs_rms, because I have completely gotten rid of all wrappers in both the training and evaluation envs.

vwxyzjn commented 1 year ago

PPO does not solve Pendulum-v1 if I recall correctly.

qiuruiyu commented 1 year ago

PPO does not solve Pendulum-v1 if I recall correctly.

Pendulum-v1 is also a continuous-action env, and I tested it. Here is the result.

[image: training result]

qiuruiyu commented 1 year ago

PPO does not solve Pendulum-v1 if I recall correctly.

Sorry, but either way, my question is why the evaluation fails. Please, I really need help with this.

vwxyzjn commented 1 year ago

Ok I gave it a shot at https://wandb.ai/costa-huang/cleanRL/runs/3nhnaboz. It should work.

Edit: it did work

eval_episodic_return=[-838.558]
eval_episodic_return=[-693.1612]
eval_episodic_return=[-122.41518]
eval_episodic_return=[-126.593254]
eval_episodic_return=[-1525.9708]
eval_episodic_return=[-1092.5537]
eval_episodic_return=[-7.024438]
eval_episodic_return=[-249.01135]
eval_episodic_return=[-827.8679]
----------- EVALUATION ---------------
EPISODE REWARD: -587.4559
--------------------------------

Long story short, there were some subtle issues with the way you use gym’s api. I’d suggest doing a file diff to identify those differences.

JamesKCS commented 1 year ago

Maybe @JamesKCS can share his snippet?

My solution is hacky, since I just load the wrapper from one environment, rather than the more correct way, which is to use "the running mean of all the envs." However, I'm happy to share it in case it helps anyone else (it shouldn't be hard to modify my solution to be more correct):

Save with:

pickle_item(envs.get_obs_norm_rms_obj(), saved_policies_dir+"trial_%s_obs_rms_obj_pickle" % trial_num)

Load with:

def load_learned_observation_normalization(envs, trial_num):
    rms_obj = load_pickled_item("saved_policies/trial_%s_obs_rms_obj_pickle" % trial_num)
    envs.set_obs_norm_rms_obj(rms_obj)

Helper functions:

class Environment():
    def __init__(self, env_id, num_envs, seed, device, capture_video, run_name, gamma):
        self.gym_sync_vec_env = gym.vector.SyncVectorEnv([self.clean_rl_make_env(env_id, seed + i, i, capture_video, run_name, gamma) for i in range(num_envs)])
        self.capture_video = False
        self.num_envs = num_envs
        self.global_step = 0
        self.device = device

    def get_NormalizeObservation_wrapper(self, env_num=0):
        return self.gym_sync_vec_env.envs[env_num].env.env.env

    def get_obs_norm_rms_obj(self, env_num=0):
        return self.get_NormalizeObservation_wrapper(env_num=env_num).obs_rms

    def set_obs_norm_rms_obj(self, rms_obj, env_num=0):
        self.get_NormalizeObservation_wrapper(env_num=env_num).obs_rms = rms_obj

    def get_obs_norm_rms_vars(self, env_num=0):
        rms_obj = self.get_obs_norm_rms_obj(env_num)
        return rms_obj.mean, rms_obj.var, rms_obj.count

def pickle_item(item, filename):
    with open(filename, 'wb') as file:
        pickle.dump(item, file)

def load_pickled_item(filename):
    with open(filename, 'rb') as file:
        return pickle.load(file)

Edit: added the relevant bits of my Environment class, and some relevant helper functions. For anyone looking to imitate this, I recommend trying to understand what I did rather than blindly copying, since this was just a quick hack to get things working.

qiuruiyu commented 1 year ago

(Quoting JamesKCS's snippet and explanation above.)

Is your evaluation wrapped into the training process, e.g., evaluating once every 10000 steps? If I save the agent from one training run and then load it in another Python script or Jupyter notebook, and after creating the env I wrap it with the same NormalizeObservation, do I still need to save/load obs_rms? The agent receives normalized observations, so I don't need any extra processing at evaluation because the obs have already been normalized, right?

JamesKCS commented 1 year ago

Is your evaluation wrapped into the training process, e.g., evaluating once every 10000 steps? If I save the agent from one training run and then load it in another Python script or Jupyter notebook, and after creating the env I wrap it with the same NormalizeObservation, do I still need to save/load obs_rms? The agent receives normalized observations, so I don't need any extra processing at evaluation because the obs have already been normalized, right?

I'm not sure that I fully understand the question, so I'll just walk you through how it works.

  1. When saving the agent, you also have to save the env wrapper (in my code, this is the pickle_item call).
  2. When loading an agent (whether for evaluation or for further training), you have to load the env wrapper. In my code, this is the load_learned_observation_normalization call.

Remember that the environment wrapper is really part of the policy/agent (even though it's not represented that way in the code). So, if you don't load the wrapper correctly when loading a saved policy/agent, you aren't loading the complete policy/agent (since you are effectively "resetting" this component of the policy/agent), and should expect bad results.

Does that clear things up? If you are still confused, it might be worth rereading the top few posts on this thread, since they explain the basic problem and solution. Best of luck with your work! :)

JamesKCS commented 1 year ago

@qiuruiyu P.S. I edited the code snippets above to give more context.

roger-creus commented 10 months ago

Roughly speaking, you should do

eval_env.obs_rms = env.obs_rms
eval_env.return_rms = env.return_rms

For everyone else: env.envs[0].obs_rms will access the reward RMS wrapper because it is added later. You should overwrite env.envs[0].env.env.env.obs_rms instead.
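
If hard-coding .env.env.env feels brittle, a small helper (a sketch, not part of CleanRL, assuming the gymnasium import used by recent CleanRL scripts) can walk the wrapper chain and find the NormalizeObservation wrapper regardless of stacking order:

import gymnasium as gym

def find_wrapper(env, wrapper_cls):
    # Walk down the .env chain until we hit the wrapper we are looking for.
    while env is not None:
        if isinstance(env, wrapper_cls):
            return env
        env = getattr(env, "env", None)
    raise ValueError(f"{wrapper_cls.__name__} not found in the wrapper chain")

# Usage: share the training statistics with an eval env, whatever the wrapper order.
train_obs_norm = find_wrapper(envs.envs[0], gym.wrappers.NormalizeObservation)
eval_obs_norm = find_wrapper(env_eval.envs[0], gym.wrappers.NormalizeObservation)
eval_obs_norm.obs_rms = train_obs_norm.obs_rms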