opendilab / DI-engine

OpenDILab Decision AI Engine. The Most Comprehensive Reinforcement Learning Framework
https://di-engine-docs.readthedocs.io
Apache License 2.0

Replicating the PPO Performance in the Hopper-v3 Environment #818

Closed. hyLiu1994 closed this issue 2 months ago.

hyLiu1994 commented 3 months ago

I am trying to replicate the PPO performance in the Hopper-v3 environment, but I have found some issues.

The first issue concerns the DI-engine documentation: the blue link on "https://di-engine-docs.readthedocs.io/zh-cn/latest/12_policies/ppo.html", shown below, is unavailable.

[screenshots of the broken link in the documentation]

I found that the name of the config file was changed two years ago.

[screenshot showing the config file rename]

The second issue is that I cannot replicate the PPO performance using the new config file "hopper_onppo_config.py"; my results are as follows:

[screenshots of my results]

In this process, I did not change anything.
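
For reference, I launched the experiment in the standard way for dizoo configs, roughly as in the sketch below (this assumes the config file still exposes main_config / create_config and is driven by the ding.entry.serial_pipeline_onpolicy entry point; adjust if the actual entry differs):

```python
# Minimal launch sketch for the unmodified config. Assumes the config file
# still defines main_config / create_config and that ding.entry provides
# serial_pipeline_onpolicy, as in recent DI-engine versions.
from ding.entry import serial_pipeline_onpolicy
from dizoo.mujoco.config.hopper_onppo_config import main_config, create_config

if __name__ == "__main__":
    serial_pipeline_onpolicy((main_config, create_config), seed=0)
```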

Can you give me any suggestions for replicating the PPO performance in the Hopper-v3 environment?

Thank you.

PaParaZz1 commented 3 months ago

Thanks for your feedback. We have already fixed this documentation problem, and we are re-running a Hopper+PPO experiment to check its performance.

hyLiu1994 commented 2 months ago

After checking the code, I found one issue that may cause the problems above.

info['eval_episode_return'] is calculated in two different places.

I think the second calculation is wrong, because the reward may already have been normalized before it happens.
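
To make this concrete, the safe pattern is to accumulate the raw environment reward for evaluation first and only then apply any normalization to the reward handed to the learner. A generic sketch (hypothetical wrapper and names, not DI-engine's actual class; old gym 4-tuple step API):

```python
import gym


class EvalReturnThenNormalize(gym.Wrapper):
    """Hypothetical wrapper: accumulate the raw return for evaluation,
    then scale the reward that is passed on to the learner."""

    def __init__(self, env, reward_scale=1.0):
        super().__init__(env)
        self._eval_episode_return = 0.0
        self._reward_scale = reward_scale  # stand-in for a real reward normalizer

    def reset(self, **kwargs):
        self._eval_episode_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        # 1) accumulate the *raw* reward for evaluation
        self._eval_episode_return += rew
        if done:
            info['eval_episode_return'] = self._eval_episode_return
        # 2) only afterwards normalize/scale the training reward
        return obs, rew * self._reward_scale, done, info
```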

However, after removing the second calculation and adding reward normalization, I still find that reward_mean and eval_episode_return_mean are the same. My changes are as follows:

--- a/dizoo/mujoco/config/hopper_onppo_config.py
+++ b/dizoo/mujoco/config/hopper_onppo_config.py
@@ -2,11 +2,12 @@ from easydict import EasyDict
 import torch.nn as nn

 hopper_onppo_config = dict(
-    exp_name='hopper_onppo_seed0',
+    exp_name='hopper_onppo_envNormalized_seed0',
     env=dict(
-        env_id='Hopper-v3',
-        norm_obs=dict(use_norm=False, ),
-        norm_reward=dict(use_norm=False, ),
+        env_id='Hopper-v2',
+        norm_obs=dict(use_norm=True, ),
+        norm_reward=dict(use_norm=True, reward_discount=0.99),
+        action_clip=True,
         collector_env_num=8,
         evaluator_env_num=10,
         n_evaluator_episode=10,
diff --git a/dizoo/mujoco/envs/mujoco_env.py b/dizoo/mujoco/envs/mujoco_env.py
index c150581a..81beb1d2 100644
--- a/dizoo/mujoco/envs/mujoco_env.py
+++ b/dizoo/mujoco/envs/mujoco_env.py
@@ -85,7 +85,7 @@ class MujocoEnv(BaseEnv):
             self._env.seed(self._seed)
         obs = self._env.reset()
         obs = to_ndarray(obs).astype('float32')
-        self._eval_episode_return = 0.
+        # self._eval_episode_return = 0.

         return obs

@@ -108,7 +108,7 @@ class MujocoEnv(BaseEnv):
         if self._action_clip:
             action = np.clip(action, -1, 1)
         obs, rew, done, info = self._env.step(action)
-        self._eval_episode_return += rew
+        # self._eval_episode_return += rew
         if done:
             if self._save_replay_gif:
                 path = os.path.join(
@@ -116,7 +116,7 @@ class MujocoEnv(BaseEnv):
                 )
                 save_frames_as_gif(self._frames, path)
                 self._save_replay_count += 1
-            info['eval_episode_return'] = self._eval_episode_return
+            # info['eval_episode_return'] = self._eval_episode_return

         obs = to_ndarray(obs).astype(np.float32)
         rew = to_ndarray([rew]).astype(np.float32)
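
For context, I assume norm_reward with reward_discount=0.99 standardizes the reward by the running standard deviation of a discounted return estimate, roughly as in the generic sketch below (my own illustration, not DI-engine's actual wrapper):

```python
import numpy as np


class RunningRewardNorm:
    """Generic sketch of discounted-return reward normalization.
    Illustration only; not DI-engine's actual norm_reward implementation."""

    def __init__(self, discount=0.99, eps=1e-8):
        self.discount = discount
        self.eps = eps
        self.ret = 0.0     # running discounted return
        self.count = 1e-4  # weak prior so the variance is defined from step one
        self.mean = 0.0
        self.var = 1.0

    def _update(self, x):
        # Incremental (Welford-style) mean/variance update.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def __call__(self, reward, done):
        self.ret = self.ret * self.discount + reward
        self._update(self.ret)
        if done:
            self.ret = 0.0
        # Scale (not center) the reward by the std of the discounted return,
        # clipping to keep early estimates from blowing up.
        return float(np.clip(reward / (np.sqrt(self.var) + self.eps), -10.0, 10.0))
```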

The results are as follows:

[screenshots of the results after these changes]

PaParaZz1 commented 2 months ago
[screenshot: Screen Shot 2024-07-19 at 4 34 08 PM, results of the 4-seed re-run]

I have re-run hopper_onppo_config.py with 4 seeds and found results similar to those in our documentation. Are you using the latest main branch of DI-engine? You should not use any obs/reward normalization in this config.
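
For reference, the relevant baseline env settings are the ones your diff removes; a minimal reminder (the variable name baseline_env_cfg is only illustrative):

```python
from easydict import EasyDict

# Baseline env block of hopper_onppo_config.py, as in the "-" lines of the
# diff above: both observation and reward normalization stay disabled.
baseline_env_cfg = EasyDict(
    env_id='Hopper-v3',
    norm_obs=dict(use_norm=False),
    norm_reward=dict(use_norm=False),
    collector_env_num=8,
    evaluator_env_num=10,
    n_evaluator_episode=10,
)
```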