Thanks for your feedback. We have already fixed this doc problem, and we are re-running an experiment with Hopper+PPO to check its performance.
After checking the code, I found one issue that may lead to the problems above: there are two separate calculations of info['eval_episode_return']. I think the second calculation is wrong, because the reward may already have been normalized by the time it runs.
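To illustrate the ordering I mean, here is a minimal sketch (a hypothetical helper, not DI-engine's actual code): the raw reward should feed the evaluation metric, and only the learner should see the normalized copy.

```python
import gym

def step_and_track(env, action, reward_normalizer, episode_return):
    # Hypothetical helper (not DI-engine code): accumulate the RAW reward
    # for the evaluation metric, then normalize the copy the learner sees.
    obs, rew, done, info = env.step(action)
    episode_return += rew              # raw reward feeds the eval metric
    norm_rew = reward_normalizer(rew)  # only the learner sees this value
    if done:
        info['eval_episode_return'] = episode_return
    return obs, norm_rew, done, info, episode_return

# Usage sketch with a dummy normalizer:
env = gym.make('Hopper-v3')
obs, episode_return = env.reset(), 0.0
obs, rew, done, info, episode_return = step_and_track(
    env, env.action_space.sample(), lambda r: 0.1 * r, episode_return
)
```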
However, after removing the second calculation and adding reward normalization, I still find that the reward mean and eval_episode_return_mean are the same. My changes are as follows:
```diff
--- a/dizoo/mujoco/config/hopper_onppo_config.py
+++ b/dizoo/mujoco/config/hopper_onppo_config.py
@@ -2,11 +2,12 @@ from easydict import EasyDict
 import torch.nn as nn
 
 hopper_onppo_config = dict(
-    exp_name='hopper_onppo_seed0',
+    exp_name='hopper_onppo_envNormalized_seed0',
     env=dict(
-        env_id='Hopper-v3',
-        norm_obs=dict(use_norm=False, ),
-        norm_reward=dict(use_norm=False, ),
+        env_id='Hopper-v2',
+        norm_obs=dict(use_norm=True, ),
+        norm_reward=dict(use_norm=True, reward_discount=0.99),
+        action_clip=True,
         collector_env_num=8,
         evaluator_env_num=10,
         n_evaluator_episode=10,
```
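For context, I read reward_discount as the usual trick of scaling each reward by the standard deviation of a running discounted return. A sketch under that assumption (DI-engine's actual norm_reward implementation may differ):

```python
import numpy as np

class DiscountedReturnRewardNorm:
    # Sketch: scale each reward by the std of a running discounted return
    # (assumption -- DI-engine's norm_reward may differ in detail).
    def __init__(self, reward_discount: float = 0.99, eps: float = 1e-8):
        self.gamma, self.eps = reward_discount, eps
        self.ret = 0.0                     # running discounted return
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def __call__(self, rew: float, done: bool) -> float:
        self.ret = self.gamma * self.ret + rew
        # Welford update of the running variance of the return.
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        std = float(np.sqrt(self.m2 / self.count)) + self.eps
        if done:
            self.ret = 0.0                 # reset at episode boundaries
        if self.count < 2:
            return rew                     # too few samples to scale yet
        return rew / std
```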
```diff
diff --git a/dizoo/mujoco/envs/mujoco_env.py b/dizoo/mujoco/envs/mujoco_env.py
index c150581a..81beb1d2 100644
--- a/dizoo/mujoco/envs/mujoco_env.py
+++ b/dizoo/mujoco/envs/mujoco_env.py
@@ -85,7 +85,7 @@ class MujocoEnv(BaseEnv):
         self._env.seed(self._seed)
         obs = self._env.reset()
         obs = to_ndarray(obs).astype('float32')
-        self._eval_episode_return = 0.
+        # self._eval_episode_return = 0.
         return obs
@@ -108,7 +108,7 @@ class MujocoEnv(BaseEnv):
         if self._action_clip:
             action = np.clip(action, -1, 1)
         obs, rew, done, info = self._env.step(action)
-        self._eval_episode_return += rew
+        # self._eval_episode_return += rew
         if done:
             if self._save_replay_gif:
                 path = os.path.join(
@@ -116,7 +116,7 @@ class MujocoEnv(BaseEnv):
                 )
                 save_frames_as_gif(self._frames, path)
                 self._save_replay_count += 1
-            info['eval_episode_return'] = self._eval_episode_return
+            # info['eval_episode_return'] = self._eval_episode_return
         obs = to_ndarray(obs).astype(np.float32)
         rew = to_ndarray([rew]).astype(np.float32)
```
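With those two accumulation lines commented out, the env no longer reports the raw return, so any episode return has to be accumulated outside the env. A hypothetical evaluator-side sketch (not DI-engine's actual evaluator):

```python
import gym

def evaluate_episode(env: gym.Env) -> float:
    # Sum the rewards the env returns. Caveat: if the env normalizes
    # rewards internally, this sums NORMALIZED rewards, so the reported
    # return tracks the normalized reward mean instead of the raw score.
    obs, episode_return, done = env.reset(), 0.0, False
    while not done:
        action = env.action_space.sample()   # stand-in for the policy
        obs, rew, done, info = env.step(action)
        episode_return += rew
    return episode_return
```

If the evaluation path ends up summing already-normalized rewards like this, that might explain why the reward mean and eval_episode_return_mean coincide.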
The results are as follows:
I have re-run hopper_onppo_config.py with 4 seeds and found results similar to those in our document.
Are you using the latest main branch of DI-engine? You should not use any obs/reward normalization in this config.
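To rule out a stale install, you can print the version you are running (assuming the standard __version__ attribute; verify against your checkout):

```python
import ding
print(ding.__version__)  # compare against the latest main/release
```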
I am trying to replicate the PPO performance in the Hopper-v3 environment, but I have run into some issues.
The first issue concerns the DI-engine documentation: the blue link on "https://di-engine-docs.readthedocs.io/zh-cn/latest/12_policies/ppo.html", shown below, is unavailable. I found that the name of the config file was changed two years ago.
The second issue is that I cannot replicate the PPO performance using the new config file "hopper_onppo_config.py"; my result is as follows:
In this process, I didn't change anything.
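For reference, I launch the config through the standard serial on-policy entry point, roughly like this (a sketch; the entry-point name is my assumption based on DI-engine's examples, so check your version):

```python
from ding.entry import serial_pipeline_onpolicy
from dizoo.mujoco.config.hopper_onppo_config import main_config, create_config

if __name__ == '__main__':
    # Run PPO on Hopper with the unmodified config, seed 0.
    serial_pipeline_onpolicy((main_config, create_config), seed=0)
```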
Can you give me any suggestions for replicating the PPO performance in the Hopper-v3 environment?
Thank you.