However, when I run the learnt stochastic policy deterministically (selecting just the mean action instead of sampling from the Gaussian distribution), I'm able to get returns of up to 3600, and always above 3000.
It is not uncommon for papers to produce the learning curve with a deterministic policy at evaluation time (from a brief look at the code, the SAC implementation does exactly that by default):
https://github.com/haarnoja/sac/blob/8258e33633c7e37833cc39315891e77adfbe14b2/sac/algos/base.py#L143
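For what it's worth, here is a minimal sketch of that kind of deterministic evaluation for a Gaussian policy (using the old 4-tuple gym step API from that era; `policy` is a hypothetical callable returning the mean and std of the action distribution, not something from gym or the SAC repo):

```python
import gym
import numpy as np

def evaluate(env_name, policy, episodes=10, deterministic=True):
    """Average episodic return of a Gaussian policy.

    `policy(obs)` is assumed to return the (mean, std) of the action
    distribution; with deterministic=True the mean action is used,
    otherwise an action is sampled.
    """
    env = gym.make(env_name)
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            mean, std = policy(obs)
            action = mean if deterministic else np.random.normal(mean, std)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```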
That being said, to answer your question directly, here's what changed in Hopper between v1 and v2 (the diff below is taken from the newer tag v0.9.5 back to the older v0.7.4, so the `-` lines are the v2 code and the `+` lines are the v1 code):
```
$ git diff v0.9.5..v0.7.4 -- gym/envs/mujoco/hopper.py
diff --git a/gym/envs/mujoco/hopper.py b/gym/envs/mujoco/hopper.py
index 28fb144..2a5a399 100644
--- a/gym/envs/mujoco/hopper.py
+++ b/gym/envs/mujoco/hopper.py
@@ -8,9 +8,9 @@ class HopperEnv(mujoco_env.MujocoEnv, utils.EzPickle):
         utils.EzPickle.__init__(self)
 
     def _step(self, a):
-        posbefore = self.sim.data.qpos[0]
+        posbefore = self.model.data.qpos[0, 0]
         self.do_simulation(a, self.frame_skip)
-        posafter, height, ang = self.sim.data.qpos[0:3]
+        posafter, height, ang = self.model.data.qpos[0:3, 0]
         alive_bonus = 1.0
         reward = (posafter - posbefore) / self.dt
         reward += alive_bonus
@@ -23,8 +23,8 @@ class HopperEnv(mujoco_env.MujocoEnv, utils.EzPickle):
 
     def _get_obs(self):
         return np.concatenate([
-            self.sim.data.qpos.flat[1:],
-            np.clip(self.sim.data.qvel.flat, -10, 10)
+            self.model.data.qpos.flat[1:],
+            np.clip(self.model.data.qvel.flat, -10, 10)
         ])
 
     def reset_model(self):
```
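Note that the only difference in the hunks above is the mujoco-py data-access API (`self.model.data` in v0.7.4 vs `self.sim.data` in v0.9.5); the reward is the same formula on both sides, roughly:

```python
def hopper_reward(posbefore, posafter, dt, alive_bonus=1.0):
    # Forward progress of the root x-position over one frame-skipped step,
    # plus the alive bonus, as on both sides of the hunk above.
    # The control-cost penalty and the termination check fall outside the
    # hunks shown, so they are omitted here.
    return (posafter - posbefore) / dt + alive_bonus
```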
So while it is possible that the reward function changed, it would not be due to an explicit change in gym, but rather due to some change in the MuJoCo physics simulator between versions 1.31 and 1.50. Closing this issue.
Hi, to my knowledge Hopper-v1 is deprecated and Hopper-v2 is the standard Hopper as of today. Can someone confirm whether this is true?
In most RL papers I see results where the authors report a return of more than 3000 on Hopper for their algorithm (for example, Soft Actor-Critic), but I'm unable to get those results by rolling out a stochastic policy trained for 1M timesteps on the Hopper-v2 environment. Was the reward function changed between Hopper-v1 and Hopper-v2?
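As a rough sanity check of the reward scale (just a sketch, independent of any learned policy), I would expect stepping Hopper-v2 with zero actions to give a per-step reward close to the 1.0 alive bonus plus a small forward-velocity term:

```python
import gym
import numpy as np

# Rough sanity check: with zero torque, Hopper-v2's per-step reward should be
# roughly the 1.0 alive bonus plus a small (posafter - posbefore) / dt term.
env = gym.make("Hopper-v2")
env.reset()
for _ in range(20):
    _, reward, done, _ = env.step(np.zeros(env.action_space.shape))
    print(round(reward, 3))
    if done:
        break
```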