Got some feedback from @JesseFarebro regarding envpool's `Breakout-v5` vs `ALE/Breakout-v5`:

- They (envpool) aren't doing terminal signal on loss of life, which in Breakout is the single most important setting. (we don't recommend it be used)
- They are doing NOOPs without sticky actions, and you're doing both (no one should be using NOOPs anymore).
- They don't do reward clipping by default? Check the magnitude of the rewards if you aren't clipping yourself. (this is fine)
As shown in the script, I was already doing:

```python
envs = envpool.make(
    args.gym_id,
    env_type="gym",
    num_envs=args.num_envs,
    episodic_life=True,
    reward_clip=True,
)
```
Maybe the reason is the second point: "They are doing NOOPs without sticky actions and you're doing both (no one should be using NOOPs anymore)"?
https://envpool.readthedocs.io/en/latest/api/atari.html

Set `noop_max=0`.
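A minimal sketch of the same call with NOOPs disabled (`noop_max` is listed in the envpool Atari docs linked above; whether `envpool.make` forwards it like this is my assumption):

```python
# Sketch: same config as before, plus noop_max=0 to disable random NOOPs at reset.
envs = envpool.make(
    args.gym_id,
    env_type="gym",
    num_envs=args.num_envs,
    episodic_life=True,
    reward_clip=True,
    noop_max=0,  # assumption: forwarded to the Atari env config
)
```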
Hmm, but the `BreakoutNoFrameskip-v4` result clearly used `noop_max`:
```python
import gym

# NoopResetEnv, MaxAndSkipEnv, EpisodicLifeEnv, FireResetEnv, and ClipRewardEnv
# are the stable_baselines3 Atari wrappers used throughout CleanRL.
from stable_baselines3.common.atari_wrappers import (
    ClipRewardEnv,
    EpisodicLifeEnv,
    FireResetEnv,
    MaxAndSkipEnv,
    NoopResetEnv,
)


def make_env(gym_id, seed, idx, capture_video, run_name):
    def thunk():
        env = gym.make(gym_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video:
            if idx == 0:
                env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        env = NoopResetEnv(env, noop_max=30)
        env = MaxAndSkipEnv(env, skip=4)
        env = EpisodicLifeEnv(env)
        if "FIRE" in env.unwrapped.get_action_meanings():
            env = FireResetEnv(env)
        env = ClipRewardEnv(env)
        env = gym.wrappers.ResizeObservation(env, (84, 84))
        env = gym.wrappers.GrayScaleObservation(env)
        env = gym.wrappers.FrameStack(env, 4)
        env.seed(seed)
        env.action_space.seed(seed)
        env.observation_space.seed(seed)
        return env

    return thunk
```
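For reference, a sketch of how such a thunk is typically consumed in CleanRL-style scripts (`args` and `run_name` are assumed to come from the surrounding script):

```python
# Sketch: build a synchronous vectorized env from the thunks above.
envs = gym.vector.SyncVectorEnv(
    [make_env(args.gym_id, args.seed + i, i, args.capture_video, run_name) for i in range(args.num_envs)]
)
```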
Per a conversation with @Trinkle23897, we think the problem is reward counting. In my implementation, I counted the rewards from all five lives, but since I was using `reward_clip=True` in `envpool.make`, I was counting the clipped rewards from 5 episodes. Maybe the best solution is to include a `game_score` key in the `info` variable.
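To illustrate the gap (a hypothetical sketch, assuming sign-style reward clipping as in the standard Atari wrappers): Breakout bricks are worth between 1 and 7 points depending on the row, but every brick contributes exactly 1 to the clipped return.

```python
import numpy as np

# Hypothetical rewards within one life: a 1-point and a 7-point brick.
raw_rewards = np.array([1.0, 7.0])
clipped_rewards = np.sign(raw_rewards)  # -> [1., 1.]
print(raw_rewards.sum(), clipped_rewards.sum())  # 8.0 vs 2.0: the clipped return undercounts the game score
```

The `RecordEpisodeStatistics` wrapper below is what I use to accumulate episodic returns and lengths across lives: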
```python
import gym
import numpy as np


class RecordEpisodeStatistics(gym.Wrapper):
    def __init__(self, env, deque_size=100):
        super(RecordEpisodeStatistics, self).__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)
        self.episode_returns = None
        self.episode_lengths = None
        # probe whether the env reports lives (e.g., Atari games)
        self.has_lives = False
        env.reset()
        info = env.step(np.zeros(self.num_envs, dtype=int))[-1]
        if info["lives"].sum() > 0:
            self.has_lives = True
            print("env has lives")

    def reset(self, **kwargs):
        observations = super(RecordEpisodeStatistics, self).reset(**kwargs)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        self.lives = np.zeros(self.num_envs, dtype=np.int32)
        self.returned_episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.returned_episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        return observations

    def step(self, action):
        observations, rewards, dones, infos = super(RecordEpisodeStatistics, self).step(action)
        self.episode_returns += rewards
        self.episode_lengths += 1
        self.returned_episode_returns[:] = self.episode_returns
        self.returned_episode_lengths[:] = self.episode_lengths
        # only reset the running counters once all lives are exhausted,
        # so the reported return covers the full game, not a single life
        all_lives_exhausted = infos["lives"] == 0
        if self.has_lives:
            self.episode_returns *= 1 - all_lives_exhausted
            self.episode_lengths *= 1 - all_lives_exhausted
        else:
            self.episode_returns *= 1 - dones
            self.episode_lengths *= 1 - dones
        infos["r"] = self.returned_episode_returns
        infos["l"] = self.returned_episode_lengths
        return (
            observations,
            rewards,
            dones,
            infos,
        )
```
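A sketch of how this wrapper sits on top of the envpool call from earlier (argument values are illustrative):

```python
# Illustrative usage: wrap the vectorized envpool env so that
# infos["r"] / infos["l"] report per-env episodic returns and lengths.
envs = envpool.make(
    "Breakout-v5",
    env_type="gym",
    num_envs=8,
    episodic_life=True,
    reward_clip=True,
)
envs = RecordEpisodeStatistics(envs)
```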
Hey, reporting back here. After using the built wheels that fix the clipped-reward bug (https://github.com/sail-sg/envpool/actions/runs/1690252100), I was able to reproduce PPO's old performance in Breakout (see the blue line). Thanks @Trinkle23897 for the helpful fix.
Describe the bug
PPO can no longer reproduce 400 game scores in `Breakout-v5` given 10M steps of training (same hyperparameters), as it can in `BreakoutNoFrameskip-v4`.

To Reproduce
Run https://wandb.ai/costa-huang/cleanRL/runs/26k4q5jo/code?workspace=user-costa-huang to reproduce envpool's results and https://wandb.ai/costa-huang/cleanRL/runs/1ngqmz96/code?workspace=user-costa-huang to reproduce the `BreakoutNoFrameskip-v4` results.

Expected behavior
PPO should obtain 400 game scores in `Breakout-v5` given 10M steps of training.

System info
Describe the characteristics of your environment:
Reason and Possible fixes
I ran gym's `ALE/Breakout-v5` as well and also got a regression, as shown below, but looking into it, that was because `ALE/Breakout-v5` by default uses the full action space (14 discrete actions), whereas envpool's `Breakout-v5` has the minimal 4 discrete actions. So I have no idea why the regression happens with envpool...