Got some feedback from @JesseFarebro regarding envpool's `Breakout-v5` vs `ALE/Breakout-v5`:

- They (envpool) aren't doing terminal signal on loss of life, which in Breakout is the single most important setting. (we don't recommend it be used)
- They are doing NOOPs without sticky actions, and you're doing both (no one should be using NOOPs anymore).
- They don't do reward clipping by default? Check the magnitude of the rewards if you aren't clipping yourself. (this is fine)
As shown in the script, I was already doing:

```python
envs = envpool.make(
    args.gym_id,
    env_type="gym",
    num_envs=args.num_envs,
    episodic_life=True,
    reward_clip=True,
)
```
Maybe the reason is the second point: "They are doing NOOPs without sticky actions and you're doing both (no one should be using NOOPs anymore)"?
https://envpool.readthedocs.io/en/latest/api/atari.html

Set `noop_max=0`.
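A minimal sketch of the same call with NOOPs disabled (`noop_max` is listed in the envpool Atari docs linked above; whether `envpool.make` forwards it like this is my assumption):

```python
# Sketch: same config as before, plus noop_max=0 to disable random NOOPs at reset.
envs = envpool.make(
    args.gym_id,
    env_type="gym",
    num_envs=args.num_envs,
    episodic_life=True,
    reward_clip=True,
    noop_max=0,  # assumption: forwarded to the Atari env config
)
```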
Hmm, but the `BreakoutNoFrameskip-v4` result clearly used `noop_max`:
```python
import gym

# NoopResetEnv, MaxAndSkipEnv, EpisodicLifeEnv, FireResetEnv, and ClipRewardEnv
# are the stable_baselines3 Atari wrappers used throughout CleanRL.
from stable_baselines3.common.atari_wrappers import (
    ClipRewardEnv,
    EpisodicLifeEnv,
    FireResetEnv,
    MaxAndSkipEnv,
    NoopResetEnv,
)


def make_env(gym_id, seed, idx, capture_video, run_name):
    def thunk():
        env = gym.make(gym_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video:
            if idx == 0:
                env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        env = NoopResetEnv(env, noop_max=30)
        env = MaxAndSkipEnv(env, skip=4)
        env = EpisodicLifeEnv(env)
        if "FIRE" in env.unwrapped.get_action_meanings():
            env = FireResetEnv(env)
        env = ClipRewardEnv(env)
        env = gym.wrappers.ResizeObservation(env, (84, 84))
        env = gym.wrappers.GrayScaleObservation(env)
        env = gym.wrappers.FrameStack(env, 4)
        env.seed(seed)
        env.action_space.seed(seed)
        env.observation_space.seed(seed)
        return env

    return thunk
```
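For reference, a sketch of how such a thunk is typically consumed in CleanRL-style scripts (`args` and `run_name` are assumed to come from the surrounding script):

```python
# Sketch: build a synchronous vectorized env from the thunks above.
envs = gym.vector.SyncVectorEnv(
    [make_env(args.gym_id, args.seed + i, i, args.capture_video, run_name) for i in range(args.num_envs)]
)
```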
Per a conversation with @Trinkle23897, we think the problem is reward counting. In my implementation, I counted the rewards from all five lives, but since I was using `reward_clip=True` in `envpool.make`, I was counting the clipped rewards from 5 episodes. Maybe the best solution is to include a `game_score` key in the `info` variable.
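To illustrate the gap (a hypothetical sketch, assuming sign-style reward clipping as in the standard Atari wrappers): Breakout bricks are worth between 1 and 7 points depending on the row, but every brick contributes exactly 1 to the clipped return.

```python
import numpy as np

# Hypothetical rewards within one life: a 1-point and a 7-point brick.
raw_rewards = np.array([1.0, 7.0])
clipped_rewards = np.sign(raw_rewards)  # -> [1., 1.]
print(raw_rewards.sum(), clipped_rewards.sum())  # 8.0 vs 2.0: the clipped return undercounts the game score
```

The `RecordEpisodeStatistics` wrapper below is what I use to accumulate episodic returns and lengths across lives: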
```python
import gym
import numpy as np


class RecordEpisodeStatistics(gym.Wrapper):
    def __init__(self, env, deque_size=100):
        super(RecordEpisodeStatistics, self).__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)
        self.episode_returns = None
        self.episode_lengths = None
        # probe whether the env reports lives (e.g., Atari games)
        self.has_lives = False
        env.reset()
        info = env.step(np.zeros(self.num_envs, dtype=int))[-1]
        if info["lives"].sum() > 0:
            self.has_lives = True
            print("env has lives")

    def reset(self, **kwargs):
        observations = super(RecordEpisodeStatistics, self).reset(**kwargs)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        self.lives = np.zeros(self.num_envs, dtype=np.int32)
        self.returned_episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.returned_episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        return observations

    def step(self, action):
        observations, rewards, dones, infos = super(RecordEpisodeStatistics, self).step(action)
        self.episode_returns += rewards
        self.episode_lengths += 1
        self.returned_episode_returns[:] = self.episode_returns
        self.returned_episode_lengths[:] = self.episode_lengths
        # only reset the running counters once all lives are exhausted,
        # so the reported return covers the full game, not a single life
        all_lives_exhausted = infos["lives"] == 0
        if self.has_lives:
            self.episode_returns *= 1 - all_lives_exhausted
            self.episode_lengths *= 1 - all_lives_exhausted
        else:
            self.episode_returns *= 1 - dones
            self.episode_lengths *= 1 - dones
        infos["r"] = self.returned_episode_returns
        infos["l"] = self.returned_episode_lengths
        return (
            observations,
            rewards,
            dones,
            infos,
        )
```
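A sketch of how this wrapper sits on top of the envpool call from earlier (argument values are illustrative):

```python
# Illustrative usage: wrap the vectorized envpool env so that
# infos["r"] / infos["l"] report per-env episodic returns and lengths.
envs = envpool.make(
    "Breakout-v5",
    env_type="gym",
    num_envs=8,
    episodic_life=True,
    reward_clip=True,
)
envs = RecordEpisodeStatistics(envs)
```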
Hey, reporting back here. After using the built wheels that fix the clipped-reward bug (https://github.com/sail-sg/envpool/actions/runs/1690252100), I was able to reproduce PPO's old performance in Breakout (see the blue line). Thanks @Trinkle23897 for the helpful fix.
Describe the bug
PPO can no longer reproduce 400 game scores in `Breakout-v5` given 10M steps of training (same hyperparameters), as it can in `BreakoutNoFrameskip-v4`.

To Reproduce
Run https://wandb.ai/costa-huang/cleanRL/runs/26k4q5jo/code?workspace=user-costa-huang to reproduce envpool's results and https://wandb.ai/costa-huang/cleanRL/runs/1ngqmz96/code?workspace=user-costa-huang to reproduce the `BreakoutNoFrameskip-v4` results.

Expected behavior
PPO should obtain 400 game scores in `Breakout-v5` given 10M steps of training.

System info
Describe the characteristics of your environment:
Reason and Possible fixes
I ran gym's `ALE/Breakout-v5` as well and also got a regression, as shown below, but looking into it, that was because `ALE/Breakout-v5` by default uses the full action space (14 discrete actions), whereas envpool's `Breakout-v5` has the minimal 4 discrete actions. So I have no idea why the regression happens with envpool...