sail-sg / envpool

C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.
https://envpool.readthedocs.io
Apache License 2.0

[BUG] Breakout-v5 Performance Regression #49

Closed vwxyzjn closed 2 years ago

vwxyzjn commented 2 years ago

Describe the bug

PPO can no longer reach a game score of 400 in Breakout-v5 within 10M training steps (same hyperparameters), whereas it does in BreakoutNoFrameskip-v4.

image

To Reproduce

Run https://wandb.ai/costa-huang/cleanRL/runs/26k4q5jo/code?workspace=user-costa-huang to reproduce the envpool results and https://wandb.ai/costa-huang/cleanRL/runs/1ngqmz96/code?workspace=user-costa-huang to reproduce the BreakoutNoFrameskip-v4 results.

Expected behavior

PPO should reach a game score of 400 in Breakout-v5 within 10M training steps.

System info

Describe the characteristics of your environment:

import envpool, numpy, sys
print(envpool.__version__, numpy.__version__, sys.version, sys.platform)

>>> import envpool, numpy, sys
>>> print(envpool.__version__, numpy.__version__, sys.version, sys.platform)
0.4.3 1.21.5 3.9.5 (default, Jul 19 2021, 13:27:26) 
[GCC 10.3.0] linux

Reason and Possible fixes

I ran gym's ALE/Breakout-v5 as well and also saw a regression, as shown below, but looking into it, that was because ALE/Breakout-v5 by default uses the full action space (14 discrete actions), whereas envpool's Breakout-v5 uses the minimal 4 discrete actions. So I have no idea why the regression happens with envpool...

image
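
A quick way to confirm the action-space difference described above (a sketch; it assumes gym with the ALE ROMs and envpool are both installed, and uses the default v5 registrations as reported):

import gym
import envpool

gym_env = gym.make("ALE/Breakout-v5")  # defaults to the full action space, per the above
pool_env = envpool.make("Breakout-v5", env_type="gym", num_envs=1)  # minimal action set

print(gym_env.action_space)   # full Discrete action space
print(pool_env.action_space)  # Discrete(4)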

vwxyzjn commented 2 years ago

Got some feedback from @JesseFarebro regarding envpool's Breakout-v5 vs ALE/Breakout-v5:

  • They (envpool) aren't doing terminal signal on loss of life, which in Breakout is the single most important setting (we don't recommend it be used).
  • They are doing NOOPs without sticky actions, and you're doing both (no one should be using NOOPs anymore).
  • They don't do reward clipping by default? Check the magnitude of the rewards if you aren't clipping yourself (this is fine).

As shown in the script, I was already doing

    envs = envpool.make(
        args.gym_id,
        env_type="gym",
        num_envs=args.num_envs,
        episodic_life=True,
        reward_clip=True,
    )

Maybe the reason is "They are doing NOOPs without sticky actions and you're doing both (no one should be using NOOPs anymore)"?

Trinkle23897 commented 2 years ago

https://envpool.readthedocs.io/en/latest/api/atari.html

Regarding "(no one should be using NOOPs anymore)": set noop_max=0.
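
A minimal sketch of that change, reusing the make call from the earlier comment (noop_max is the Atari option documented at the link above):

    envs = envpool.make(
        args.gym_id,
        env_type="gym",
        num_envs=args.num_envs,
        episodic_life=True,
        reward_clip=True,
        noop_max=0,  # disable the random NOOPs applied at reset
    )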

vwxyzjn commented 2 years ago

Hmm, but the BreakoutNoFrameskip-v4 result clearly used noop_max (via NoopResetEnv with noop_max=30 below).

def make_env(gym_id, seed, idx, capture_video, run_name):
    def thunk():
        env = gym.make(gym_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video:
            if idx == 0:
                env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        env = NoopResetEnv(env, noop_max=30)
        env = MaxAndSkipEnv(env, skip=4)
        env = EpisodicLifeEnv(env)
        if "FIRE" in env.unwrapped.get_action_meanings():
            env = FireResetEnv(env)
        env = ClipRewardEnv(env)
        env = gym.wrappers.ResizeObservation(env, (84, 84))
        env = gym.wrappers.GrayScaleObservation(env)
        env = gym.wrappers.FrameStack(env, 4)
        env.seed(seed)
        env.action_space.seed(seed)
        env.observation_space.seed(seed)
        return env

    return thunk

vwxyzjn commented 2 years ago

Per a conversation with @Trinkle23897, we think the problem is reward counting. In my implementation I count the rewards across all five lives, but since I was using reward_clip=True in envpool.make, I was summing the clipped rewards from those 5 life-episodes rather than the raw game score (with clipping, every positive reward contributes at most +1, so the reported return sits far below the true Breakout score). Maybe the best solution is to include a game_score key in the info variable.

import gym
import numpy as np


class RecordEpisodeStatistics(gym.Wrapper):
    def __init__(self, env, deque_size=100):
        super(RecordEpisodeStatistics, self).__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)
        self.episode_returns = None
        self.episode_lengths = None
        # check whether the env reports lives in info (Atari games do)
        self.has_lives = False
        env.reset()
        info = env.step(np.zeros(self.num_envs, dtype=int))[-1]
        if info["lives"].sum() > 0:
            self.has_lives = True
            print("env has lives")

    def reset(self, **kwargs):
        observations = super(RecordEpisodeStatistics, self).reset(**kwargs)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        self.lives = np.zeros(self.num_envs, dtype=np.int32)
        self.returned_episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.returned_episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        return observations

    def step(self, action):
        observations, rewards, dones, infos = super(RecordEpisodeStatistics, self).step(
            action
        )
        self.episode_returns += rewards
        self.episode_lengths += 1
        self.returned_episode_returns[:] = self.episode_returns
        self.returned_episode_lengths[:] = self.episode_lengths
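        # with episodic_life=True, `dones` fires on every life loss, so only zero
        # the running totals once all lives are gone (to keep full-game returns);
        # otherwise fall back to resetting on `dones`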
        all_lives_exhausted = infos["lives"] == 0
        if self.has_lives:
            self.episode_returns *= (1 - all_lives_exhausted)
            self.episode_lengths *= (1 - all_lives_exhausted)
        else:
            self.episode_returns *= (1 - dones)
            self.episode_lengths *= (1 - dones)
        infos["r"] = self.returned_episode_returns
        infos["l"] = self.returned_episode_lengths
        return (
            observations,
            rewards,
            dones,
            infos,
        )
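
For completeness, a hypothetical usage sketch of the wrapper above with envpool (the env id, num_envs, and rollout length are illustrative):

import numpy as np
import envpool

envs = envpool.make(
    "Breakout-v5", env_type="gym", num_envs=8,
    episodic_life=True, reward_clip=True,
)
envs = RecordEpisodeStatistics(envs)
obs = envs.reset()
for _ in range(1000):
    actions = np.random.randint(envs.action_space.n, size=8)
    obs, rewards, dones, infos = envs.step(actions)
    # a full game is over on the step where lives hits 0; infos["r"] then holds
    # the accumulated (clipped) return for that env
    for i in np.where(dones & (infos["lives"] == 0))[0]:
        print(f"env {i} episodic return: {infos['r'][i]}")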

vwxyzjn commented 2 years ago

Hey, reporting back here. After using the built wheels that fix the clipped-reward bug (https://github.com/sail-sg/envpool/actions/runs/1690252100), I was able to reproduce PPO's old performance in Breakout (see the blue line).

image

Thanks @Trinkle23897 for the helpful fix.