ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] PPO training on Atari Environment using standard hyperparameters gives poor results #39255

Open rajfly opened 1 year ago

rajfly commented 1 year ago

What happened + What you expected to happen

Hi, I recently tried to recreate the experiments from the original PPO paper. First I used Stable Baselines3 to do so and noticed that the reward obtained on the Alien Atari environment was close to what was reported in the paper. Using hyperparameters similar to those in the original paper, the results did not deviate much, with a mean reward of more than 1500.

I then tried to do the same with RLlib instead of Stable Baselines3. Using the same hyperparameter set, I trained the PPO algorithm on the Alien Atari environment, but only reached a reward of no more than 100 at the end. I am not sure what went wrong during training and was hoping someone could help me out.

Below is the list of intended hyperparameters, which closely follow the original PPO paper (summarized in the sketch after this list). With these hyperparameters, Stable Baselines3 trained the Alien environment to a reward of more than 1500 after 10M timesteps, while RLlib struggled to even reach a reward of 100 after 10M timesteps:

Environment:

Networks:

Algorithm:
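
For quick reference, the concrete values actually set in the reproduction script further below are summarized in the following pseudo-config sketch (key names only loosely follow RLlib's PPOConfig; the full, authoritative settings are in the script itself):

ppo_atari_hparams = {
    # Environment: ALE/Alien-v5 with AtariPreprocessing (84x84, scaled
    # observations, noop_max=0), rewards clipped to [-1, 1], 4-frame stack.
    "num_rollout_workers": 8,
    "rollout_fragment_length": 128,
    "train_batch_size": 128 * 8,
    "sgd_minibatch_size": 32 * 8,
    "num_sgd_iter": 3,
    "lr": 2.5e-4,        # Adam, eps=1e-8
    "gamma": 0.99,
    "lambda": 0.95,      # GAE
    "clip_param": 0.1,
    "vf_loss_coeff": 1.0,
    "entropy_coeff": 0.01,
    "kl_coeff": 0.0,
    # Network: Nature CNN (32/64/64 conv filters, 512-unit dense layer) with
    # separate policy and value heads, as in the custom TorchNature model below.
}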

I have already posted this issue on https://discuss.ray.io but it got temporarily hidden by a spam filter. My username for discuss.ray.io is rajfly. If there are any admins out there, please allow my post. Thanks.

Versions / Dependencies

RLlib version: 2.4.0 (all my previous experiments used this version, so I am sticking with it for consistency). I also tested on the latest RLlib version and saw the same issue.
Python version: 3.9
OS: Ubuntu LTS

Reproduction script

This is the reproduction script. To run it, execute: python file_name.py --gpu 0 --env Alien --algo ppo

import os
import argparse
import random
import time
import uuid
import numpy as np
import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing, TransformReward, FrameStack

import torch
import torch.nn as nn

import ray
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPO, PPOConfig
from ray.rllib.utils.typing import ModelConfigDict
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models import ModelCatalog

from torch.utils.tensorboard import SummaryWriter

def train_eval(config):

    os.environ["CUDA_VISIBLE_DEVICES"] = f'{config.gpu}'

    # Env factory: Atari preprocessing (84x84, scaled obs), reward clipping to [-1, 1], 4-frame stack
    def env_creator(env_config):
        env = gym.make(f"ALE/{config.env}-v5", frameskip=1, render_mode='rgb_array')
        env = AtariPreprocessing(env, noop_max=0, scale_obs=True)
        env = TransformReward(env, lambda x: np.clip(x, -1, 1))
        env = FrameStack(env, 4)
        return env

    if config.algo == 'ppo':
        # Nature-CNN custom model (PyTorch) with separate policy and value heads
        class TorchNature(TorchModelV2, nn.Module):
            def __init__(self, obs_space: gym.spaces.Space, action_space: gym.spaces.Space, num_outputs: int, model_config: ModelConfigDict, name: str):
                super().__init__(obs_space, action_space, num_outputs, model_config, name)
                nn.Module.__init__(self)

                self._model = nn.Sequential(
                    nn.Conv2d(4, 32, 8, 4, 0),
                    nn.ReLU(),
                    nn.Conv2d(32, 64, 4, 2, 0),
                    nn.ReLU(),
                    nn.Conv2d(64, 64, 3, 1, 0),
                    nn.ReLU(),
                    nn.Flatten(),
                    nn.Linear(3136, 512),
                    nn.ReLU(),
                )
                self._pi = nn.Sequential(nn.Linear(512, num_outputs))
                self._vf = nn.Sequential(nn.Linear(512, 1))

            def forward(self, input_dict, state, seq_lens):
                self._out = self._model(input_dict['obs'].float())
                pi_out = self._pi(self._out)
                return pi_out, []

            def value_function(self):
                return torch.reshape(self._vf(self._out), [-1])

        # register env, models
        register_env(f"{config.env}_custom", env_creator=env_creator)
        ModelCatalog.register_custom_model("TorchNature", TorchNature)

        param_space = PPOConfig()
        # PPO hyperparameters following the original paper's Atari settings
        param_space = param_space.training(
            gamma=0.99,
            lr=0.00025,
            # grad_clip=None,
            train_batch_size=128*8,
            model={
                '_disable_preprocessor_api': True,
                'custom_model': 'TorchNature'
            },
            optimizer={'eps': 1e-8},
            # lr_schedule=None,
            use_critic=True,
            use_gae=True,
            lambda_=0.95,
            kl_coeff=0.0,
            sgd_minibatch_size=32*8,
            num_sgd_iter=3,
            vf_loss_coeff=1.0,
            entropy_coeff=0.01,
            # entropy_coeff_schedule=None,
            clip_param=0.1,
            # vf_clip_param=float('inf'),
        )
        param_space = param_space.environment(
            f"{config.env}_custom",
            render_env=False,
            clip_rewards=False,
            normalize_actions=False,
            clip_actions=False,
            auto_wrap_old_gym_envs=False,
        )
        param_space = param_space.framework('torch', eager_tracing=True)
        param_space = param_space.rollouts(
            num_rollout_workers=8, 
            num_envs_per_worker=1, 
            create_env_on_local_worker=False,
            sample_async=False,
            rollout_fragment_length=128, 
            batch_mode='truncate_episodes', 
            preprocessor_pref=None, 
            observation_filter="NoFilter"
        )
        param_space = param_space.evaluation(evaluation_interval=None)
        param_space = param_space.experimental(_disable_preprocessor_api=True)
        param_space = param_space.debugging(logger_config={'type': "ray.tune.logger.NoopLogger"})
        param_space = param_space.resources(num_gpus=1, num_cpus_per_worker=1)
        trainer = PPO(config=param_space)

        train_steps = 10000000
        out_path = os.path.join(os.getcwd(), 'runs', "rllib_torch", 'ppo',f'train_eval_{config.env}_{uuid.uuid4()}')
        writer = SummaryWriter(log_dir=out_path)
        past_episodes = 0
        start = time.time()
        while True:
            results = trainer.train()
            timesteps = results['timesteps_total']
            mean_reward = results['episode_reward_mean']
            episodes_total = results['episodes_total']
            if episodes_total > (past_episodes+100):
                past_episodes = episodes_total
                writer.add_scalar('Timestep/reward', mean_reward, timesteps)
            if timesteps >= train_steps:
                break
        writer.close()
        end = time.time()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--env', type=str, help='Specify gym environment to use', required=True)
    parser.add_argument('--gpu', type=int, help='Specify GPU index', required=True)
    parser.add_argument('--algo', choices=['dqn', 'ppo'], type=str, help='Specify algorithm to use', required=True)
    args = parser.parse_args()
    train_eval(args)

Issue Severity

High: It blocks me from completing my task.

rajfly commented 1 year ago

Update: After some testing, I found that the problem lies with the custom env function. After switching from the custom env function to the environments declared in Ray's registry, with Ray's own preprocessing, the PPO training results seem to be OK. I am not sure why this is the case; perhaps there is a bug in how a custom env fn is handled.
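
For reference, a minimal sketch of that "built-in" path, assuming RLlib 2.x's config API, that ale-py and the Atari ROMs are installed, and that RLlib's default DeepMind-style preprocessing (preprocessor_pref="deepmind") is left in place instead of the custom env_creator:

from ray.rllib.algorithms.ppo import PPOConfig

# Sketch only: pass the registered env id string and let RLlib apply its own
# Atari wrappers (84x84 grayscale, frame stack) and reward clipping.
# Note: RLlib's tuned examples use the "<Game>NoFrameskip-v4" ids; whether the
# "ALE/...-v5" id triggers the same auto-wrapping is an assumption here.
config = (
    PPOConfig()
    .environment("ALE/Alien-v5", clip_rewards=True)
    .framework("torch")
    .rollouts(num_rollout_workers=8, rollout_fragment_length=128)
    .training(
        lr=0.00025,
        gamma=0.99,
        lambda_=0.95,
        clip_param=0.1,
        entropy_coeff=0.01,
        num_sgd_iter=3,
        sgd_minibatch_size=32 * 8,
        train_batch_size=128 * 8,
    )
)
algo = config.build()
# then call algo.train() in a loop, as in the script above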

ArturNiederfahrenhorst commented 12 months ago
def env_creator(env_config):
    env = gym.make(f"ALE/{config.env}-v5", frameskip=1, render_mode='rgb_array')
    env = AtariPreprocessing(env, noop_max=0, scale_obs=True)
    env = TransformReward(env, lambda x: np.clip(x, -1, 1))
    env = FrameStack(env, 4)
    return env

As you noted, this is different from what RLlib does internally. RLlib ships a range of fine-tuned settings under the tuned_examples folder. What happens if you apply one of these (ray/rllib/tuned_examples/ppo/atari-ppo.yaml) to your problem? How do the results look then?
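
For anyone trying that suggestion from Python rather than through the rllib train CLI, here is a rough sketch; it assumes the usual tuned_examples yaml layout of {experiment_name: {run: ..., env: ..., stop: ..., config: {...}}} and that PyYAML is available, so treat it as a starting point only:

import yaml
from ray import air, tune

# Load the first (only) experiment defined in the tuned example file.
with open("ray/rllib/tuned_examples/ppo/atari-ppo.yaml") as f:
    exp = next(iter(yaml.safe_load(f).values()))

# Fold the env id into the param space and run it through Tune, roughly
# mirroring what the CLI does for such a file.
param_space = dict(exp["config"], env=exp["env"])
tuner = tune.Tuner(
    exp["run"],  # e.g. "PPO"
    param_space=param_space,
    run_config=air.RunConfig(stop=exp.get("stop")),
)
tuner.fit()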