thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

Unable to replicate original PPO performance #1157

Open rajfly opened 1 month ago

rajfly commented 1 month ago

I can’t seem to replicate the original PPO algorithm's performance when using Tianshou's PPO implementation. The hyperparameters used are listed below; they follow the hyperparameters discussed in an ICLR blog post that aims to replicate the results from the original PPO paper (without LSTM).

Hyperparameters

# Environment
Max Frames Per Episode = 108000
Frameskip = 4
Max Of Last 2 Frames = True
Max Steps Per Episode = 27000
Framestack = 4

Observation Type = Grayscale
Frame Size = 84 x 84

Max No Operation Actions = 30
Repeat Action Probability = 0.0

Terminal On Life Loss = True
Fire Action on Reset = True
Reward Clip = {-1, 0, 1}
Full Action Space = False

# Algorithm
Neural Network Feature Extractor = Nature CNN
Neural Network Policy Head = Linear Layer with n_actions output features
Neural Network Value Head = Linear Layer with 1 output feature
Shared Feature Extractor = True
Orthogonal Initialization = True
Scale Images to [0, 1] = True
Optimizer = Adam with 1e-5 Epsilon

Learning Rate = 2.5e-4
Decay Learning Rate = True

Number of Environments = 8
Number of Steps = 128
Batch Size = 256
Number of Minibatches = 4
Number of Epochs = 4
Gamma = 0.99
GAE Lambda = 0.95
Clip Range = 0.1
VF Clip Range = 0.1
Normalize Advantage = True
Entropy Coefficient = 0.01
VF Coefficient = 0.5
Max Gradient Normalization = 0.5
Use Target KL = False
Total Timesteps = 10000000
Log Interval = 1
Evaluation Episodes = 100
Deterministic Evaluation = False

Seed = Random
Number of Trials = 5
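
For reference, here is how the rollout and minibatch sizes implied by these settings fit together (a small sketch; the variable names are mine):

num_envs = 8                 # Number of Environments
num_steps = 128              # Number of Steps per environment per rollout
num_minibatches = 4
update_epochs = 4            # Number of Epochs (PPO update passes per rollout)
total_timesteps = 10_000_000

rollout_size = num_envs * num_steps                   # 1024 transitions collected per update
minibatch_size = rollout_size // num_minibatches      # 256, matching "Batch Size" above
num_updates = total_timesteps // rollout_size         # 9765 collect/update cycles
gradient_steps = num_updates * update_epochs * num_minibatches  # 156,240 minibatch updates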

I have tried these same hyperparameters with the Baselines, Stable Baselines3, and CleanRL implementations of the PPO algorithm and they all achieved the expected results. However, the Tianshou agent fails to train at all, as seen in the training curves below (Tianshou's PPO trials are shown in green). Am I missing something in my Tianshou configuration (see reproduction scripts) or is there a bug (or intentional discrepancy) in Tianshou's PPO implementation?

Tianshou training curves in green for 5 games when compared to other implementations

Screenshot 2024-05-31 at 9 03 57 PM

NOTE: The y-axis and x-axis represent mean reward and in-game frames (40 million in total, i.e., 10 million environment steps with a frameskip of 4), respectively.

Other Issues Found: For some games, such as Atlantis, BankHeist, or YarsRevenge, training can sometimes stop unexpectedly with the following error, though I am not entirely sure why:

Screenshot 2024-05-25 at 9 26 33 PM

Reproduction Scripts. Run command: python ppo_atari.py --gpu 0 --env Alien --trials 5

Main Script (ppo_atari.py):

import argparse
import json
import os
import pathlib
import time
import uuid

import numpy as np
import pandas as pd
import torch
from atari_network import DQN, layer_init, scale_obs
from atari_wrapper import make_atari_env
from common import TrainLogger
from tianshou.data import Collector, VectorReplayBuffer
from tianshou.policy import PPOPolicy
from tianshou.trainer import OnpolicyTrainer
from tianshou.utils.net.common import ActorCritic
from tianshou.utils.net.discrete import Actor, Critic
from torch import nn
from torch.distributions import Categorical, Distribution
from torch.optim.lr_scheduler import LambdaLR

def actor_init(layer):
    if isinstance(layer, nn.Linear):
        torch.nn.init.orthogonal_(layer.weight, 0.01)
        torch.nn.init.constant_(layer.bias, 0.0)

def critic_init(layer):
    if isinstance(layer, nn.Linear):
        torch.nn.init.orthogonal_(layer.weight, 1)
        torch.nn.init.constant_(layer.bias, 0.0)

def train_atari(args: argparse.Namespace):
    # make env
    env, train_envs, test_envs = make_atari_env(
        task=f"{args.env}NoFrameskip-v4", seed=args.seed, training_num=8, test_num=8
    )

    state_shape = env.observation_space.shape or env.observation_space.n
    action_shape = env.action_space.shape or env.action_space.n

    # seed
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)

    # model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    net = DQN(
        *state_shape,
        action_shape,
        device=device,
        features_only=True,
        output_dim=512,
        layer_init=layer_init,
    )
    net = scale_obs(net)
    actor = Actor(net, action_shape, softmax_output=False, device=device)
    critic = Critic(net, device=device)
    actor.last.apply(actor_init)
    critic.last.apply(critic_init)

    optim = torch.optim.Adam(
        ActorCritic(actor, critic).parameters(), lr=2.5e-4, eps=1e-5
    )

    # decay learning rate to 0 linearly
    step_per_collect = 128 * 8
    step_per_epoch = 128 * 8
    epoch = int(10000000 // (128 * 8))
    lr_scheduler = LambdaLR(optim, lr_lambda=lambda e: 1 - e / epoch)

    def dist(logits: torch.Tensor) -> Distribution:
        return Categorical(logits=logits)

    # policy
    policy: PPOPolicy = PPOPolicy(
        actor=actor,
        critic=critic,
        optim=optim,
        dist_fn=dist,
        action_space=env.action_space,
        eps_clip=0.1,
        dual_clip=None,
        value_clip=True,
        advantage_normalization=True,
        recompute_advantage=False,
        vf_coef=0.5,
        ent_coef=0.01,
        max_grad_norm=0.5,
        gae_lambda=0.95,
        discount_factor=0.99,
        reward_normalization=False,
        deterministic_eval=False,
        observation_space=env.observation_space,
        action_scaling=False,
        lr_scheduler=lr_scheduler,
    ).to(device)

    train_buffer = VectorReplayBuffer(
        128 * 8,
        buffer_num=len(train_envs),
        ignore_obs_next=True,
        save_only_last_obs=True,
        stack_num=4,
    )

    train_collector = Collector(
        policy, train_envs, train_buffer, exploration_noise=False
    )

    logger = TrainLogger(
        train_interval=128 * 8,
    )

    start_time = time.time()

    # train
    result = OnpolicyTrainer(
        policy=policy,
        max_epoch=epoch,
        batch_size=256,
        train_collector=train_collector,
        test_collector=None,
        buffer=None,
        step_per_epoch=step_per_epoch,
        repeat_per_collect=4,
        episode_per_test=0,
        update_per_step=1.0,
        step_per_collect=step_per_collect,
        episode_per_collect=None,
        train_fn=None,
        test_fn=None,
        stop_fn=None,
        save_best_fn=None,
        save_checkpoint_fn=None,
        resume_from_log=False,
        reward_metric=None,
        logger=logger,
        verbose=True,
        show_progress=True,
        test_in_train=False,
        save_fn=None,
    ).run()

    train_end_time = time.time()

    progress_df = pd.DataFrame(logger.progress_data)
    progress_df.to_csv(os.path.join(args.path, "progress.csv"), index=False)

    # eval
    policy.eval()
    test_collector = Collector(policy, test_envs, exploration_noise=False)
    result = test_collector.collect(n_episode=100)
    eval_end_time = time.time()
    args.eval_mean_reward = result.returns_stat.mean
    args.training_time_h = ((train_end_time - start_time) / 60) / 60
    args.total_time_h = ((eval_end_time - start_time) / 60) / 60

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-g",
        "--gpu",
        type=int,
        help="Specify GPU index",
        default=0,
    )
    parser.add_argument(
        "-e",
        "--env",
        type=str,
        help="Specify Atari or MuJoCo environment w/o version",
        default="Pong",
    )
    parser.add_argument(
        "-t",
        "--trials",
        type=int,
        help="Specify number of trials",
        default=5,
    )
    args = parser.parse_args()
    for _ in range(args.trials):
        args.id = uuid.uuid4().hex
        args.path = os.path.join("trials", "ppo", args.env, args.id)
        args.seed = int(time.time())

        # create dir
        pathlib.Path(args.path).mkdir(parents=True, exist_ok=True)

        # set gpu
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{args.gpu}"

        train_atari(args)

        # save trial info
        with open(os.path.join(args.path, "info.json"), "w") as f:
            json.dump(vars(args), f, indent=4)

Dependencies of Main Script (include these 3 scripts in the same directory as the main script):

atari_network.py

from collections.abc import Callable, Sequence
from typing import Any

import numpy as np
import torch
from tianshou.highlevel.env import Environments
from tianshou.highlevel.module.actor import ActorFactory
from tianshou.highlevel.module.core import TDevice
from tianshou.highlevel.module.intermediate import (
    IntermediateModule,
    IntermediateModuleFactory,
)
from tianshou.utils.net.discrete import Actor, NoisyLinear
from torch import nn

def layer_init(
    layer: nn.Module, std: float = np.sqrt(2), bias_const: float = 0.0
) -> nn.Module:
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

class ScaledObsInputModule(torch.nn.Module):
    def __init__(self, module: torch.nn.Module, denom: float = 255.0) -> None:
        super().__init__()
        self.module = module
        self.denom = denom
        # This is required such that the value can be retrieved by downstream modules (see usages of get_output_dim)
        self.output_dim = module.output_dim

    def forward(
        self,
        obs: np.ndarray | torch.Tensor,
        state: Any | None = None,
        info: dict[str, Any] | None = None,
    ) -> tuple[torch.Tensor, Any]:
        if info is None:
            info = {}
        return self.module.forward(obs / self.denom, state, info)

def scale_obs(module: nn.Module, denom: float = 255.0) -> nn.Module:
    return ScaledObsInputModule(module, denom=denom)

class DQN(nn.Module):
    """Reference: Human-level control through deep reinforcement learning.

    For advanced usage (how to customize the network), please refer to
    :ref:`build_the_network`.
    """

    def __init__(
        self,
        c: int,
        h: int,
        w: int,
        action_shape: Sequence[int],
        device: str | int | torch.device = "cpu",
        features_only: bool = False,
        output_dim: int | None = None,
        layer_init: Callable[[nn.Module], nn.Module] = lambda x: x,
    ) -> None:
        super().__init__()
        self.device = device
        self.net = nn.Sequential(
            layer_init(nn.Conv2d(c, 32, kernel_size=8, stride=4)),
            nn.ReLU(inplace=True),
            layer_init(nn.Conv2d(32, 64, kernel_size=4, stride=2)),
            nn.ReLU(inplace=True),
            layer_init(nn.Conv2d(64, 64, kernel_size=3, stride=1)),
            nn.ReLU(inplace=True),
            nn.Flatten(),
        )
        with torch.no_grad():
            self.output_dim = int(np.prod(self.net(torch.zeros(1, c, h, w)).shape[1:]))
        if not features_only:
            self.net = nn.Sequential(
                self.net,
                layer_init(nn.Linear(self.output_dim, 512)),
                nn.ReLU(inplace=True),
                layer_init(nn.Linear(512, int(np.prod(action_shape)))),
            )
            self.output_dim = np.prod(action_shape)
        elif output_dim is not None:
            self.net = nn.Sequential(
                self.net,
                layer_init(nn.Linear(self.output_dim, output_dim)),
                nn.ReLU(inplace=True),
            )
            self.output_dim = output_dim

    def forward(
        self,
        obs: np.ndarray | torch.Tensor,
        state: Any | None = None,
        info: dict[str, Any] | None = None,
    ) -> tuple[torch.Tensor, Any]:
        r"""Mapping: s -> Q(s, \*)."""
        if info is None:
            info = {}
        obs = torch.as_tensor(obs, device=self.device, dtype=torch.float32)
        return self.net(obs), state

class ActorFactoryAtariDQN(ActorFactory):
    def __init__(
        self,
        hidden_size: int | Sequence[int],
        scale_obs: bool,
        features_only: bool,
    ) -> None:
        self.hidden_size = hidden_size
        self.scale_obs = scale_obs
        self.features_only = features_only

    def create_module(self, envs: Environments, device: TDevice) -> Actor:
        net = DQN(
            *envs.get_observation_shape(),
            envs.get_action_shape(),
            device=device,
            features_only=self.features_only,
            output_dim=self.hidden_size,
            layer_init=layer_init,
        )
        if self.scale_obs:
            net = scale_obs(net)
        return Actor(
            net, envs.get_action_shape(), device=device, softmax_output=False
        ).to(device)

class IntermediateModuleFactoryAtariDQN(IntermediateModuleFactory):
    def __init__(self, features_only: bool = False, net_only: bool = False) -> None:
        self.features_only = features_only
        self.net_only = net_only

    def create_intermediate_module(
        self, envs: Environments, device: TDevice
    ) -> IntermediateModule:
        dqn = DQN(
            *envs.get_observation_shape(),
            envs.get_action_shape(),
            device=device,
            features_only=self.features_only,
        ).to(device)
        module = dqn.net if self.net_only else dqn
        return IntermediateModule(module, dqn.output_dim)

class IntermediateModuleFactoryAtariDQNFeatures(IntermediateModuleFactoryAtariDQN):
    def __init__(self) -> None:
        super().__init__(features_only=True, net_only=True)

atari_wrapper.py

# Borrow a lot from openai baselines:
# https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py
import logging
from collections import deque

import cv2
import gymnasium as gym
import numpy as np
from gymnasium import Env
from tianshou.highlevel.env import EnvFactoryRegistered, EnvMode, VectorEnvType
from tianshou.highlevel.trainer import EpochStopCallback, TrainingContext

log = logging.getLogger(__name__)

def _parse_reset_result(reset_result):
    contains_info = (
        isinstance(reset_result, tuple)
        and len(reset_result) == 2
        and isinstance(reset_result[1], dict)
    )
    if contains_info:
        return reset_result[0], reset_result[1], contains_info
    return reset_result, {}, contains_info

class NoopResetEnv(gym.Wrapper):
    """Sample initial states by taking random number of no-ops on reset.

    No-op is assumed to be action 0.

    :param gym.Env env: the environment to wrap.
    :param int noop_max: the maximum value of no-ops to run.
    """

    def __init__(self, env, noop_max=30) -> None:
        super().__init__(env)
        self.noop_max = noop_max
        self.noop_action = 0
        assert env.unwrapped.get_action_meanings()[0] == "NOOP"

    def reset(self, **kwargs):
        _, info, return_info = _parse_reset_result(self.env.reset(**kwargs))
        if hasattr(self.unwrapped.np_random, "integers"):
            noops = self.unwrapped.np_random.integers(1, self.noop_max + 1)
        else:
            noops = self.unwrapped.np_random.randint(1, self.noop_max + 1)
        for _ in range(noops):
            step_result = self.env.step(self.noop_action)
            if len(step_result) == 4:
                obs, rew, done, info = step_result
            else:
                obs, rew, term, trunc, info = step_result
                done = term or trunc
            if done:
                obs, info, _ = _parse_reset_result(self.env.reset())
        if return_info:
            return obs, info
        return obs

class MaxAndSkipEnv(gym.Wrapper):
    """Return only every `skip`-th frame (frameskipping) using most recent raw observations (for max pooling across time steps).

    :param gym.Env env: the environment to wrap.
    :param int skip: number of `skip`-th frame.
    """

    def __init__(self, env, skip=4) -> None:
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        """Step the environment with the given action.

        Repeat action, sum reward, and max over last observations.
        """
        obs_list, total_reward = [], 0.0
        new_step_api = False
        for _ in range(self._skip):
            step_result = self.env.step(action)
            if len(step_result) == 4:
                obs, reward, done, info = step_result
            else:
                obs, reward, term, trunc, info = step_result
                done = term or trunc
                new_step_api = True
            obs_list.append(obs)
            total_reward += reward
            if done:
                break
        max_frame = np.max(obs_list[-2:], axis=0)
        if new_step_api:
            return max_frame, total_reward, term, trunc, info

        return max_frame, total_reward, done, info

class EpisodicLifeEnv(gym.Wrapper):
    """Make end-of-life == end-of-episode, but only reset on true game over.

    It helps the value estimation.

    :param gym.Env env: the environment to wrap.
    """

    def __init__(self, env) -> None:
        super().__init__(env)
        self.lives = 0
        self.was_real_done = True
        self._return_info = False

    def step(self, action):
        step_result = self.env.step(action)
        if len(step_result) == 4:
            obs, reward, done, info = step_result
            new_step_api = False
        else:
            obs, reward, term, trunc, info = step_result
            done = term or trunc
            new_step_api = True

        self.was_real_done = done
        # check current lives, make loss of life terminal, then update lives to
        # handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if 0 < lives < self.lives:
            # for Qbert sometimes we stay in lives == 0 condition for a few
            # frames, so its important to keep lives > 0, so that we only reset
            # once the environment is actually done.
            done = True
            term = True
        self.lives = lives
        if new_step_api:
            return obs, reward, term, trunc, info
        return obs, reward, done, info

    def reset(self, **kwargs):
        """Calls the Gym environment reset, only when lives are exhausted.

        This way all states are still reachable even though lives are episodic, and
        the learner need not know about any of this behind-the-scenes.
        """
        if self.was_real_done:
            obs, info, self._return_info = _parse_reset_result(self.env.reset(**kwargs))
        else:
            # no-op step to advance from terminal/lost life state
            step_result = self.env.step(0)
            obs, info = step_result[0], step_result[-1]
        self.lives = self.env.unwrapped.ale.lives()
        if self._return_info:
            return obs, info
        return obs

class FireResetEnv(gym.Wrapper):
    """Take action on reset for environments that are fixed until firing.

    Related discussion: https://github.com/openai/baselines/issues/240.

    :param gym.Env env: the environment to wrap.
    """

    def __init__(self, env) -> None:
        super().__init__(env)
        assert env.unwrapped.get_action_meanings()[1] == "FIRE"
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def reset(self, **kwargs):
        _, _, return_info = _parse_reset_result(self.env.reset(**kwargs))
        obs = self.env.step(1)[0]
        return (obs, {}) if return_info else obs

class WarpFrame(gym.ObservationWrapper):
    """Warp frames to 84x84 as done in the Nature paper and later work.

    :param gym.Env env: the environment to wrap.
    """

    def __init__(self, env) -> None:
        super().__init__(env)
        self.size = 84
        self.observation_space = gym.spaces.Box(
            low=np.min(env.observation_space.low),
            high=np.max(env.observation_space.high),
            shape=(self.size, self.size),
            dtype=env.observation_space.dtype,
        )

    def observation(self, frame):
        """Returns the current observation from a frame."""
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        return cv2.resize(frame, (self.size, self.size), interpolation=cv2.INTER_AREA)

class ScaledFloatFrame(gym.ObservationWrapper):
    """Normalize observations to 0~1.

    :param gym.Env env: the environment to wrap.
    """

    def __init__(self, env) -> None:
        super().__init__(env)
        low = np.min(env.observation_space.low)
        high = np.max(env.observation_space.high)
        self.bias = low
        self.scale = high - low
        self.observation_space = gym.spaces.Box(
            low=0.0,
            high=1.0,
            shape=env.observation_space.shape,
            dtype=np.float32,
        )

    def observation(self, observation):
        return (observation - self.bias) / self.scale

class ClipRewardEnv(gym.RewardWrapper):
    """clips the reward to {+1, 0, -1} by its sign.

    :param gym.Env env: the environment to wrap.
    """

    def __init__(self, env) -> None:
        super().__init__(env)
        self.reward_range = (-1, 1)

    def reward(self, reward):
        """Bin reward to {+1, 0, -1} by its sign. Note: np.sign(0) == 0."""
        return np.sign(reward)

class FrameStack(gym.Wrapper):
    """Stack n_frames last frames.

    :param gym.Env env: the environment to wrap.
    :param int n_frames: the number of frames to stack.
    """

    def __init__(self, env, n_frames) -> None:
        super().__init__(env)
        self.n_frames = n_frames
        self.frames = deque([], maxlen=n_frames)
        shape = (n_frames, *env.observation_space.shape)
        self.observation_space = gym.spaces.Box(
            low=np.min(env.observation_space.low),
            high=np.max(env.observation_space.high),
            shape=shape,
            dtype=env.observation_space.dtype,
        )

    def reset(self, **kwargs):
        obs, info, return_info = _parse_reset_result(self.env.reset(**kwargs))
        for _ in range(self.n_frames):
            self.frames.append(obs)
        return (self._get_ob(), info) if return_info else self._get_ob()

    def step(self, action):
        step_result = self.env.step(action)
        if len(step_result) == 4:
            obs, reward, done, info = step_result
            new_step_api = False
        else:
            obs, reward, term, trunc, info = step_result
            new_step_api = True
        self.frames.append(obs)
        if new_step_api:
            return self._get_ob(), reward, term, trunc, info
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        # the original wrapper use `LazyFrames` but since we use np buffer,
        # it has no effect
        return np.stack(self.frames, axis=0)

def wrap_deepmind(
    env: Env,
    episode_life=True,
    clip_rewards=True,
    frame_stack=4,
    scale=False,
    warp_frame=True,
):
    """Configure environment for DeepMind-style Atari.

    The observation is channel-first: (c, h, w) instead of (h, w, c).

    :param env: the Atari environment to wrap.
    :param bool episode_life: wrap the episode life wrapper.
    :param bool clip_rewards: wrap the reward clipping wrapper.
    :param int frame_stack: wrap the frame stacking wrapper.
    :param bool scale: wrap the scaling observation wrapper.
    :param bool warp_frame: wrap the grayscale + resize observation wrapper.
    :return: the wrapped atari environment.
    """
    env = NoopResetEnv(env, noop_max=30)
    env = MaxAndSkipEnv(env, skip=4)
    if episode_life:
        env = EpisodicLifeEnv(env)
    if "FIRE" in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)
    if warp_frame:
        env = WarpFrame(env)
    if scale:
        env = ScaledFloatFrame(env)
    if clip_rewards:
        env = ClipRewardEnv(env)
    if frame_stack:
        env = FrameStack(env, frame_stack)
    return env

def make_atari_env(
    task,
    seed,
    training_num,
    test_num,
):
    """Wrapper function for Atari env.

    :return: a tuple of (single env, training envs, test envs).
    """
    env_factory = AtariEnvFactory(task, seed)
    envs = env_factory.create_envs(training_num, test_num)
    return envs.env, envs.train_envs, envs.test_envs

class AtariEnvFactory(EnvFactoryRegistered):
    def __init__(
        self,
        task: str,
        seed: int,
    ) -> None:
        assert "NoFrameskip-v4" in task
        super().__init__(
            task=task,
            seed=seed,
            venv_type=VectorEnvType.SUBPROC_SHARED_MEM,
            envpool_factory=None,
        )

    def create_env(self, mode: EnvMode) -> Env:
        env = super().create_env(mode)
        return wrap_deepmind(
            env,
            episode_life=True,
            clip_rewards=True,
            frame_stack=4,
            scale=False,
            warp_frame=True,
        )
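
For reference, applying wrap_deepmind directly to a raw Gymnasium Atari env looks roughly like this (a sketch only; the main script instead goes through AtariEnvFactory / make_atari_env):

import gymnasium as gym

# Depending on the installed gymnasium/ale-py versions, the ALE env IDs may need to be
# registered explicitly first (e.g. `import ale_py; gym.register_envs(ale_py)`).
env = gym.make("AlienNoFrameskip-v4")
env = wrap_deepmind(
    env, episode_life=True, clip_rewards=True, frame_stack=4, scale=False, warp_frame=True
)
obs, info = env.reset()  # obs is a (4, 84, 84) uint8 frame stack
obs, rew, term, trunc, info = env.step(env.action_space.sample())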

common.py

from collections import deque
from collections.abc import Callable

import numpy as np
import pandas as pd
from tianshou.utils.logger.base import VALID_LOG_VALS_TYPE, BaseLogger, DataScope

class TrainLogger(BaseLogger):
    """A logger that stores global step and running mean reward (train)."""

    def __init__(self, train_interval) -> None:
        super().__init__(train_interval, 0, 0, 0)
        self.progress_data = {"global_step": [], "mean_reward": []}
        self.reward_buffer = deque(maxlen=100)

    def log_train_data(self, log_data: dict, step: int) -> None:
        """Log step and mean reward.

        :param log_data: a dict containing the information returned by the collector during the train step.
        :param step: stands for the timestep the collector result is logged.
        """
        if step - self.last_log_train_step >= self.train_interval:
            self.reward_buffer.extend(log_data["returns"])
            mean_reward = (
                np.nan
                if len(self.reward_buffer) == 0
                else float(np.mean(self.reward_buffer))
            )
            self.progress_data["global_step"].append(step)
            self.progress_data["mean_reward"].append(mean_reward)
            self.last_log_train_step = step

    def write(
        self, step_type: str, step: int, data: dict[str, VALID_LOG_VALS_TYPE]
    ) -> None:
        pass

    def log_test_data(self, log_data: dict, step: int) -> None:
        pass

    def log_update_data(self, log_data: dict, step: int) -> None:
        pass

    def log_info_data(self, log_data: dict, step: int) -> None:
        pass

    def save_data(
        self,
        epoch: int,
        env_step: int,
        gradient_step: int,
        save_checkpoint_fn: Callable[[int, int, int], str] | None = None,
    ) -> None:
        pass

    def restore_data(self) -> tuple[int, int, int]:
        return 0, 0, 0
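
After training, each trial's learning curve can be read back from the CSVs written above, e.g. (my own sketch, using the trials/ppo/<env>/<id> layout created by the main script):

import glob

import pandas as pd

# One progress.csv per trial, each with "global_step" and "mean_reward" columns.
curves = [pd.read_csv(path) for path in glob.glob("trials/ppo/Alien/*/progress.csv")]
for df in curves:
    print(df["global_step"].iloc[-1], df["mean_reward"].iloc[-1])
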
dantp-ai commented 1 month ago

@rajfly Are you able to use examples.atari.atari_ppo.py, passing the arguments from your custom configuration, or do you need to build your own custom PPO trainer on Atari?

rajfly commented 1 month ago

@rajfly Are you able to use examples.atari.atari_ppo.py, passing the arguments from your custom configuration, or do you need to build your own custom PPO trainer on Atari?

@dantp-ai Hi, thanks for your prompt reply. In fact, I used examples.atari.atari_ppo.py as a base and modified it from there to fit the original PPO implementation. For example, in the original PPO implementation by Baselines, the two output heads in the PPO architecture are initialized like the convolutional layers, i.e., orthogonally, but with a gain of 0.01 for the policy head (and 1.0 for the value head) instead of np.sqrt(2). This is also done by Stable Baselines3 and CleanRL by default, and as far as I could tell, it is not done in Tianshou. Thus, I had to add this functionality myself, as seen in the actor_init() and critic_init() functions in ppo_atari.py above. So you might see some similarities between the code above and examples.atari.atari_ppo.py.
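
For reference, the Baselines/Stable Baselines3/CleanRL-style head initialization corresponds roughly to the following sketch (the hidden/action sizes are placeholders):

import numpy as np
import torch
from torch import nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weights and constant bias, as in atari_network.py above.
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

hidden_dim, n_actions = 512, 6  # placeholder sizes for illustration
policy_head = layer_init(nn.Linear(hidden_dim, n_actions), std=0.01)  # small gain for the policy head
value_head = layer_init(nn.Linear(hidden_dim, 1), std=1.0)            # unit gain for the value head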

TL;DR: No, I was not able to just use examples.atari.atari_ppo.py by passing in arguments, since the available arguments were insufficient; instead, I had to modify examples.atari.atari_ppo.py to better fit the original PPO implementation.

dantp-ai commented 1 month ago

Thanks! I will look into it and see how I can help.

rajfly commented 1 month ago

@dantp-ai Thanks for your help! Also, perhaps this might help narrow down the issue: I tested on 56 Atari games and Tianshou failed to learn anything at all on the majority of them; it only did well on very simple games (approximately 5 out of the 56), for example the Boxing game shown below with Tianshou in green. So the agent can learn and is able to compete with the implementations from Baselines, Stable Baselines3, and CleanRL, but only in very simple environments, which is odd.

Screenshot 2024-06-04 at 6 58 46 PM

MischaPanch commented 4 weeks ago

Thanks for reporting! It is of the highest priority for us to maintain good performance of the algorithms and examples (otherwise, what's the point ^^).

It seems the small performance tests that run in CI were not enough to catch this. I have been training PPO agents on MuJoCo with the current Tianshou version with no issues, so maybe the problem only affects discrete envs.

We will look into it ASAP. The first thing to clarify is whether the problem was caused by the recent refactorings, i.e., by going back to version 0.5.1 and running on Atari there.

Btw, before the 2.0.0 release of Tianshou we will implement #935 and #1110, as well as check in a script that reproduces the results currently displayed in the docs. From then on, all releases will be guaranteed to be free of performance regressions. At the moment we're not there yet.

dantp-ai commented 3 weeks ago

@rajfly Were you able to verify that the reward scalings and reward outputs are consistent across the experiments using the different RL libraries (OpenAI Baselines, Stable Baselines3, CleanRL)?

rajfly commented 3 weeks ago

@dantp-ai Yes. I used the same Atari wrappers for all of the experiments with the Atari games; thus, the rewards were only clipped, and they were clipped identically for all RL libraries. Furthermore, when comparing the reward outputs, I used statistical techniques such as stratified bootstrap confidence intervals (SBCI) to combat the stochasticity and obtain more accurate estimates. In particular, for each RL library tested, I ran 5 trials for each of the 56 Atari environments, for a total of 56 x 5 trials per RL library. I then took the mean reward over the last 100 training episodes as the score for a single trial and human-normalized it. The plot below compares the human-normalized scores attained by the different RL libraries across the 56 environments, using SBCI. The bands are 95% confidence intervals, and it can be seen that Baselines, Stable Baselines3, and CleanRL achieve consistent scores on most metrics (IQM refers to the interquartile mean).
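
A minimal sketch of the stratified-bootstrap computation (the function names and placeholder data are mine; libraries such as rliable implement the same idea):

import numpy as np

def stratified_bootstrap_ci(scores, aggregate, n_resamples=2000, alpha=0.05, seed=0):
    # scores: array of shape (n_games, n_trials) of human-normalized trial scores.
    rng = np.random.default_rng(seed)
    n_games, n_trials = scores.shape
    estimates = []
    for _ in range(n_resamples):
        # Resample trials with replacement independently within each game (stratum).
        idx = rng.integers(0, n_trials, size=(n_games, n_trials))
        estimates.append(aggregate(np.take_along_axis(scores, idx, axis=1)))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return aggregate(scores), (lo, hi)

def iqm(scores):
    # Interquartile mean: mean of the middle 50% of all trial scores.
    flat = np.sort(scores.reshape(-1))
    n = len(flat)
    return float(flat[n // 4 : n - n // 4].mean())

# Placeholder data: 56 games x 5 trials of fake human-normalized scores.
scores = np.random.default_rng(1).uniform(0.0, 2.0, size=(56, 5))
point_estimate, (ci_low, ci_high) = stratified_bootstrap_ci(scores, iqm)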

Screenshot 2024-06-13 at 6 32 49 PM