Closed korbinian-hoermann closed 1 year ago
This is an interesting example. But have you made sure that the OPPONENT_OBS in your postprocessing fn is built for the correct number of opponent agents? I'm only seeing you do this (loss not initialized yet):

```python
sample_batch[OPPONENT_OBS] = np.zeros_like(
    sample_batch[SampleBatch.CUR_OBS])
```

which is only zeros for a single opponent. That's why you get the shape error: you are passing in data for 1 opponent instead of 3.
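The mismatch can be seen with plain NumPy. All dimensions below are made-up stand-ins, not values from the example:

```python
import numpy as np

n_opponents = 3
T, obs_dim = 50, 4                 # illustrative batch length / obs size
cur_obs = np.zeros((T, obs_dim))   # stand-in for sample_batch[SampleBatch.CUR_OBS]

# Wrong: same shape as one agent's obs -> (50, 4)
one_agent = np.zeros_like(cur_obs)

# Needed: room for all opponents along the feature axis -> (50, 12)
all_opponents = np.zeros((cur_obs.shape[0], n_opponents * obs_dim))

print(one_agent.shape, all_opponents.shape)  # (50, 4) (50, 12)
```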
Hello sven1977, I also want to use PPO with a centralized critic, and I'm trying to customize the example. But I don't know which classes must be modified. Do the agents, action space, and state space need to be modified?
centralized_critic_postprocessing, loss_with_central_critic, setup_tf_mixins, central_vf_stats, LearningRateSchedule, EntropyCoeffSchedule, KLCoeffMixin, CentralizedValueMixin
I would like to know whether these functions and classes should be modified. Thank you!
Thank you @sven1977 ! I had to do some more adjustments in the postprocessing fn but it seems to work now. 👍
@zzchuman those are the functions I had to modify:
- the init function of CentralizedCriticModel(TFModelV2): adjust the inputs of the net to suit your agent's and the opponent agents' actions/observations
- centralized_critic_postprocessing, as sven mentioned above: build OPPONENT_OBS for the correct number of opponent agents.
- get_policy_class: using the tf framework, the original one raised an error, so I changed it to this:

```python
def get_policy_class(config):
    if config["framework"] == "torch":
        return CCPPOTorchPolicy
    else:
        return CCPPOTFPolicy
```
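For the first point, the required input width of the central value net is simple arithmetic. The following sketch uses made-up dimensions and variable names, none of which come from the example code:

```python
# Hypothetical sizing for a central VF input that concatenates the agent's
# own obs with every opponent's obs and (one-hot) action:
n_opponents = 3
own_obs_dim = 4     # assumed
opp_obs_dim = 4     # assumed
opp_act_dim = 5     # assumed one-hot width for Discrete(5)

central_vf_input_dim = own_obs_dim + n_opponents * (opp_obs_dim + opp_act_dim)
print(central_vf_input_dim)  # 4 + 3 * 9 = 31
```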
Hello korbin, congratulations! I have some questions:
There is a problem with the SampleBatch.concat_samples method: it generates the output sample batch in the shape ((num_agents - 1) * length_sample_batch, obs_dim), whereas what we want is a sample batch of shape (length_sample_batch, (num_agents - 1) * obs_dim). The initialization of opponent_obs/opponent_action should follow this dimension as well. I have written a new concat_batches function in the SampleBatch class to enable the use of a central critic for more than two agents; if the community wants, I can make a pull request for this change.
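The reshape described above can be sketched in plain NumPy (all dimensions are illustrative, not taken from the example):

```python
import numpy as np

T, n_opp, obs_dim = 50, 3, 4

# concat_samples stacks opponent batches along the time axis -> (150, 4)
stacked = np.arange(n_opp * T * obs_dim, dtype=float).reshape(n_opp * T, obs_dim)

# The centralized critic wants one row per timestep with all opponents'
# observations side by side along the feature axis -> (50, 12):
wanted = (stacked.reshape(n_opp, T, obs_dim)
                 .transpose(1, 0, 2)
                 .reshape(T, n_opp * obs_dim))

# Row t now holds opponent 0's obs at t, then opponent 1's, then opponent 2's.
print(wanted.shape)  # (50, 12)
```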
Currently attempting the exact same thing. Is it possible to restate the complete, correct implementation of your centralized critic with multiple agents? Maybe adapting the example to support multiple agents out of the box would ease this for many users.
Hello, has anyone implemented it?
What is the problem?
ray 1.0.1, Python 3.7, TensorFlow 2.3.1, Windows 10
Hi!
I am trying to solve the following environment with the MAPPO (PPO with a centralized critic)
Reward
For each time step an agent is not in its final position, it receives a reward of -1. For each time step an agent is in its final position, it receives a reward of 0.
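As a minimal sketch, the reward rule above can be written as follows (the function name and position representation are hypothetical, not from the environment code):

```python
def agent_reward(pos, goal):
    """-1 for each step the agent is not in its final position, 0 once it is."""
    return 0 if pos == goal else -1

print(agent_reward((1, 2), (1, 2)))  # 0
print(agent_reward((0, 0), (1, 2)))  # -1
```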
Actions
Observation For each agent an obs consists of:
Resulting in the following observation and action spaces of the environment:
```python
action_space = spaces.Discrete(5)
observation_space = spaces.Box(np.array([0., 0., 0., 0.]), np.array([1., 1., 1., 1.]))
```
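The actual observation contents are listed above; purely as a hypothetical sketch consistent with the Box bounds (the grid size and normalization scheme are assumptions), an observation could be built like this:

```python
import numpy as np

GRID_SIZE = 8  # assumed, not from the environment

def make_obs(agent_xy, goal_xy):
    # Hypothetical: own (x, y) plus goal (x, y), normalized into [0, 1]
    # to fit Box(low=[0,0,0,0], high=[1,1,1,1]).
    return np.array([*agent_xy, *goal_xy], dtype=np.float32) / (GRID_SIZE - 1)

obs = make_obs((2, 3), (7, 7))
print(obs.shape)  # (4,)
```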
One episode lasts for 50 time steps. The goal for all agents is to get into their final position (the cell with the same colour as the corresponding agent) as fast as possible and stay there until the episode ends.
I was able to solve this environment with 2 agents, following RLlib's centralized critic example.
In order to handle an increased number of agents, I made the following changes to the example code (see section "centralized critic model"; I am only using the TF version):
This results in the following error message:
Did anyone manage to adjust the number of agents in the centralized_critic.py example or has an idea what else I have to change?
Thank you in advance!
Cheers, Korbi :)
Reproduction (REQUIRED)