ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Error with centralised critic PPO for multiagent env (pettingzoo waterworld) #15363

Closed. george-skal closed this issue 3 years ago.

george-skal commented 3 years ago

Hi! I am using Ray 1.2.0 and Python 3.6 on Ubuntu 18.04.

I am trying a centralised critic PPO for the waterworld environment from PettingZoo[sisl]: https://www.pettingzoo.ml/sisl/waterworld.

The error I get is:

ValueError: Input 1 is incompatible with layer model_1: expected shape=(None, 968), found shape=(2000, 242)

but if, out of curiosity, I try to double the shape dim of that input layer by setting:

opp_obs = tf.keras.layers.Input(shape=(2*opp_obs_dim, ), name="opp_obs")

I get the error:

ValueError: Input 1 is incompatible with layer model_1: expected shape=(None, 1936), found shape=(None, 968)

which seems odd to me, since the found shape is now the correct one (opp_obs_dim = 968).

I think that the problem might be in the initialisation code (I used the solution from https://github.com/ray-project/ray/issues/8011):

# Policy hasn't been initialized yet, use zeros.
        sample_batch[OPPONENT_OBS] = np.zeros_like([np.zeros(obs_dim * (n_pursuers - 1))])
        sample_batch[OPPONENT_ACTION] = np.zeros_like([np.zeros(act_dim * (n_pursuers - 1))])
        ### I think I don't have to change this
        sample_batch[SampleBatch.VF_PREDS] = np.zeros_like(sample_batch[SampleBatch.REWARDS], dtype=np.float32)

The environment has 5 pursuers (agents), and the observation space shape of one agent is 242. I am new to RLlib, so I am not sure whether this is a bug or just something I don't understand; I would appreciate any help. Also, I am not using one_hot on the opponent actions since the action space is continuous, but I am not sure about this and would be happy if someone could clarify it, or point out other things I should change.
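For reference, here is a quick sanity check of how these sizes relate to the constants defined at the top of my script (just illustration, not part of the training code):

```python
n_pursuers = 5
n_sensors = 30
obs_dim = n_sensors * (5 + 3) + 2           # 242 features per agent
act_dim = 2
opp_obs_dim = obs_dim * (n_pursuers - 1)    # 968 -> width the central VF input layer expects
opp_acts_dim = act_dim * (n_pursuers - 1)   # 8
print(obs_dim, opp_obs_dim, opp_acts_dim)   # 242 968 8
```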

Also, @korbinian-hoermann, since my code is partly based on the issue you opened here https://github.com/ray-project/ray/issues/12851, please let me know if I am doing something wrong.

I get a similar error with torch.

My code:

import argparse
import numpy as np
import os
import ray
from ray import tune
from ray.rllib.agents.ppo.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy, KLCoeffMixin, \
    ppo_surrogate_loss as tf_loss
from ray.rllib.agents.ppo.ppo_torch_policy import PPOTorchPolicy, \
    KLCoeffMixin as TorchKLCoeffMixin, ppo_surrogate_loss as torch_loss
from ray.rllib.evaluation.postprocessing import compute_advantages, \
    Postprocessing
from ray.rllib.examples.env.two_step_game import TwoStepGame
from ray.rllib.examples.models.centralized_critic_models import \
    CentralizedCriticModel, TorchCentralizedCriticModel
from ray.rllib.models import ModelCatalog
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.policy.tf_policy import LearningRateSchedule, \
    EntropyCoeffSchedule
from ray.rllib.policy.torch_policy import LearningRateSchedule as TorchLR, \
    EntropyCoeffSchedule as TorchEntropyCoeffSchedule
from ray.rllib.utils.test_utils import check_learning_achieved
from ray.rllib.utils.tf_ops import explained_variance, make_tf_callable
from ray.rllib.utils.torch_ops import convert_to_torch_tensor
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.annotations import override
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.tune.registry import register_env
# from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from ray.rllib.env.pettingzoo_env import PettingZooEnv
from pettingzoo.sisl import waterworld_v3

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()

n_pursuers = 5
n_sensors = 30
obs_coord = n_sensors * (5 + 3)   # 3 for speed features enabled (default)
obs_dim = obs_coord + 2     # obs_dim size = 242 for 1 pursuer (agent)
act_dim = 2                 # act_dim size = 2 for 1 pursuer (agent)
################## TF model #################################################

class CentralizedCriticModel(TFModelV2):
    """Multi-agent model that implements a centralized value function."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super(CentralizedCriticModel, self).__init__(obs_space, action_space, num_outputs, model_config, name)

        # Base of the model
        self.model = FullyConnectedNetwork(obs_space, action_space, num_outputs, model_config, name)

        self.register_variables(self.model.variables())

        n_agents = n_pursuers  # ---> opp_obs and opp_acts now consist of 4 (n_pursuers - 1) different agents
        # obs = obs_dim
        # act = 2
        opp_obs_dim = obs_dim * (n_agents - 1)
        opp_acts_dim = act_dim * (n_agents - 1)

        # Central VF maps (obs, opp_obs, opp_act) -> vf_pred
        obs = tf.keras.layers.Input(shape=(obs_dim, ), name="obs")
        opp_obs = tf.keras.layers.Input(shape=(opp_obs_dim, ), name="opp_obs")
        opp_act = tf.keras.layers.Input(shape=(opp_acts_dim, ), name="opp_act")
        concat_obs = tf.keras.layers.Concatenate(axis=1)([obs, opp_obs, opp_act])
        central_vf_dense = tf.keras.layers.Dense(16, activation=tf.nn.tanh, name="c_vf_dense")(concat_obs)
        central_vf_out = tf.keras.layers.Dense(1, activation=None, name="c_vf_out")(central_vf_dense)
        self.central_vf = tf.keras.Model(inputs=[obs, opp_obs, opp_act], outputs=central_vf_out)

        self.register_variables(self.central_vf.variables)

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        return self.model.forward(input_dict, state, seq_lens)

    # def central_value_function(self, obs, opponent_obs, opponent_actions):
    #     return tf.reshape(
    #         self.central_vf([
    #             obs, opponent_obs,
    #             tf.one_hot(tf.cast(opponent_actions, tf.int32), 2)    # waterworld has 2 actions
    #         ]), [-1])
    def central_value_function(self, obs, opponent_obs, opponent_actions):
        return tf.reshape(
            self.central_vf([
                obs, opponent_obs, opponent_actions]), [-1])

    @override(ModelV2)
    def value_function(self):
        return self.model.value_function()  # not used

################## Torch model #################################################

class TorchCentralizedCriticModel(TorchModelV2, nn.Module):
    """Multi-agent model that implements a centralized VF."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        n_agents = n_pursuers  # ---> opp_obs and opp_acts now consist of 4 (n_pursuers - 1) different agents' information
        # obs = obs_dim
        # act = 2
        opp_obs_dim = obs_dim * (n_agents - 1)
        opp_acts_dim = act_dim * (n_agents - 1)

        # Base of the model
        self.model = TorchFC(obs_space, action_space, num_outputs,
                             model_config, name)

        # Central VF maps (obs, opp_obs, opp_act) -> vf_pred
        input_size = obs_dim + opp_obs_dim + opp_acts_dim  # obs + opp_obs + opp_act
        self.central_vf = nn.Sequential(
            SlimFC(input_size, 16, activation_fn=nn.Tanh),
            SlimFC(16, 1),
        )

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        model_out, _ = self.model(input_dict, state, seq_lens)
        return model_out, []

    # def central_value_function(self, obs, opponent_obs, opponent_actions):
    #     input_ = torch.cat([
    #         obs, opponent_obs,
    #         torch.nn.functional.one_hot(opponent_actions.long(), 2).float()
    #     ], 1)
    #     return torch.reshape(self.central_vf(input_), [-1])

    def central_value_function(self, obs, opponent_obs, opponent_actions):
        input_ = torch.cat([obs, opponent_obs, opponent_actions], 1)
        return torch.reshape(self.central_vf(input_), [-1])

    @override(ModelV2)
    def value_function(self):
        return self.model.value_function()  # not used
#######################################################################################################################

OPPONENT_OBS = "opponent_obs"
OPPONENT_ACTION = "opponent_action"

parser = argparse.ArgumentParser()
parser.add_argument("--torch", action="store_true")
parser.add_argument("--as-test", action="store_true")
parser.add_argument("--stop-iters", type=int, default=100)
parser.add_argument("--stop-timesteps", type=int, default=100000)
parser.add_argument("--stop-reward", type=float, default=7.99)

class CentralizedValueMixin:
    """Add method to evaluate the central value function from the model."""

    def __init__(self):
        if self.config["framework"] != "torch":
            self.compute_central_vf = make_tf_callable(self.get_session())(
                self.model.central_value_function)
        else:
            self.compute_central_vf = self.model.central_value_function

# Grabs the opponent obs/act and includes it in the experience train_batch,
# and computes GAE using the central vf predictions.
def centralized_critic_postprocessing(policy,
                                      sample_batch,
                                      other_agent_batches=None,
                                      episode=None):
    pytorch = policy.config["framework"] == "torch"
    if (pytorch and hasattr(policy, "compute_central_vf")) or \
            (not pytorch and policy.loss_initialized()):
        assert other_agent_batches is not None
        # [(_, opponent_batch)] = list(other_agent_batches.values())

        # ---> opponent batch now consists of 4 SampleBatches, so I concatenate them

        concat_opponent_batch = SampleBatch.concat_samples(
            [opponent_n_batch for _, opponent_n_batch in other_agent_batches.values()])
        opponent_batch = concat_opponent_batch

        # also record the opponent obs and actions in the trajectory
        sample_batch[OPPONENT_OBS] = opponent_batch[SampleBatch.CUR_OBS]
        sample_batch[OPPONENT_ACTION] = opponent_batch[SampleBatch.ACTIONS]

        # overwrite default VF prediction with the central VF
        if args.torch:
            sample_batch[SampleBatch.VF_PREDS] = policy.compute_central_vf(
                convert_to_torch_tensor(
                    sample_batch[SampleBatch.CUR_OBS], policy.device),
                convert_to_torch_tensor(
                    sample_batch[OPPONENT_OBS], policy.device),
                convert_to_torch_tensor(
                    sample_batch[OPPONENT_ACTION], policy.device)) \
                .cpu().detach().numpy()
        else:
            sample_batch[SampleBatch.VF_PREDS] = policy.compute_central_vf(
                sample_batch[SampleBatch.CUR_OBS], sample_batch[OPPONENT_OBS],
                sample_batch[OPPONENT_ACTION])
    else:

        # Policy hasn't been initialized yet, use zeros.
        sample_batch[OPPONENT_OBS] = np.zeros_like([np.zeros(obs_dim * (n_pursuers - 1))])
        sample_batch[OPPONENT_ACTION] = np.zeros_like([np.zeros(act_dim * (n_pursuers - 1))])
        ### I think I don't have to change this
        sample_batch[SampleBatch.VF_PREDS] = np.zeros_like(sample_batch[SampleBatch.REWARDS], dtype=np.float32)

    completed = sample_batch["dones"][-1]
    if completed:
        last_r = 0.0
    else:
        last_r = sample_batch[SampleBatch.VF_PREDS][-1]

    train_batch = compute_advantages(
        sample_batch,
        last_r,
        policy.config["gamma"],
        policy.config["lambda"],
        use_gae=policy.config["use_gae"])
    return train_batch

# Copied from PPO but optimizing the central value function.
def loss_with_central_critic(policy, model, dist_class, train_batch):
    CentralizedValueMixin.__init__(policy)
    func = tf_loss if not policy.config["framework"] == "torch" else torch_loss

    vf_saved = model.value_function
    model.value_function = lambda: policy.model.central_value_function(
        train_batch[SampleBatch.CUR_OBS], train_batch[OPPONENT_OBS],
        train_batch[OPPONENT_ACTION])

    policy._central_value_out = model.value_function()
    loss = func(policy, model, dist_class, train_batch)

    model.value_function = vf_saved

    return loss

def setup_tf_mixins(policy, obs_space, action_space, config):
    # Copied from PPOTFPolicy (w/o ValueNetworkMixin).
    KLCoeffMixin.__init__(policy, config)
    EntropyCoeffSchedule.__init__(policy, config["entropy_coeff"],
                                  config["entropy_coeff_schedule"])
    LearningRateSchedule.__init__(policy, config["lr"], config["lr_schedule"])

def setup_torch_mixins(policy, obs_space, action_space, config):
    # Copied from PPOTorchPolicy  (w/o ValueNetworkMixin).
    TorchKLCoeffMixin.__init__(policy, config)
    TorchEntropyCoeffSchedule.__init__(policy, config["entropy_coeff"],
                                       config["entropy_coeff_schedule"])
    TorchLR.__init__(policy, config["lr"], config["lr_schedule"])

def central_vf_stats(policy, train_batch, grads):
    # Report the explained variance of the central value function.
    return {
        "vf_explained_var": explained_variance(
            train_batch[Postprocessing.VALUE_TARGETS],
            policy._central_value_out),
    }

CCPPOTFPolicy = PPOTFPolicy.with_updates(
    name="CCPPOTFPolicy",
    postprocess_fn=centralized_critic_postprocessing,
    loss_fn=loss_with_central_critic,
    before_loss_init=setup_tf_mixins,
    grad_stats_fn=central_vf_stats,
    mixins=[
        LearningRateSchedule, EntropyCoeffSchedule, KLCoeffMixin,
        CentralizedValueMixin
    ])

CCPPOTorchPolicy = PPOTorchPolicy.with_updates(
    name="CCPPOTorchPolicy",
    postprocess_fn=centralized_critic_postprocessing,
    loss_fn=loss_with_central_critic,
    before_init=setup_torch_mixins,
    mixins=[
        TorchLR, TorchEntropyCoeffSchedule, TorchKLCoeffMixin,
        CentralizedValueMixin
    ])

def get_policy_class(config):
    if config["framework"] == "torch":
        return CCPPOTorchPolicy

CCTrainer = PPOTrainer.with_updates(
    name="CCPPOTrainer",
    default_policy=CCPPOTFPolicy,
    get_policy_class=get_policy_class,
)

if __name__ == "__main__":
    ray.init()
    args = parser.parse_args()

    def env_creator(args):
        return PettingZooEnv(waterworld_v3.env(n_pursuers=5, n_evaders=5, n_sensors=30))

    env = env_creator({})
    register_env("waterworld", env_creator)

    obs_space = env.observation_space
    action_space = env.action_space
    policies = {agent: (None, obs_space, action_space, {}) for agent in env.agents}

    ModelCatalog.register_custom_model(
        "cc_model", TorchCentralizedCriticModel
        if args.torch else CentralizedCriticModel)

    config = {
        "env": "waterworld",
        "batch_mode": "complete_episodes",
        # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
        "num_workers": 1,
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": (lambda agent_id: agent_id),
        },
        "model": {
            "custom_model": "cc_model",
        },
        "framework": "torch" if args.torch else "tf",
    }

    stop = {
        "training_iteration": args.stop_iters,
        "timesteps_total": args.stop_timesteps,
        "episode_reward_mean": args.stop_reward,
    }

    results = tune.run(CCTrainer, config=config, stop=stop, verbose=1)

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

Full error message:

/home/george/PycharmProjects/ray_venv_v1_2_0/venv/bin/python /home/george/PycharmProjects/ray_venv_v1_2_0/cen_critic.py
WARNING:tensorflow:From /home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-04-16 11:29:58,069 WARNING deprecation.py:34 -- DeprecationWarning: `ray.rllib.env.pettingzoo_env.PettingZooEnv` has been deprecated. Use `ray.rllib.env.wrappers.pettingzoo_env.PettingZooEnv` instead. This will raise an error in the future!
2021-04-16 11:29:58,595 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
== Status ==
Memory usage on this node: 6.0/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 2/8 CPUs, 0/0 GPUs, 0.0/6.59 GiB heap, 0.0/2.25 GiB objects
Result logdir: /home/george/ray_results/CCPPOTrainer_2021-04-16_11-30-01
Number of trials: 1/1 (1 RUNNING)

(pid=13245) WARNING:tensorflow:From /home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=13245) Instructions for updating:
(pid=13245) non-resource variables are not supported in the long term
(pid=13245) 2021-04-16 11:30:03,717 INFO trainer.py:616 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=13245) 2021-04-16 11:30:03,717 INFO trainer.py:643 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=13249) WARNING:tensorflow:From /home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=13249) Instructions for updating:
(pid=13249) non-resource variables are not supported in the long term
(pid=13249) 2021-04-16 11:30:06,005 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13249) 2021-04-16 11:30:06,082 WARNING deprecation.py:34 -- DeprecationWarning: `TFModelV2.register_variables` has been deprecated. This will raise an error in the future!
(pid=13249) 2021-04-16 11:30:06,849 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13249) 2021-04-16 11:30:07,939 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13249) 2021-04-16 11:30:09,244 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13249) 2021-04-16 11:30:10,700 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13245) 2021-04-16 11:30:12,491 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13245) 2021-04-16 11:30:12,577 WARNING deprecation.py:34 -- DeprecationWarning: `TFModelV2.register_variables` has been deprecated. This will raise an error in the future!
(pid=13245) 2021-04-16 11:30:13,427 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13245) 2021-04-16 11:30:14,608 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13245) 2021-04-16 11:30:15,878 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13245) 2021-04-16 11:30:17,377 WARNING deprecation.py:34 -- DeprecationWarning: `framestack` has been deprecated. Use `num_framestacks (int)` instead. This will raise an error in the future!
(pid=13245) 2021-04-16 11:30:33,513 INFO trainable.py:103 -- Trainable.setup took 29.812 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=13245) 2021-04-16 11:30:33,513 WARNING util.py:47 -- Install gputil for GPU system monitoring.
(pid=13249) 2021-04-16 11:30:33,521 WARNING deprecation.py:34 -- DeprecationWarning: `env_index` has been deprecated. Use `episode.env_id` instead. This will raise an error in the future!
2021-04-16 11:30:38,216 ERROR trial_runner.py:616 -- Trial CCPPOTrainer_waterworld_efa13_00000: Error processing event.
Traceback (most recent call last):
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::CCPPOTrainer.train_buffered() (pid=13245, ip=192.168.1.6)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 526, in train
    raise e
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 515, in train
    result = Trainable.train(self)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 148, in step
    res = next(self.train_exec_impl)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  [Previous line repeated 1 more time]
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 876, in apply_flatten
    for item in it:
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 828, in add_wait_hooks
    item = next(it)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  [Previous line repeated 1 more time]
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 471, in base_iterator
    yield ray.get(futures, timeout=timeout)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayTaskError(ValueError): ray::RolloutWorker.par_iter_next() (pid=13249, ip=192.168.1.6)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/util/iter.py", line 1152, in par_iter_next
    return next(self.local_it)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 327, in gen_rollouts
    yield self.sample()
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 662, in sample
    batches = [self.input_reader.next()]
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 95, in next
    batches = [self.get_data()]
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 224, in get_data
    item = next(self.rollout_provider)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 620, in _env_runner
    sample_collector=sample_collector,
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 1141, in _process_observations_w_trajectory_view_api
    build=not multiple_episodes_in_batch)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 669, in postprocess_episode
    post_batches[agent_id], other_batches, episode)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/policy/tf_policy_template.py", line 246, in postprocess_trajectory
    episode)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/cen_critic.py", line 205, in centralized_critic_postprocessing
    sample_batch[OPPONENT_ACTION])
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/rllib/utils/tf_ops.py", line 178, in call
    **kwargs_placeholders)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/cen_critic.py", line 90, in central_value_function
    obs, opponent_obs, opponent_actions]), [-1])
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 761, in __call__
    self.name)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/input_spec.py", line 274, in assert_input_compatibility
    ', found shape=' + display_shape(x.shape))
ValueError: Input 1 is incompatible with layer model_1: expected shape=(None, 968), found shape=(2000, 242)
== Status ==
Memory usage on this node: 7.0/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/6.59 GiB heap, 0.0/2.25 GiB objects
Result logdir: /home/george/ray_results/CCPPOTrainer_2021-04-16_11-30-01
Number of trials: 1/1 (1 ERROR)
Number of errored trials: 1
+-------------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------+
| Trial name                          |   # failures | error file                                                                                                                    |
|-------------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------|
| CCPPOTrainer_waterworld_efa13_00000 |            1 | /home/george/ray_results/CCPPOTrainer_2021-04-16_11-30-01/CCPPOTrainer_waterworld_efa13_00000_0_2021-04-16_11-30-01/error.txt |
+-------------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------+

== Status ==
Memory usage on this node: 7.0/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/6.59 GiB heap, 0.0/2.25 GiB objects
Result logdir: /home/george/ray_results/CCPPOTrainer_2021-04-16_11-30-01
Number of trials: 1/1 (1 ERROR)
Number of errored trials: 1
+-------------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------+
| Trial name                          |   # failures | error file                                                                                                                    |
|-------------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------|
| CCPPOTrainer_waterworld_efa13_00000 |            1 | /home/george/ray_results/CCPPOTrainer_2021-04-16_11-30-01/CCPPOTrainer_waterworld_efa13_00000_0_2021-04-16_11-30-01/error.txt |
+-------------------------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/cen_critic.py", line 348, in <module>
    results = tune.run(CCTrainer, config=config, stop=stop, verbose=1)
  File "/home/george/PycharmProjects/ray_venv_v1_2_0/venv/lib/python3.6/site-packages/ray/tune/tune.py", line 444, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [CCPPOTrainer_waterworld_efa13_00000])

Process finished with exit code 1

Thanks in advance!

Best, George

mvindiola1 commented 3 years ago

Hi @george-skal,

The issue is in centralized_critic_postprocessing: SampleBatch.concat_samples concatenates along the batch dimension.

print(SampleBatch.concat_samples(
    [opponent_n_batch for _, opponent_n_batch in other_agent_batches.values()])["obs"].shape)
(2000, 242)

This should do the trick:

sample_batch[OPPONENT_OBS] = np.concatenate(
    [opponent_batch[SampleBatch.CUR_OBS]
     for _, opponent_batch in other_agent_batches.values()], -1)
sample_batch[OPPONENT_ACTION] = np.concatenate(
    [opponent_batch[SampleBatch.ACTIONS]
     for _, opponent_batch in other_agent_batches.values()], -1)
george-skal commented 3 years ago

Hi @mvindiola1 ,

Thank you very much for your help. I tried your solution and now the code works with tensorflow, but not with torch. I am using the code from the example centralized_critic.py, and with torch I get an error similar to the previous one:

RuntimeError: Sizes of tensors must match except in dimension 1. Got 32 and 1 in dimension 0 (The offending index is 1)

I see that in the central_value_function I have in torch:

```
obs.shape:                  torch.Size([32, 242])
opponent_obs.shape:         torch.Size([1, 968])
opponent_actions.shape:     torch.Size([1, 8])
```

while in tensorflow they are:

```
(?, 242)
(?, 968)
(?, 8)
```

so maybe the problem is that the first dimension in torch is not None. Do you have any idea how to fix it?

Thanks in advance. Best regards, George

mvindiola1 commented 3 years ago

@george-skal,

Glad that worked for you. This time the error comes from the else branch, when the loss is being initialized to infer the trajectory view information. When you were creating the dummy opponent obs and actions, you forgot to include the batch dimension. This should fix the issue for you:

if (pytorch and hasattr(policy, "compute_central_vf")) or \
        (not pytorch and policy.loss_initialized()):
    ...
else:
    # Policy hasn't been initialized yet, use zeros.
    batch_size = sample_batch[SampleBatch.CUR_OBS].shape[0]
    sample_batch[OPPONENT_OBS] = np.zeros((batch_size, obs_dim * (n_pursuers - 1)))
    sample_batch[OPPONENT_ACTION] = np.zeros((batch_size, act_dim * (n_pursuers - 1)))
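With the batch dimension included, all three inputs to the central VF now share the same first dimension, which is what the torch concat was complaining about ("Got 32 and 1 in dimension 0"). A quick check with the same constants as in your script (a sketch, values only for illustration):

```python
import numpy as np

n_pursuers, obs_dim, act_dim = 5, 242, 2
batch_size = 32  # whatever dummy batch RLlib happens to use during loss init

opp_obs = np.zeros((batch_size, obs_dim * (n_pursuers - 1)))  # (32, 968)
opp_act = np.zeros((batch_size, act_dim * (n_pursuers - 1)))  # (32, 8)
print(opp_obs.shape, opp_act.shape)
```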
george-skal commented 3 years ago

Sorry for the late reply. These solutions work fine, so the issue can be closed.

Thanks, George

lyzyn commented 9 months ago

Hello, I am using Ray to build a custom centralized critic network. As a beginner, I have run into many doubts and problems. Do I still need to override ModelV2 when customizing? Thank you! Looking forward to your reply!

class CentralizedCriticModel(TFModelV2):
    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        return self.model.forward(input_dict, state, seq_lens)
    @override(ModelV2)
    def value_function(self):
        return self.model.value_function()  # not used