ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] Error on self-play with Simple_tag #34778

Open george-skal opened 1 year ago

george-skal commented 1 year ago

What happened + What you expected to happen

Hi, I am using a self-play scheme on simple_tag_v2 from PettingZoo. It works on a previous installation of ray_300_dev0 and on an old ray 1.2.0 (with the code modified for tune), but it raises an error on ray 2.3.1 and 2.4, and also if I install a fresh ray_300_dev0. It seems to be a problem with newer versions of some packages, since it works on the old ray_300_dev0 installation, but I can't find which ones. It does not seem to be related to PettingZoo, since I am using the same version. The error is:

File "/home/george/PycharmProjects/ray_240_venv/venv/lib/python3.10/site-packages/ray/rllib/evaluation/postprocessing.py", line 117, in compute_advantages (PPO pid=233365) delta_t = rollout[SampleBatch.REWARDS] + gamma * vpred_t[1:] - vpred_t[:-1] (PPO pid=233365) ValueError: operands could not be broadcast together with shapes (14,) (13,)

For weight sharing I use the method with deepcopy proposed here: https://discuss.ray.io/t/policy-weights-overwritten-in-self-play/2520, since there was [a bug](https://github.com/ray-project/ray/issues/16718) that I am not sure has been fixed. Could that be the problem?

Please also find attached the full error file: error.txt

Thanks, George

Versions / Dependencies

ray 2.4.0 (but also 2.3.0, 2.3.1)
torch 2.0.0
pettingzoo 1.22.3
supersuit 3.7.1
python 3.10

Reproduction script

from ray import air, tune
import ray
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from supersuit import pad_observations_v0
from pettingzoo.mpe import simple_tag_v2
from ray.rllib.algorithms.callbacks import DefaultCallbacks
import argparse
import numpy as np
import copy

M = 10  # Menagerie size

class MyCallbacks(DefaultCallbacks):

    def __init__(self):
        super(MyCallbacks, self).__init__()
        self.nan_counter = 0
        self.men = []
        self.men2 = []
        self.men_rewards = []

    def on_train_result(self, *, algorithm, result: dict, **kwargs):
        print(
            "Algorithm.train() result: {} -> {} episodes".format(
                algorithm, result["episodes_this_iter"]
            )
        )
        k = result['training_iteration']  # starts from 1

        # the "shared_policy_1" is the only agent being trained
        if np.isnan(result['episode_reward_mean']):
            # global men_start, nan_true
            # men_start = i
            self.nan_counter += 1  # count NaN results at the beginning
            pass
        else:
            if k <= M + self.nan_counter:
                # menagerie initialisation
                self.men.append(copy.deepcopy(algorithm.get_policy("shared_policy_1").get_weights()))
                self.men2.append(copy.deepcopy(algorithm.get_policy("shared_policy_2").get_weights()))

                weights = ray.put(algorithm.workers.local_worker().save())
                algorithm.workers.foreach_worker(
                    lambda w: w.restore(ray.get(weights))
                )

            else:
                self.men.pop(0)
                self.men2.pop(0)
                self.men.append(copy.deepcopy(algorithm.get_policy("shared_policy_1").get_weights()))
                self.men2.append(copy.deepcopy(algorithm.get_policy("shared_policy_2").get_weights()))

                sel = list(range(0, M))  # list index in python starts at 0
                # print("sel =", sel)

                choice = np.random.choice(sel)
                # print("choice is ", choice)

                algorithm.set_weights(
                    {"shared_policy_1": self.men[choice]  # overwrite with a snapshot sampled from the menagerie
                     })

                choice = np.random.choice(sel)
                # print("choice is ", choice)

                algorithm.set_weights(
                    {"shared_policy_2": self.men2[choice]  # overwrite with a snapshot sampled from the menagerie
                     })

                weights = ray.put(algorithm.workers.local_worker().save())
                algorithm.workers.foreach_worker(
                    lambda w: w.restore(ray.get(weights))
                )
            result["callback_ok"] = True

if __name__ == "__main__":

    for i in range(1, 2):

        def env_creator(args):
            env = simple_tag_v2.env(num_good=3, num_adversaries=6, num_obstacles=3, max_cycles=25)
            env = pad_observations_v0(env)
            return env

        register_env("simple_tag", lambda args: PettingZooEnv(env_creator(args)))

        test_env = PettingZooEnv(env_creator({}))

        obs_space = test_env.observation_space
        act_spc = test_env.action_space

        policies = {"shared_policy_1": (None, obs_space, act_spc, {}),
                    "shared_policy_2": (None, obs_space, act_spc, {})
                    # "pursuer_5": (None, obs_space, act_spc, {})
                    }

        policy_ids = list(policies.keys())

        def policy_mapping_fn(agent_id, episode, worker, **kwargs):
            if agent_id in ["agent_0", "agent_1", "agent_2"]:
                # print("agent_id", agent_id)
                return "shared_policy_1"
            else:
                # print("agent_id", agent_id)

                return "shared_policy_2"

        config = (
            PPOConfig()
            .environment("simple_tag")
            .resources(num_gpus=1, num_cpus_for_local_worker=8)
            .rollouts(num_rollout_workers=4)  # default = 2 (I should try it)
            .callbacks(MyCallbacks)
            .framework("torch")
            .multi_agent(
                policies=policies,
                policy_mapping_fn=policy_mapping_fn,
            )
        )

        tune.Tuner(
            "PPO",
            run_config=air.RunConfig(
                name="simple_tag 363 plain self play test trial {0}g".format(i),
                stop={"training_iteration": 1500},
                checkpoint_config=air.CheckpointConfig(
                    checkpoint_frequency=10,
                ),
            ),
            param_space=config.to_dict(),
        ).fit()

Issue Severity

High: It blocks me from completing my task.

ArturNiederfahrenhorst commented 1 year ago

Thanks for reporting this!

ArturNiederfahrenhorst commented 1 year ago

I can reproduce.

george-skal commented 1 year ago

Hi @ArturNiederfahrenhorst

I tried the method of copying weights that you have in the new self-play examples to see what happens, and it works only on CPU. When I use a GPU I get the error:

File "/home/george/PycharmProjects/ray_240_venv/venv/lib/python3.10/site-packages/torch/optim/adam.py", line 449, in _multi_tensor_adam torch._foreachaddcmul(device_exp_avg_sqs, device_grads, device_grads, 1 - beta2) RuntimeError: Expected scalars to be on CPU, got cuda:0 instead.

This happens on the 11th training iteration, which is the first iteration on which the callback swaps policy states (since M = 10). This error is also mentioned in https://github.com/ray-project/ray/issues/34159 for some other cases.

Please have a look and let me know if there is any workaround, because the only way I see is to use ray 1.11 or 1.12, but one of my environments only supports gymnasium, which requires ray > 2.2, so older versions are not an option. I also tried not using tune and running with .build(), but it gave the same error on GPU.
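
In case it is useful: this torch error is commonly reported when Adam's scalar 'step' state tensors end up on the GPU after the optimizer state is restored, which the default (non-capturable) Adam path does not accept. Below is a rough, untested sketch of a possible workaround; it assumes the torch policy keeps its optimizers in the private _optimizers attribute, which may differ between RLlib versions:

import torch

def move_adam_steps_to_cpu(policy):
    # Rough, untested helper: push Adam's scalar 'step' state back to the CPU
    # after restoring policy state, since torch's default (non-capturable)
    # Adam path expects these scalars on the CPU.
    # Assumes the torch policy exposes its optimizers via the private
    # `_optimizers` attribute; this may differ between RLlib versions.
    for opt in getattr(policy, "_optimizers", []):
        for param_state in opt.state.values():
            step = param_state.get("step")
            if isinstance(step, torch.Tensor) and step.is_cuda:
                param_state["step"] = step.cpu()

Calling something like move_adam_steps_to_cpu(algorithm.get_policy("shared_policy_1")) right after set_state might avoid the crash, but I have not verified it.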

Please find attached the code.

from ray import air, tune
import ray
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from supersuit import pad_observations_v0
from pettingzoo.mpe import simple_tag_v2
from ray.rllib.algorithms.callbacks import DefaultCallbacks
import argparse
import numpy as np
import copy

M = 10  # Menagerie size

class MyCallbacks(DefaultCallbacks):

    def __init__(self):
        super(MyCallbacks, self).__init__()
        self.nan_counter = 0
        self.men = []
        self.men2 = []
        self.men_rewards = []

    def on_train_result(self, *, algorithm, result: dict, **kwargs):
        print(
            "Algorithm.train() result: {} -> {} episodes".format(
                algorithm, result["episodes_this_iter"]
            )
        )
        k = result['training_iteration']  # starts from 1

        # the "shared_policy_1" is the only agent being trained
        if np.isnan(result['episode_reward_mean']):
            # global men_start, nan_true
            # men_start = i
            self.nan_counter += 1  # count NaN results at the beginning
            pass
        else:
            if k <= M + self.nan_counter:
                # menagerie initialisation
                self.men.append(algorithm.get_policy("shared_policy_1").get_state())
                self.men2.append(algorithm.get_policy("shared_policy_2").get_state())

            else:
                self.men.pop(0)
                self.men2.pop(0)
                self.men.append(algorithm.get_policy("shared_policy_1").get_state())
                self.men2.append(algorithm.get_policy("shared_policy_2").get_state())

                sel = list(range(0, M))  # list index in python starts at 0
                # print("sel =", sel)

                choice = np.random.choice(sel)
                # print("choice is ", choice)
                algorithm.get_policy("shared_policy_1").set_state(self.men[choice])

                choice = np.random.choice(sel)
                # print("choice is ", choice)

                algorithm.get_policy("shared_policy_2").set_state(self.men2[choice])

                algorithm.workers.sync_weights()
            result["callback_ok"] = True

if __name__ == "__main__":

    for i in range(1, 2):

        def env_creator(args):
            env = simple_tag_v2.env(num_good=3, num_adversaries=6, num_obstacles=3, max_cycles=25)
            env = pad_observations_v0(env)
            return env

        register_env("simple_tag", lambda args: PettingZooEnv(env_creator(args)))

        test_env = PettingZooEnv(env_creator({}))

        obs_space = test_env.observation_space
        act_spc = test_env.action_space

        policies = {"shared_policy_1": (None, obs_space, act_spc, {}),
                    "shared_policy_2": (None, obs_space, act_spc, {})
                    # "pursuer_5": (None, obs_space, act_spc, {})
                    }

        policy_ids = list(policies.keys())

        def policy_mapping_fn(agent_id, episode, worker, **kwargs):
            if agent_id in ["agent_0", "agent_1", "agent_2"]:
                # print("agent_id", agent_id)
                return "shared_policy_1"
            else:
                # print("agent_id", agent_id)

                return "shared_policy_2"

        config = (
            PPOConfig()
            .environment("simple_tag")
            .resources(num_gpus=1)
            .rollouts(num_rollout_workers=4)  # default = 2 (I should try it)
            .callbacks(MyCallbacks)
            .framework("torch")
            .multi_agent(
                policies=policies,
                policy_mapping_fn=policy_mapping_fn,
            )
        )

        tune.Tuner(
            "PPO",
            run_config=air.RunConfig(
                name="simple_tag 363 plain self play test trial {0}g".format(i),
                stop={"training_iteration": 1500},
                checkpoint_config=air.CheckpointConfig(
                    checkpoint_frequency=10,
                ),
            ),
            param_space=config.to_dict(),
        ).fit()

davidhozic commented 5 months ago

I can also still reproduce this on ray 2.9.3.

File "/home/davidhozic/.local/lib/python3.10/site-packages/ray/rllib/evaluation/postprocessing.py", line 204, in compute_gae_for_sample_batch batch = compute_advantages( File "/home/davidhozic/.local/lib/python3.10/site-packages/ray/rllib/evaluation/postprocessing.py", line 128, in compute_advantages delta_t = rewards + gamma * vpred_t[1:] - vpred_t[:-1] ValueError: operands could not be broadcast together with shapes (101,) (100,)

patrik-zori commented 2 months ago

I'm running into the same issue too: Ray 2.4.0, A3C and APPO algorithms, no self-play. Interestingly, it only seems to happen if I'm resuming training from a checkpoint, at the end of the first post-restore episode. It does not happen if I run the whole training loop without restoring from a checkpoint, nor do I see it with the DQN algorithm.
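
For anyone trying to reproduce this variant, a minimal sketch of the resume-from-checkpoint flow after which the error appears (the experiment path is a placeholder, not my exact setup):

from ray import tune

# Placeholder experiment directory; resuming a previous Tune run and continuing
# training is the step after which the broadcast error shows up.
tuner = tune.Tuner.restore(
    path="~/ray_results/my_appo_experiment",  # hypothetical checkpoint directory
    trainable="APPO",
)
tuner.fit()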