smorad / graph-conv-memory-paper

Graph convolutional memory for reinforcement learning

The RLlib example is broken with Ray 1.6.0? #1

Closed heng2j closed 2 years ago

heng2j commented 2 years ago

Hi Team,

I am trying to reproduce the results with your RLlib example on Ray/RLlib 1.6.0, but I am seeing serialization errors. Which RLlib version did your team use, and which Python version were you running? Starting with 1.6.0, RLlib requires Python 3.8+.

The given RLlib example:

import torch
import torch_geometric

from ray import tune
from ray.rllib.examples.env.stateless_cartpole import StatelessCartPole

from models.ray_graph import RayObsGraph
from models.edge_selectors.temporal import TemporalBackedge

our_gnn = torch_geometric.nn.Sequential(
    "x, adj, weights, B, N",
    [
        (torch_geometric.nn.DenseGraphConv(32, 32), "x, adj -> x"),
        (torch.nn.Tanh()),
        (torch_geometric.nn.DenseGraphConv(32, 32), "x, adj -> x"),
        (torch.nn.Tanh()),
    ],
)
ray_cfg = {
   "env": StatelessCartPole, # Replace this with your desired env
   "framework": "torch",
   "model": {
      "custom_model": RayObsGraph,
      "custom_model_config": {
         "gnn_input_size": 32,
         "gnn_output_size": 32,
         "gnn": our_gnn,
         "edge_selectors": TemporalBackedge([1])
      }
   }
}
tune.run("PPO", config=ray_cfg)

Serialization errors

== Status ==
Memory usage on this node: 12.2/125.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/64 CPUs, 0/2 GPUs, 0.0/73.03 GiB heap, 0.0/35.29 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: .../ray_results/PPO
Number of trials: 1/1 (1 PENDING)
+-----------------------------------+----------+-------+
| Trial name                        | status   | loc   |
|-----------------------------------+----------+-------|
| PPO_StatelessCartPole_d2ce0_00000 | PENDING  |       |
+-----------------------------------+----------+-------+

2021-10-01 14:46:04,966 ERROR trial_runner.py:773 -- Trial PPO_StatelessCartPole_d2ce0_00000: Error processing event.
Traceback (most recent call last):
  File "/.../lib/python3.8/site-packages/ray/tune/trial_runner.py", line 739, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/.../lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 746, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/.../lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/.../lib/python3.8/site-packages/ray/worker.py", line 1623, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=1114116, ip=10.1.14.58)
  Some of the input arguments for this task could not be computed:
ray.exceptions.RaySystemError: System error: No module named 'Sequential_d2ce07'
traceback: Traceback (most recent call last):
  File "/.../lib/python3.8/site-packages/ray/serialization.py", line 254, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/.../lib/python3.8/site-packages/ray/serialization.py", line 190, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/.../python3.8/site-packages/ray/serialization.py", line 168, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/.../python3.8/site-packages/ray/serialization.py", line 158, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
ModuleNotFoundError: No module named 'Sequential_d2ce07'
Result for PPO_StatelessCartPole_d2ce0_00000:
  {}

smorad commented 2 years ago

I'm currently running python-3.8.10, ray-1.6.0, torch-1.9.0, and torch_geometric-1.7.2 on my machine. This repo is frozen for the paper; I will add a ray wrapper to https://github.com/smorad/graph-conv-memory in a week or so.

smorad commented 2 years ago

A basic ray wrapper and test were added in https://github.com/smorad/graph-conv-memory/commit/4590012ca1bdbab78bf6ab71b3590683540fddf5, and the README was updated with a ray rllib example in https://github.com/smorad/graph-conv-memory/commit/22a4ac2dbdff7b995c02e25a4bcfed14c1fb6371. See if you can run the new unit test or example using your package versions.

heng2j commented 2 years ago

Thank you @smorad. With your latest updates and changes, I am able to pass the new unit tests and start training with Ray 1.6.0, and our package versions are the same. Thank you for your time and effort on this quick turnaround!

heng2j commented 2 years ago

Hi @smorad, sorry to bother you again. The good news is that we are able to train GCM for our task. The bad news is that we are not able to train it at scale: the serialization error persists if I try to train with Ray local mode set to False. Local mode forces all Ray functions to run in a single process, so we cannot make use of our additional compute resources. Do you have any suggestions for how to train GCM at scale?

smorad commented 2 years ago

Local mode is only used in the unit test to make it run faster. During training I have local_mode=False, usually with 8 worker processes and a single trainer process in Ray. This is all done on a single server; I have not tested across multiple servers.

Are you running on a single host or multiple hosts?
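For reference, a minimal sketch of that kind of run (illustrative values only; the env and stop condition are placeholders, not the exact config I use):

import ray
from ray import tune

ray.init(local_mode=False)  # local_mode=True is only used in the unit test
tune.run(
    "A2C",
    config={
        "framework": "torch",
        "env": "CartPole-v0",
        "num_workers": 8,  # 8 rollout workers feeding a single trainer process
    },
    stop={"training_iteration": 1},
)
ray.shutdown()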

heng2j commented 2 years ago

I am running on a single host.

Here is a sample test for reproducibility:

def test_Ray_gcm():
    hidden = 32
    graph_size = 32
    ray.init(
        local_mode=False,
        object_store_memory=3e10,
    )
    dgc = torch_geometric.nn.Sequential(
        "x, adj, weights, B, N",
        [
            # Mean and sum aggregation perform roughly the same
            # Preprocessor with 1 layer did not help
            (torch_geometric.nn.DenseGraphConv(hidden, hidden), "x, adj -> x"),
            (torch.nn.Tanh()),
            (torch_geometric.nn.DenseGraphConv(hidden, hidden), "x, adj -> x"),
            (torch.nn.Tanh()),
        ],
    )
    cfg = {
        "framework": "torch",
        "num_gpus": 2,
        "seed": 3,
        "env": "CartPole-v0",
        "num_workers": 8,
        "model": {
            "custom_model": RayDenseGCM,
            "custom_model_config": {
                "graph_size": graph_size,
                "gnn_input_size": hidden,
                "gnn_output_size": hidden,
                "gnn": dgc,
                "edge_weights": False,
                "edge_selectors": "TemporalBackedge",
                "edge_selectors_params": {
                    "hops": [1],
                    "direction": "forward",
                    "learned": False,
                    "learning_window": 10,
                    "deterministic": False,
                    "num_samples": 3,
                },
            },
            "max_seq_len": graph_size + 1,
        },
    }

    # Note: using A2C is much faster for testing than PPO
    tune.run("A2C", config=cfg, stop={"info/num_steps_trained": 100})

    ray.shutdown()

heng2j commented 2 years ago

FYI, GCM is giving us a significant performance improvement for our task when trained in local mode, so we would like to train it at scale to have an apples-to-apples comparison.

smorad commented 2 years ago

The config options for GCM are not strings, but objects. I modified your example, and the following works for me:

import sys
import torch
import ray
import torch_geometric
from ray import tune
from gcm.ray_gcm import RayDenseGCM
from gcm.edge_selectors.temporal import TemporalBackedge

def test_Ray_gcm():
    print('Ray version', ray.__version__)
    print('Torch version', torch.__version__)
    print('Torch geometric version', torch_geometric.__version__)
    print('Python version', sys.version_info)
    hidden = 32
    graph_size = 32
    ray.init(
        local_mode=False,
        object_store_memory=3e10,
    )
    dgc = torch_geometric.nn.Sequential(
        "x, adj, weights, B, N",
        [
            # Mean and sum aggregation perform roughly the same
            # Preprocessor with 1 layer did not help
            (torch_geometric.nn.DenseGraphConv(hidden, hidden), "x, adj -> x"),
            (torch.nn.Tanh()),
            (torch_geometric.nn.DenseGraphConv(hidden, hidden), "x, adj -> x"),
            (torch.nn.Tanh()),
        ],
    )
    cfg = {
        "framework": "torch",
        "num_gpus": 2,
        "seed": 3,
        "env": "CartPole-v0",
        "num_workers": 8,
        "model": {
            "custom_model": RayDenseGCM,
            "custom_model_config": {
                "graph_size": graph_size,
                "gnn_input_size": hidden,
                "gnn_output_size": hidden,
                "gnn": dgc,
                "edge_weights": False,
                "edge_selectors": TemporalBackedge([1]),
            },
            "max_seq_len": graph_size + 1,
        },
    }

    # Note: using A2C is much faster for testing than PPO
    tune.run("A2C", config=cfg, stop={"info/num_steps_trained": 100})

    ray.shutdown()

test_Ray_gcm()

This produces the following output for me:

root@64f1bf4c34b3:~/vnav/src# python3 test.py
Ray version 1.6.0
Torch version 1.9.0+cu111
Torch geometric version 1.7.2
Python version sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
2021-10-05 12:14:01,333 INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8266
== Status ==
Memory usage on this node: 47.2/187.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/96 CPUs, 0/4 GPUs, 0.0/116.42 GiB heap, 0.0/27.94 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/A2C
Number of trials: 1/1 (1 PENDING)
+-----------------------------+----------+-------+
| Trial name                  | status   | loc   |
|-----------------------------+----------+-------|
| A2C_CartPole-v0_ba413_00000 | PENDING  |       |
+-----------------------------+----------+-------+

2021-10-05 12:14:02,884 ERROR syncer.py:72 -- Log sync requires rsync to be installed.
(pid=4819) 2021-10-05 12:14:05,203      INFO trainer.py:726 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=4800) Full GCM network is: RayDenseGCM(
(pid=4800)   (gcm): DenseGCM(
(pid=4800)     (preprocessor): Linear(in_features=4, out_features=32, bias=True)
(pid=4800)     (gnn): Sequential(
(pid=4800)       (0): DenseGraphConv(32, 32)
(pid=4800)       (1): Tanh()
(pid=4800)       (2): DenseGraphConv(32, 32)
(pid=4800)       (3): Tanh()
(pid=4800)     )
(pid=4800)     (edge_selectors): TemporalBackedge()
(pid=4800)   )
(pid=4800)   (logit_branch): SlimFC(
(pid=4800)     (_model): Sequential(
(pid=4800)       (0): Linear(in_features=32, out_features=2, bias=True)
(pid=4800)     )

...

Using FIFO scheduling algorithm.
Resources requested: 0/96 CPUs, 0/4 GPUs, 0.0/116.42 GiB heap, 0.0/27.94 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/A2C
Number of trials: 1/1 (1 TERMINATED)
+-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
| Trial name                  | status     | loc   |   iter |   total time (s) |   ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------|
| A2C_CartPole-v0_ba413_00000 | TERMINATED |       |      1 |         0.563771 |  320 |  15.6667 |                   21 |                   10 |            15.6667 |
+-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------+

== Status ==
Memory usage on this node: 54.2/187.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/96 CPUs, 0/4 GPUs, 0.0/116.42 GiB heap, 0.0/27.94 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/A2C
Number of trials: 1/1 (1 TERMINATED)
+-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
| Trial name                  | status     | loc   |   iter |   total time (s) |   ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------|
| A2C_CartPole-v0_ba413_00000 | TERMINATED |       |      1 |         0.563771 |  320 |  15.6667 |                   21 |                   10 |            15.6667 |
+-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------+

2021-10-05 12:14:16,102 INFO tune.py:561 -- Total run time: 13.63 seconds (13.32 seconds for the tuning loop).

smorad commented 2 years ago

Your error message suggests some sort of pytorch/pytorch_geometric/ray issue. Are you executing ray in anaconda or something? Is it possible that the worker processes do not have the same Python libraries available as your parent process? That would explain why this works in local mode but not in distributed mode.
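Something like this quick check (just a sketch, not from the repo) would show whether a worker process imports the same library versions as the driver:

import sys
import ray
import torch
import torch_geometric

@ray.remote
def worker_versions():
    # Re-import inside the task so this reports what the worker process sees
    import sys, torch, torch_geometric
    return (sys.executable, torch.__version__, torch_geometric.__version__)

ray.init(local_mode=False)
print("driver:", (sys.executable, torch.__version__, torch_geometric.__version__))
print("worker:", ray.get(worker_versions.remote()))
ray.shutdown()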

heng2j commented 2 years ago

Thank you @smorad. Yeah, I actually modified ray_gcm.py so I can pass the edge selector params from a config file:

# GCM edge selectors
from ray_graph_conv_memory.gcm.edge_selectors.temporal import TemporalBackedge

class RayDenseGCM(TorchModelV2, nn.Module):

    EDGE_SELECTORS = {"TemporalBackedge": TemporalBackedge}
...

        # Build the edge selector from the string name and params in the config
        self.cfg["edge_selectors"] = self.EDGE_SELECTORS[
            custom_model_kwargs["edge_selectors"]
        ](**custom_model_kwargs["edge_selectors_params"])

And I got the same error when I tried to revert my changes and run your example above.

Here are my package versions. I noticed that I am using Torch geometric version 2.0.1 instead of 1.7.2 like yours, so I may try downgrading Torch geometric to see if this issue persists.

Ray version 1.6.0
Torch version 1.9.0+cu111
Torch geometric version 2.0.1
Python version sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)

For your suggestion on package discrepancies between the workers and the parent process: I pip installed ray into a conda env and ran my test in the VS Code debugger.

I think another issue could be that I hand-picked your recent changes and copied them into my project to fit its directory structure, so I will double-check to ensure I have all the updated code.

heng2j commented 2 years ago

Hi @smorad, okay, that was it. The issue was Torch geometric version 2.0.1. After I downgraded to 1.7.2, the example works.
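For anyone else who hits this, a quick sanity check like the following (just a sketch; the 1.7.2 pin is simply what worked for us with Ray 1.6.0) catches the mismatch before launching a run:

import torch_geometric

# In our setup, torch_geometric 2.0.1 failed to deserialize the generated
# Sequential model on Ray workers, while 1.7.2 worked with Ray 1.6.0.
if not torch_geometric.__version__.startswith("1.7"):
    raise RuntimeError(
        f"torch_geometric {torch_geometric.__version__} found; "
        "downgrade with `pip install torch-geometric==1.7.2`"
    )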

heng2j commented 2 years ago

Thank you so much for your help @smorad!