Closed heng2j closed 2 years ago
I'm currently running python-3.8.10, ray-1.6.0, torch-1.9.0, and torch_geometric-1.7.2 on my machine. This repo is frozen for the paper. I will add a ray wrapper to https://github.com/smorad/graph-conv-memory in a week or so.
A basic ray wrapper and test were added in https://github.com/smorad/graph-conv-memory/commit/4590012ca1bdbab78bf6ab71b3590683540fddf5 and the README was updated with a ray rllib example in https://github.com/smorad/graph-conv-memory/commit/22a4ac2dbdff7b995c02e25a4bcfed14c1fb6371. See if you can run the new unit test or example using your package versions.
Thank you @smorad. With your latest updates and changes, I am able to pass the new unit tests and start training with Ray 1.6.0, and our package versions are the same. Thank you for your time and effort, and for the quick turnaround!
Hi @smorad, sorry for bothering you again. The good news is that we are able to train GCM for our task. The bad news is that we are not able to train it at scale: the serialization error persists when I train with Ray local mode set to False. Local mode forces all Ray functions to run in a single process, so we cannot use our additional compute resources. Do you have any suggestions for how to train GCM at scale?
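One way to narrow down a serialization error that only appears with local mode off is to check which config values fail to pickle before handing the config to Ray. The sketch below is a rough diagnostic, not part of GCM or Ray: `find_unpicklable` is a hypothetical helper, and it uses the standard library `pickle`, which is stricter than the cloudpickle-based serializer Ray uses, so it can report false positives.

```python
import pickle
import threading
from typing import Any, List


def find_unpicklable(value: Any, path: str = "config") -> List[str]:
    """Return the paths of leaf values that fail a stdlib pickle attempt.

    Hypothetical helper: nested dicts are walked key by key, every other
    value is treated as a leaf and pickled on its own.
    """
    if isinstance(value, dict):
        bad: List[str] = []
        for key, item in value.items():
            bad.extend(find_unpicklable(item, f"{path}[{key!r}]"))
        return bad
    try:
        pickle.dumps(value)
    except Exception:  # pickle raises several unrelated exception types
        return [path]
    return []


if __name__ == "__main__":
    # A lock stands in for any object that cannot cross process boundaries.
    demo_cfg = {
        "num_workers": 8,
        "model": {"custom_model_config": {"gnn": threading.Lock()}},
    }
    print(find_unpicklable(demo_cfg))
```

Running it against the real `cfg` dict (with the `gnn` module and edge selector objects in place) would point at the entry to investigate first, under the assumption that the failure comes from pickling a config value at all.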
Local mode is just in the unit test to make it run faster. During training, I have local_mode=False. I usually have 8 worker processes and a single trainer process in Ray. This is all done on a single server; I have not tested across multiple servers.
Are you running on a single host or multiple hosts?
I am running on a single host.
And here is the sample testing demo for reproducibility:
def test_Ray_gcm():
    hidden = 32
    graph_size = 32
    ray.init(
        local_mode=False,
        object_store_memory=3e10,
    )
    dgc = torch_geometric.nn.Sequential(
        "x, adj, weights, B, N",
        [
            # Mean and sum aggregation perform roughly the same
            # Preprocessor with 1 layer did not help
            (torch_geometric.nn.DenseGraphConv(hidden, hidden), "x, adj -> x"),
            (torch.nn.Tanh()),
            (torch_geometric.nn.DenseGraphConv(hidden, hidden), "x, adj -> x"),
            (torch.nn.Tanh()),
        ],
    )
    cfg = {
        "framework": "torch",
        "num_gpus": 2,
        "seed": 3,
        "env": "CartPole-v0",
        "num_workers": 8,
        "model": {
            "custom_model": RayDenseGCM,
            "custom_model_config": {
                "graph_size": graph_size,
                "gnn_input_size": hidden,
                "gnn_output_size": hidden,
                "gnn": dgc,
                "edge_weights": False,
                "edge_selectors": "TemporalBackedge",
                "edge_selectors_params": {
                    "hops": [1],
                    "direction": "forward",
                    "learned": False,
                    "learning_window": 10,
                    "deterministic": False,
                    "num_samples": 3,
                },
            },
            "max_seq_len": graph_size + 1,
        },
    }
    # Note: using A2C is much faster for testing than PPO
    tune.run("A2C", config=cfg, stop={"info/num_steps_trained": 100})
    ray.shutdown()
FYI, GCM gives us a significant performance improvement on our task when trained in local mode, so we would like to train it at scale to have an apples-to-apples comparison.
The config options for GCM are not strings, but objects. I modified your example and the following works for me
import sys
import torch
import ray
import torch_geometric
from ray import tune
from gcm.ray_gcm import RayDenseGCM
from gcm.edge_selectors.temporal import TemporalBackedge


def test_Ray_gcm():
    print('Ray version', ray.__version__)
    print('Torch version', torch.__version__)
    print('Torch geometric version', torch_geometric.__version__)
    print('Python version', sys.version_info)
    hidden = 32
    graph_size = 32
    ray.init(
        local_mode=False,
        object_store_memory=3e10,
    )
    dgc = torch_geometric.nn.Sequential(
        "x, adj, weights, B, N",
        [
            # Mean and sum aggregation perform roughly the same
            # Preprocessor with 1 layer did not help
            (torch_geometric.nn.DenseGraphConv(hidden, hidden), "x, adj -> x"),
            (torch.nn.Tanh()),
            (torch_geometric.nn.DenseGraphConv(hidden, hidden), "x, adj -> x"),
            (torch.nn.Tanh()),
        ],
    )
    cfg = {
        "framework": "torch",
        "num_gpus": 2,
        "seed": 3,
        "env": "CartPole-v0",
        "num_workers": 8,
        "model": {
            "custom_model": RayDenseGCM,
            "custom_model_config": {
                "graph_size": graph_size,
                "gnn_input_size": hidden,
                "gnn_output_size": hidden,
                "gnn": dgc,
                "edge_weights": False,
                "edge_selectors": TemporalBackedge([1]),
            },
            "max_seq_len": graph_size + 1,
        },
    }
    # Note: using A2C is much faster for testing than PPO
    tune.run("A2C", config=cfg, stop={"info/num_steps_trained": 100})
    ray.shutdown()


test_Ray_gcm()
This produces the following output for me
root@64f1bf4c34b3:~/vnav/src# python3 test.py [264/264]
Ray version 1.6.0
Torch version 1.9.0+cu111
Torch geometric version 1.7.2
Python version sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
2021-10-05 12:14:01,333 INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8266
== Status ==
Memory usage on this node: 47.2/187.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/96 CPUs, 0/4 GPUs, 0.0/116.42 GiB heap, 0.0/27.94 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/A2C
Number of trials: 1/1 (1 PENDING)
+-----------------------------+----------+-------+
| Trial name | status | loc |
|-----------------------------+----------+-------|
| A2C_CartPole-v0_ba413_00000 | PENDING | |
+-----------------------------+----------+-------+
2021-10-05 12:14:02,884 ERROR syncer.py:72 -- Log sync requires rsync to be installed.
(pid=4819) 2021-10-05 12:14:05,203 INFO trainer.py:726 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=4800) Full GCM network is: RayDenseGCM(
(pid=4800) (gcm): DenseGCM(
(pid=4800) (preprocessor): Linear(in_features=4, out_features=32, bias=True)
(pid=4800) (gnn): Sequential(
(pid=4800) (0): DenseGraphConv(32, 32)
(pid=4800) (1): Tanh()
(pid=4800) (2): DenseGraphConv(32, 32)
(pid=4800) (3): Tanh()
(pid=4800) )
(pid=4800) (edge_selectors): TemporalBackedge()
(pid=4800) )
(pid=4800) (logit_branch): SlimFC(
(pid=4800) (_model): Sequential(
(pid=4800) (0): Linear(in_features=32, out_features=2, bias=True)
(pid=4800) )
...
Using FIFO scheduling algorithm.
Resources requested: 0/96 CPUs, 0/4 GPUs, 0.0/116.42 GiB heap, 0.0/27.94 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/A2C
Number of trials: 1/1 (1 TERMINATED)
+-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------|
| A2C_CartPole-v0_ba413_00000 | TERMINATED | | 1 | 0.563771 | 320 | 15.6667 | 21 | 10 | 15.6667 |
+-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
== Status ==
Memory usage on this node: 54.2/187.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/96 CPUs, 0/4 GPUs, 0.0/116.42 GiB heap, 0.0/27.94 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/A2C
Number of trials: 1/1 (1 TERMINATED)
+-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------|
| A2C_CartPole-v0_ba413_00000 | TERMINATED | | 1 | 0.563771 | 320 | 15.6667 | 21 | 10 | 15.6667 |
+-----------------------------+------------+-------+--------+------------------+------+----------+----------------------+----------------------+--------------------+
2021-10-05 12:14:16,102 INFO tune.py:561 -- Total run time: 13.63 seconds (13.32 seconds for the tuning loop).
Your error message suggests some sort of pytorch/pytorch_geometric/ray issue. Are you executing ray in anaconda or something? Is it possible that the worker processes do not have the same python libraries available as your parent process? This would explain why this works in local mode but not distributed mode.
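A quick way to test the worker-versus-parent hypothesis is to collect package versions in both processes and compare them. This is a minimal sketch under stated assumptions: `report_versions` and `diff_versions` are hypothetical helper names, and the distribution names passed in (for example `torch-geometric`) are assumed to match what pip installed.

```python
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple


def report_versions(packages: Iterable[str]) -> Dict[str, str]:
    """Collect the interpreter and package versions visible to this process."""
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def diff_versions(
    driver: Dict[str, str], worker: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (driver_version, worker_version)} for every mismatch."""
    names = sorted(set(driver) | set(worker))
    return {
        name: (driver.get(name), worker.get(name))
        for name in names
        if driver.get(name) != worker.get(name)
    }


# With Ray initialised and local_mode=False, the same function can be run in
# a worker process and compared against the driver:
#   packages = ["ray", "torch", "torch-geometric"]
#   remote_report = ray.remote(report_versions)
#   worker = ray.get(remote_report.remote(packages))
#   print(diff_versions(report_versions(packages), worker))
```

An empty result means the worker sees the same versions as the driver, which rules out an environment mismatch but not an incompatibility between the installed versions themselves.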
Thank you @smorad. Yeah I actually modified ray_gcm.py so I can pass the edge_selector params from a config file.
# GCM Edge Selectors
from ray_graph_conv_memory.gcm.edge_selectors.temporal import TemporalBackedge

class RayDenseGCM(TorchModelV2, nn.Module):
    EDGE_SELECTORS = {"TemporalBackedge": TemporalBackedge}
    ...
        # Config Edge selectors
        self.cfg["edge_selectors"] = self.EDGE_SELECTORS[custom_model_kwargs["edge_selectors"]](**custom_model_kwargs["edge_selectors_params"])
And I got the same error when I tried to revert my changes and run your example above.
Here are my package versions. I noticed that I am using Torch geometric version 2.0.1 instead of 1.7.2 like you, so I will try downgrading Torch geometric to see if the issue persists.
Ray version 1.6.0
Torch version 1.9.0+cu111
Torch geometric version 2.0.1
Python version sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
Regarding your suggestion about package discrepancies between the workers and the parent: I pip-installed ray into a conda env and ran my test in the VS Code debugger.
Another possible cause is that I hand-picked your recent changes and copied them into my project to fit its directory structure, so I will double-check that I have all the updated code.
Hi @smorad, okay, that was it. The issue was Torch geometric version 2.0.1. After I downgraded to 1.7.2, the example works.
Thank you so much for your help @smorad!
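Since the fix turned out to be a version pin, a small guard at the top of a training script can flag the mismatch before Ray starts. The sketch below is only an illustration: `KNOWN_GOOD` records the versions reported as working in this thread, and `release_tuple` and `version_mismatches` are hypothetical helpers that ignore local build suffixes such as `+cu111`.

```python
from typing import Dict, Tuple

# Versions reported as working in this thread (not an official support matrix).
KNOWN_GOOD = {"ray": "1.6.0", "torch": "1.9.0", "torch_geometric": "1.7.2"}


def release_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.9.0+cu111' into (1, 9, 0), dropping the local build suffix."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split(".") if part.isdigit())


def version_mismatches(installed: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {package: (installed, expected)} where the release differs."""
    return {
        name: (installed[name], expected)
        for name, expected in KNOWN_GOOD.items()
        if name in installed
        and release_tuple(installed[name]) != release_tuple(expected)
    }


if __name__ == "__main__":
    # The failing environment described above.
    failing = {"ray": "1.6.0", "torch": "1.9.0+cu111", "torch_geometric": "2.0.1"}
    print(version_mismatches(failing))
```

In a real script the `installed` dict could be filled from `ray.__version__`, `torch.__version__`, and `torch_geometric.__version__`, the same attributes the example above prints.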
Hi Team,
I am trying to reproduce the results of your RLlib example with RLlib 1.6.0, but I am seeing serialization errors. Which RLlib version did your team use, and which Python version were you running? Starting with version 1.6.0, RLlib requires Python 3.8+.
Given Rllib example
Serialization errors