ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Received message larger than max (105683136 vs. 104857600) #24286

Open allendred opened 2 years ago

allendred commented 2 years ago

What happened + What you expected to happen

With the reproduction script below, the following error appears whenever config["num_workers"] is not 0:

Traceback (most recent call last):
  File "ray_test.py", line 135, in <module>
    trainer = sac.SACTrainer(config=config, env="my_env-v0")
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/rllib/agents/sac/sac.py", line 192, in __init__
    super().__init__(*args, **kwargs)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 831, in __init__
    config, logger_creator, remote_checkpoint_dir, sync_function_tpl
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/tune/trainable.py", line 149, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 918, in setup
    logdir=self.logdir,
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py", line 119, in __init__
    self.add_workers(num_workers)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py", line 242, in add_workers
    for i in range(num_workers)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py", line 242, in <listcomp>
    for i in range(num_workers)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py", line 608, in _make_worker
    disable_env_checking=config["disable_env_checking"],
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/actor.py", line 540, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/util/tracing/tracing_helper.py", line 383, in _invocation_actor_class_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/actor.py", line 743, in _remote
    if client_mode_should_convert(auto_init=True):
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 124, in client_mode_should_convert
    ray.init()
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/worker.py", line 1100, in init
    hook()
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/tune/registry.py", line 191, in flush_values
    _make_key(self._prefix, category, key), value, overwrite=True
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/experimental/internal_kv.py", line 88, in _internal_kv_put
    return global_gcs_client.internal_kv_put(key, value, overwrite, namespace) == 0
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/_private/gcs_utils.py", line 104, in wrapper
    return f(self, *args, **kwargs)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/_private/gcs_utils.py", line 195, in internal_kv_put
    reply = self._kv_stub.InternalKVPut(req)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.RESOURCE_EXHAUSTED
    details = "Received message larger than max (105683136 vs. 104857600)"
    debug_error_string = "{"created":"@1651138858.283821451","description":"Error received from peer ipv4:192.168.83.225:49167","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Received message larger than max (105683136 vs. 104857600)","grpc_status":8}"

Versions / Dependencies

ray==1.12.0 ray[rllib]

Reproduction script

import gym
import ray.rllib.agents.sac as sac
from ray.rllib.agents.sac import SACTrainer
from ray.tune.registry import register_env

from ray_env import Grid  # custom environment module


def env_creator(env_config):
    # raw_env is not defined anywhere in the script as posted
    return Grid({"env": raw_env})


config = sac.DEFAULT_CONFIG.copy()
# config["num_gpus"] = 0
config["num_workers"] = 1  # any non-zero value triggers the error
config["framework"] = "torch"

register_env("my_env-v0", env_creator)
trainer = sac.SACTrainer(config=config, env="my_env-v0")

Issue Severity

High: It blocks me from completing my task.

xwjiang2010 commented 2 years ago

From the stack trace, it seems that a large object is being passed when the remote worker is created, which causes the gRPC RESOURCE_EXHAUSTED error. This is consistent with your observation that it only happens when config["num_workers"] is not 0.
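
A rough way to test this hypothesis is to measure how large the object captured by the env creator (raw_env in the script) becomes once serialized, and compare that with the 104857600-byte limit quoted in the error. The sketch below uses the standard pickle module as an approximation; Ray serializes with its own pickler, so real sizes may differ. The helper names pickled_size and exceeds_limit are hypothetical, and the bytes objects are stand-ins for the real payload.

```python
import pickle

# Figures taken from the error message.
RECEIVED_BYTES = 105_683_136
GRPC_MAX_BYTES = 104_857_600  # equals 100 * 1024 * 1024, i.e. 100 MiB

# How far the rejected message was over the limit, in bytes.
overshoot = RECEIVED_BYTES - GRPC_MAX_BYTES


def pickled_size(obj) -> int:
    """Return the number of bytes obj occupies once pickled."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def exceeds_limit(obj, limit: int = GRPC_MAX_BYTES) -> bool:
    """Return True if the pickled object alone is larger than limit."""
    return pickled_size(obj) > limit


# Stand-in payloads; replace with the object the env creator captures.
small = bytes(16)
large = bytes(2048)

print("overshoot in bytes:", overshoot)
print("small exceeds 100 MiB:", exceeds_limit(small))
# A tiny limit is used here only to exercise the "too large" branch cheaply.
print("large exceeds 1 KiB:", exceeds_limit(large, limit=1024))
```

If the captured object turns out to be close to or above the limit, building it inside env_creator (for example from a path passed in env_config) rather than capturing it from the driver script may keep the registered function small. That is a suggestion under the assumption above, not a confirmed fix.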

Do you have an env_creator that I can just plug in and run? The current one fails because ray_env is missing.

allendred commented 2 years ago

Sorry about that, the custom ray_env is confidential. I tried the example and it works. I also run ray==1.6.0 on another server with no issues, but on this server I get the following error:

  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/rllib/models/preprocessors.py", line 187, in transform
    self.check_shape(observation)
  File "/home/gnn/conda/envs/gnn/lib/python3.6/site-packages/ray/rllib/models/preprocessors.py", line 68, in check_shape
    observation, self._obs_space)
ValueError: ('Observation ({}) outside given space ({})!', array([0.]), Box([0.], [20.], (1,), float32))
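
One possible reading of this ValueError: the value 0. lies inside [0, 20], so the bounds themselves are unlikely to be the problem. The observation prints as array([0.]), which is how NumPy displays a float64 array, while the space is declared as float32. Some gym releases make the Box membership test also require that the observation dtype can be safely cast to the space dtype; whether the gym version installed on this server does so is an assumption. The sketch below imitates such a check with plain NumPy. box_contains is a hypothetical stand-in, not gym's code.

```python
import numpy as np


def box_contains(x, low, high, dtype=np.float32) -> bool:
    """Hypothetical Box-style membership test that also checks the dtype."""
    x = np.asarray(x)
    return bool(
        np.can_cast(x.dtype, dtype)  # float64 -> float32 is not a safe cast
        and x.shape == low.shape
        and np.all(x >= low)
        and np.all(x <= high)
    )


low = np.array([0.0], dtype=np.float32)
high = np.array([20.0], dtype=np.float32)

obs64 = np.array([0.0])           # default dtype float64, shown as array([0.])
obs32 = obs64.astype(np.float32)  # same value stored as float32

result64 = box_contains(obs64, low, high)
result32 = box_contains(obs32, low, high)
print("float64 observation accepted:", result64)
print("float32 observation accepted:", result32)
```

If this is the cause, returning observations as float32 from the environment, for example np.array([0.0], dtype=np.float32), would be one way to satisfy the check.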