ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup. #19834

Closed JackonLiu closed 1 year ago

JackonLiu commented 2 years ago

Search before asking

Ray Component

Ray Tune

What happened + What you expected to happen

2021-10-28 18:01:24,117 INFO services.py:1255 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "E:\software\conda\lib\site-packages\ray\node.py", line 265, in __init__
    self.redis_password)
  File "E:\software\conda\lib\site-packages\ray\_private\services.py", line 276, in wait_for_node
    raise TimeoutError("Timed out while waiting for node to startup.")
TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "E:\software\PyCharm 2020.2.5\plugins\python\helpers\pydev\pydevd.py", line 1448, in _exec pydev_imports.execfile(file, globals, locals) # execute the script File "E:\software\PyCharm 2020.2.5\plugins\python\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile exec(compile(contents+"\n", file, 'exec'), glob, loc) File "E:/document/JacksonProject/wrap_angle/angle_ray.py", line 7, in ray.init() File "E:\software\conda\lib\site-packages\ray_private\client_mode_hook.py", line 89, in wrapper return func(*args, **kwargs) File "E:\software\conda\lib\site-packages\ray\worker.py", line 897, in init ray_params=ray_params) File "E:\software\conda\lib\site-packages\ray\node.py", line 268, in init "The current node has not been updated within 30 " Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup. [0x7FF97CFAE0A4] ANOMALY: use of REX.w is meaningless (default operand size is 64)

Versions / Dependencies

Name: ray
Version: 1.7.1
Summary: Ray provides a simple, universal API for building distributed applications.
Home-page: https://github.com/ray-project/ray

Name: numpy
Version: 1.19.5

Windows 10, Anaconda, Python 3.7

Reproduction script

import ray
from ray.tune import register_trainable, run_experiments

import numpy as np
from ray.tune.utils import pin_in_object_store, get_pinned_object

ray.init()

# X_id can be referenced in closures
X_id = pin_in_object_store(np.random.random(size=100000000))

def f(config, reporter):
    X = get_pinned_object(X_id)
    # use X

register_trainable("f", f)
run_experiments(...)

Anything else

It happens every time.

Are you willing to submit a PR?

keyboardAnt commented 2 years ago

I just got the same cryptic exception when trying to init Ray on CentOS.

Traceback

Traceback (most recent call last):
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/node.py", line 238, in __init__
    ray._private.services.wait_for_node(
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/_private/services.py", line 324, in wait_for_node
    raise TimeoutError("Timed out while waiting for node to startup.")
TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/.../ir-erank-2021/ir/__main__.py", line 90, in <module>
    main()
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/knockknock/slack_sender.py", line 105, in wrapper_sender
    raise ex
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/knockknock/slack_sender.py", line 63, in wrapper_sender
    value = func(*args, **kwargs)
  File "/home/.../ir-erank-2021/ir/__main__.py", line 54, in main
    ray.init(local_mode=hyperparams.parser_args.ray_local_mode)
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/worker.py", line 908, in init
    _global_node = ray.node.Node(
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/node.py", line 242, in __init__
    raise Exception(
Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.

OS

$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
outdoteth commented 2 years ago

This happens to me whenever I try to update an existing node with: ray up -y my_config.yaml.

The head node updates fine, but any worker nodes shut down and restart completely, which takes a lot of time.

LuisFelipeLeivaH commented 2 years ago

This happens to me when trying to do ray.init() on an HPC compute cluster on Compute Canada.

EricCousineau-TRI commented 2 years ago

Happening to me as well, with a more-or-less vanilla cluster setup on AWS EC2 (but on a private subnet).

Using cached instance: https://docs.ray.io/en/releases-1.9.2/cluster/config.html#cluster-configuration-cache-stopped-nodes

Has anyone figured out which logs to look at to get more details? I'd consider inspecting the worker nodes, but that's hard to do when Ray auto-stops the instance.

EricCousineau-TRI commented 2 years ago

I used a workaround to inspect pre-shutdown logs after restarting the node: https://github.com/ray-project/ray/issues/22707#issuecomment-1054675306

When looking through them, I see two types of logs - one showing things being OK, and one showing things that are NOT OK: https://gist.github.com/EricCousineau-TRI/4822b8be94fccc7483a51040e7f44d47

The main finding from the failing node is gRPC failing to start:

grpc_server.cc:102:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. [...] Try running lsof -i :8076 to check if there are other processes listening to the port.

However, I can't run lsof -i :8076 because ray has already shut down the node :(

EDIT: I can reproduce by re-running ray up <config_file>, and it seems to happen when restarting the worker's ray. I don't understand why, though, since we explicitly add ray stop as is the default: https://docs.ray.io/en/releases-1.9.2/cluster/config.html#cluster-configuration-worker-start-ray-commands
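For anyone hitting the same gRPC port error, here is a minimal sketch of one way to check whether the port (8076, taken from the log above) is still held by a stale process before re-running ray up; this is only an illustration, not Ray's own startup check:

import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when something accepts the connection,
        # i.e. the port is already in use.
        return s.connect_ex((host, port)) != 0

if not port_is_free(8076):
    print("Port 8076 is still in use; a stale Ray process may be running.")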

EricCousineau-TRI commented 2 years ago

OK, so if I ensure I only call ray up once, then I do not get this issue. If I call it twice or more, then I get the port conflict on the worker.

However, if I call ray up --no-restart, then it's fine.

But this isn't great if I want to manually start a worker node and then have Ray use it. I also expect ray up to be idempotent - especially since ray stop is explicitly in the worker's start commands.

Are these the right expectations?

And more important to this issue - is this what any of y'all are experiencing as well?

robertreaney commented 2 years ago

This happens to me when trying to do ray.init() on an HPC compute cluster on Compute Canada.

I'm getting the same error on HPC.

AdamYoung71 commented 2 years ago

Getting the same error on HPC; can't start Ray with ray start --head. The error: raise TimeoutError("Timed out while waiting for node to startup.")

utkarshp commented 2 years ago

Getting the same error on Debian when trying to run without a GPU. Works fine in an identical environment (pytorch with cpuonly) on a machine that does have a GPU.

lundybernard commented 2 years ago

Same error with docker-compose, using the latest rayproject/ray Docker image and the command ray start -v --head ...

rkooo567 commented 2 years ago

@lundybernard is this also on Windows?

lundybernard commented 2 years ago

@lundybernard is this also on Windows?

No, this is on macOS on an M1 Mac; the container is running on platform: linux/amd64.

eromoe commented 1 year ago

@rkooo567 Got this error on Windows too.

mattip commented 1 year ago

Could someone provide a clear reproducer and description of the hardware/software stack?

1121091694 commented 1 year ago

TimeoutError                              Traceback (most recent call last)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\node.py:312, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    311 try:
--> 312     ray._private.services.wait_for_node(
    313         self.redis_address,
    314         self.gcs_address,
    315         self._plasma_store_socket_name,
    316         self.redis_password,
    317     )
    318 except TimeoutError:

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\services.py:438, in wait_for_node(redis_address, gcs_address, node_plasma_store_socket_name, redis_password, timeout)
    437     time.sleep(0.1)
--> 438 raise TimeoutError("Timed out while waiting for node to startup.")

TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
Input In [2], in <cell line: 7>()
      5 config = PPOConfig().training(gamma=0.9, lr=0.01, kl_coeff=0.3).resources(num_gpus=0).rollouts(num_rollout_workers=1)
      6 print(config.to_dict())
----> 7 algo = config.build(env="CartPole-v1")

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm_config.py:471, in AlgorithmConfig.build(self, env, logger_creator, use_copy)
    468 if logger_creator is not None:
    469     self.logger_creator = logger_creator
--> 471 return self.algo_class(
    472     config=self if not use_copy else copy.deepcopy(self),
    473     logger_creator=self.logger_creator,
    474 )

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py:424, in Algorithm.__init__(self, config, env, logger_creator, **kwargs)
--> 424 super().__init__(
    425     config=config,
    426     logger_creator=logger_creator,
    427     **kwargs,
    428 )

File E:\Anaconda\envs\rllib\lib\site-packages\ray\tune\trainable\trainable.py:167, in Trainable.__init__(self, config, logger_creator, remote_checkpoint_dir, custom_syncer, sync_timeout)
    165 start_time = time.time()
    166 self._local_ip = ray.util.get_node_ip_address()
--> 167 self.setup(copy.deepcopy(self.config))

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py:542, in Algorithm.setup(self, config)
--> 542 self.workers = WorkerSet(
    543     env_creator=self.env_creator,
    544     validate_env=self.validate_env,
    545     default_policy_class=self.get_default_policy_class(self.config),
    546     config=self.config,
    547     num_workers=self.config["num_workers"],
    548     local_worker=True,
    549     logdir=self.logdir,
    550 )

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:151, in WorkerSet.__init__(self, env_creator, validate_env, default_policy_class, config, num_workers, local_worker, logdir, _setup, policy_class, trainer_config)
    149 # Create a number of @ray.remote workers.
    150 self._remote_workers = []
--> 151 self.add_workers(
    152     num_workers,
    153     validate=config.validate_workers_after_construction,
    154 )

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:474, in WorkerSet.add_workers(self, num_workers, validate)
    472 old_num_workers = len(self._remote_workers)
    473 self._remote_workers.extend(
--> 474     [
    475         self._make_worker(
    476             cls=self._cls,
    477             env_creator=self._env_creator,
    478             validate_env=None,
    479             worker_index=old_num_workers + i + 1,
    480             num_workers=old_num_workers + num_workers,
    481             config=self._remote_config,
    482         )
    483         for i in range(num_workers)
    484     ]
    485 )

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:475, in <listcomp>(.0)
--> 475 self._make_worker(
    476     cls=self._cls,
    477     env_creator=self._env_creator,
    478     validate_env=None,
    479     worker_index=old_num_workers + i + 1,
    480     num_workers=old_num_workers + num_workers,
    481     config=self._remote_config,
    482 )

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:785, in WorkerSet._make_worker(self, cls, env_creator, validate_env, worker_index, num_workers, recreated_worker, config, spaces)
--> 785 worker = cls(
    786     env_creator=env_creator,
    787     validate_env=validate_env,
    788     default_policy_class=self._policy_class,
    789     tf_session_creator=(session_creator if config["tf_session_args"] else None),
    790     config=config,
    791     worker_index=worker_index,
    792     num_workers=num_workers,
    793     recreated_worker=recreated_worker,
    794     log_dir=self._logdir,
    795     spaces=spaces,
    796     dataset_shards=self._ds_shards,
    797 )

File E:\Anaconda\envs\rllib\lib\site-packages\ray\actor.py:529, in ActorClass.remote(self, *args, **kwargs)
--> 529 return self._remote(args=args, kwargs=kwargs, **self._default_options)

File E:\Anaconda\envs\rllib\lib\site-packages\ray\util\tracing\tracing_helper.py:387, in _tracing_actor_creation.<locals>._invocation_actor_class_remote_span(self, args, kwargs, *_args, **_kwargs)
    385 if not _is_tracing_enabled():
    386     assert "_ray_trace_ctx" not in kwargs
--> 387     return method(self, args, kwargs, *_args, **_kwargs)

File E:\Anaconda\envs\rllib\lib\site-packages\ray\actor.py:764, in ActorClass._remote(self, args, kwargs, **actor_options)
    761 if actor_options.get("max_concurrency") is None:
    762     actor_options["max_concurrency"] = 1000 if is_asyncio else 1
--> 764 if client_mode_should_convert(auto_init=True):
    765     return client_mode_convert_actor(self, args, kwargs, **actor_options)

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\client_mode_hook.py:124, in client_mode_should_convert(auto_init)
    120 if (
    121     os.environ.get("RAY_ENABLE_AUTO_CONNECT", "") != "0"
    122     and not ray.is_initialized()
    123 ):
--> 124     ray.init()

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104     return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\worker.py:1428, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
   1424 # Start the Ray processes. We set shutdown_at_exit=False because we
   1425 # shutdown the node in the ray.shutdown call that happens in the atexit
   1426 # handler. We still spawn a reaper process in case the atexit handler
   1427 # isn't called.
-> 1428 _global_node = ray._private.node.Node(
   1429     head=True, shutdown_at_exit=False, spawn_reaper=True, ray_params=ray_params
   1430 )

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\node.py:319, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    318 except TimeoutError:
--> 319     raise Exception(
    320         "The current node has not been updated within 30 "
    321         "seconds, this could happen because of some of "
    322         "the Ray processes failed to startup."
    323     )

Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.

My env: Windows, ray 3.0.0.dev0. The code is just ray.init().

How can I solve this problem on Windows? Thanks.

jzxycsjzy commented 1 year ago

This can happen when an error in a previous run left a Ray node started but never stopped. I hit the same problem and used the code below to work around it.

ray.shutdown()

ray.init()

Then it does work.
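A minimal sketch of the same workaround with a guard, assuming a leftover session from a previous crashed run is the cause:

import ray

# Shut down any session left over from a previous (possibly crashed) run,
# then start a clean one.
if ray.is_initialized():
    ray.shutdown()
ray.init(ignore_reinit_error=True)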

mattip commented 1 year ago

@1121091694 could you give more information about your environment (where did you get Python, do you have an NVIDIA GPU as well as a CPU, which exact version of the nightly are you using)? It seems you are using a nightly (3.0.0.dev0); does the latest official release also fail?

mattip commented 1 year ago

I think we should close this. We have not gotten a complete report from a user who hits this; instead, we keep getting partial reports.

Nikita-Dudorov commented 1 year ago

This happens to me when trying to do ray.init() on an HPC compute cluster on Compute Canada.

Did you manage to resolve it? Is there any way to run Ray on Compute Canada?

jpgard commented 1 year ago

I'm still experiencing this issue. It is probably due to the large number of jobs that are running/have run on a Slurm cluster, but there is no way to debug it further. There isn't any information in the logs, as far as I can tell.

Anyone able to work around this somehow?

To the developers: sorry, providing a reproducible example for this is pretty difficult. But I am on Ray 2.2 with Rocky Linux 8.5.

Update: in my case, sometimes just trying to .init() again solves the problem.

vertfreeber commented 1 year ago

(Running on Windows 10)

I think I found a solution, but I don't know if I should laugh or cry right now...

So basically I had the same error messages after using ray.init() or ray start --head: "Timed out after 60 seconds while waiting for node to startup. Did not find socket "socket name" in the list of object store socket names" and "The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup."

I had no clue what this meant, and Google + ChatGPT had no answers that worked for me. I decided to find it myself and wasted a lot of time debugging the Ray code in my IDE and sifting through the logs, trying to understand what exactly was happening, hoping to find the error. While looking through the logs I found this in raylet.err:

"[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:581] String field 'ray.rpc.GcsNodeInfo.node_manager_hostname' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes. "

I skipped it at first because I didn't understand it and thought maybe it was setting the hostname to None or something because it didn't find a node. But after several hours of trying other things and getting desperate, I luckily came back to this error and thought to myself: "Hey, what could they mean by node_manager_hostname?"

And then it hit me.

I instantly hit the Windows key, opened my system settings, and went to the info tab. And there it was, the root of my problems:

"Device name: der_gerät"

The stupid name I gave my PC stopped me from using Ray and made me debug for at least 4 hours over multiple days. I don't know why, but I guess the hostname with an ä wasn't being serialized as valid UTF-8, haha.

After changing my PC's name to something without special German letters, ray.init() finally started to work. I hope this helps someone else, because I'm sure I will tell my colleagues (or, in my case, my fellow CS student friends) about this stupid bug.

Cheers! 😄
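A minimal sketch for checking whether the local hostname contains non-ASCII characters before calling ray.init(); the helper below is illustrative, not part of Ray:

import socket

hostname = socket.gethostname()
try:
    hostname.encode("ascii")
    print(f"Hostname {hostname!r} is plain ASCII.")
except UnicodeEncodeError:
    print(f"Hostname {hostname!r} contains non-ASCII characters; "
          "consider renaming the machine before running ray.init().")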

tongjingqi commented 3 months ago

My error is:

sampleing ===== SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=0.0, top_p=1, top_k=-1, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['Question:', 'Question', 'USER:', 'USER', 'ASSISTANT:', 'ASSISTANT', 'Instruction:', 'Instruction', 'Response:', 'Response'], ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True)

Traceback (most recent call last):
  File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/node.py", line 318, in __init__
    ray._private.services.wait_for_node(
  File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/services.py", line 464, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2024-06-06_11-42-53_463432_8501/sockets/plasma_store in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/mnt/data/user/zhao_jun/MetaMath/eval/eval_GSM8K_category.py", line 134, in gsm8k_test(model=args.model, data_path=args.data_file, start=args.start, end=args.end, batch_size=args.batch_size, tensor_parallel_size=args.tensor_parallel_size) File "/mnt/data/user/zhao_jun/MetaMath/eval/eval_GSM8K_category.py", line 92, in gsm8k_test llm = LLM(model=model,tensor_parallel_size=tensor_parallel_size) File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in init self.llm_engine = LLMEngine.from_engine_args(engine_args) File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 228, in from_engine_args distributed_init_method, placement_group = initialize_cluster( File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/vllm/engine/ray_utils.py", line 77, in initialize_cluster ray.init(address=ray_address, ignore_reinit_error=True) File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/worker.py", line 1645, in init _global_node = ray._private.node.Node( File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/node.py", line 323, in init raise Exception( Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.

After troubleshooting for a long time, I found that the disk was full, so the intermediate files Ray needs could not be created. After cleaning up the disk, the problem was solved.
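A minimal sketch of a pre-flight check for this failure mode, assuming Ray's session files go under /tmp (the default temp root):

import shutil

# Ray writes session files (sockets, logs, the object store) under its
# temp root, /tmp by default; a full disk makes node startup time out.
free_gb = shutil.disk_usage("/tmp").free / 1e9
print(f"Free space under /tmp: {free_gb:.1f} GB")
if free_gb < 1.0:
    raise RuntimeError("Not enough free disk space for Ray session files.")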