Open hxue3 opened 3 months ago
Can you provide a full stack trace when you see
I am trying to load a quantized large model with vLLM. It is able to start the model loading, but it sometimes will stop loading the model and returns the error message ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.
ray, version 2.34.0
/usr/bin/ssh ray hpcm04r08n03.hpc.ford.com 'ray start --address='19.62.140.84:6379''
ssh-keysign: no matching hostkey found for key ED25519 SHA256:vWBZ07sgIiNdjV05CoWXNtpeMzvzg/mHz2QSX4QWJSw
ssh_keysign: no reply
sign using hostkey ssh-ed25519 SHA256:vWBZ07sgIiNdjV05CoWXNtpeMzvzg/mHz2QSX4QWJSw failed
Permission denied, please try again.
Permission denied, please try again.
hxue3@hpcm04r08n03.hpc.ford.com: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,keyboard-interactive,hostbased).
2024-08-19 17:51:36,012 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/pbs.8474282.hpcq/ray/session_2024-08-19_17-51-20_670517_323/logs/ray-data
2024-08-19 17:51:36,012 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCSV] -> ActorPoolMapOperator[MapBatches(LLMPredictor)]
(_MapWorker pid=7279) INFO 08-19 17:51:40 config.py:484] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(_MapWorker pid=7279) INFO 08-19 17:51:40 config.py:729] Defaulting to use ray for distributed inference
(_MapWorker pid=7279) INFO 08-19 17:51:40 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8', speculative_config=None, tokenizer='/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=25000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fbgemm_fp8, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
(_MapWorker pid=7279) INFO 08-19 17:51:40 ray_gpu_executor.py:117] use_ray_spmd_worker: False
(_MapWorker pid=7279) INFO 08-19 17:51:40 ray_gpu_executor.py:120] driver_ip: 19.62.140.84
(_MapWorker pid=7279) INFO 08-19 17:52:13 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
(_MapWorker pid=7279) INFO 08-19 17:52:13 selector.py:54] Using XFormers backend.
(_MapWorker pid=7279) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(_MapWorker pid=7279) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(_MapWorker pid=7279) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(_MapWorker pid=7279) INFO 08-19 17:52:22 utils.py:841] Found nccl from library libnccl.so.2
(_MapWorker pid=7279) INFO 08-19 17:52:22 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:13 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache. [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:13 selector.py:54] Using XFormers backend. [repeated 7x across cluster]
(_MapWorker pid=7279) INFO 08-19 17:52:29 custom_all_reduce_utils.py:234] reading GPU P2P access cache from /s/hxue3/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:22 utils.py:841] Found nccl from library libnccl.so.2 [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:22 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 7x across cluster]
(_MapWorker pid=7279) INFO 08-19 17:52:29 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fa97c98c880>, local_subscribe_port=60379, remote_subscribe_port=None)
(_MapWorker pid=7279) INFO 08-19 17:52:29 model_runner.py:720] Starting to load model /s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8...
(_MapWorker pid=7279) INFO 08-19 17:52:29 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
(_MapWorker pid=7279) INFO 08-19 17:52:29 selector.py:54] Using XFormers backend.
(_MapWorker pid=7279) Loading safetensors checkpoint shards: 0% Completed | 0/109 [00:00<?, ?it/s]
(RayWorkerWrapper pid=8266) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch. [repeated 15x across cluster]
(RayWorkerWrapper pid=8266) @torch.library.impl_abstract("xformers_flash::flash_fwd") [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) @torch.library.impl_abstract("xformers_flash::flash_bwd") [repeated 7x across cluster]
(_MapWorker pid=7279) Loading safetensors checkpoint shards: 1% Completed | 1/109 [00:04<07:43, 4.29s/it]
(_MapWorker pid=7279) Loading safetensors checkpoint shards: 2% Completed | 2/109 [00:08<07:50, 4.40s/it]
...
(_MapWorker pid=7279) Loading safetensors checkpoint shards: 40% Completed | 44/109 [00:56<00:17, 3.82it/s]
(_MapWorker pid=7279) Loading safetensors checkpoint shards: 41% Completed | 45/109 [00:59<00:59, 1.08it/s]
(_MapWorker pid=7279) Loading safetensors checkpoint shards: 42% Completed | 46/109 [01:09<03:55, 3.73s/it]
...
(_MapWorker pid=7279) Loading safetensors checkpoint shards: 80% Completed | 87/109 [08:43<03:52, 10.56s/it]
(_MapWorker pid=7279) Loading safetensors checkpoint shards: 81% Completed | 88/109 [08:55<03:51, 11.02s/it]
2024-08-19 18:01:36,067 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2024-08-19 18:01:36,067 ERROR exceptions.py:81 -- Full stack trace:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 135, in start
    ray.get(refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2659, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 848, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 3511, in ray._raylet.CoreWorker.get_objects
  File "python/ray/includes/common.pxi", line 81, in ray._raylet.check_status
ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/data/exceptions.py", line 49, in handle_trace
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/plan.py", line 423, in execute_to_iterator
    bundle_iter = execute_to_legacy_bundle_iterator(executor, self)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/legacy_compat.py", line 51, in execute_to_legacy_bundle_iterator
    bundle_iter = executor.execute(dag, initial_stats=stats)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/streaming_executor.py", line 114, in execute
    self._topology, _ = build_streaming_topology(dag, self._options)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/streaming_executor_state.py", line 354, in build_streaming_topology
    setup_state(dag)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/streaming_executor_state.py", line 351, in setup_state
    op.start(options)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 137, in start
    raise ray.exceptions.GetTimeoutError(
ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool.
ray.data.exceptions.SystemException

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/s/mlsc/hxue3/llama3-vllm/run_inference_quantized.py", line 99, in <module>
    outputs = ds.take_all()
  File "/usr/local/lib/python3.10/dist-packages/ray/data/dataset.py", line 2464, in take_all
    for row in self.iter_rows():
  File "/usr/local/lib/python3.10/dist-packages/ray/data/iterator.py", line 238, in _wrapped_iterator
    for batch in batch_iterable:
  File "/usr/local/lib/python3.10/dist-packages/ray/data/iterator.py", line 155, in _create_iterator
    ) = self._to_ref_bundle_iterator()
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/iterator/iterator_impl.py", line 28, in _to_ref_bundle_iterator
    ref_bundles_iterator, stats, executor = ds._plan.execute_to_iterator()
  File "/usr/local/lib/python3.10/dist-packages/ray/data/exceptions.py", line 89, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool.
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 custom_all_reduce_utils.py:234] reading GPU P2P access cache from /s/hxue3/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 model_runner.py:720] Starting to load model /s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8... [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache. [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 selector.py:54] Using XFormers backend. [repeated 7x across cluster]
@hxue3 this line...
ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool.
looks suspect - if you try to repro this with a brand new Ray cluster, what does ray status return when it times out?
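For reference, a minimal sketch (not from the original thread) of reading the same resource information from a driver attached to the cluster; running ray status on the head node reports the equivalent summary from the CLI:

import ray

# Attach to the already-running cluster instead of starting a new one.
ray.init(address="auto")

# Total resources registered with the cluster vs. what is free right now.
print(ray.cluster_resources())
print(ray.available_resources())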
This error is related to the actor startup timeout. As the model size increases, the time required to download the model from Hugging Face and to load it into vLLM also increases. You can avoid the GetTimeoutError by increasing the wait_for_min_actors_s value in the DataContext.
import ray
from ray.data import DataContext

# Ray init (pass your Hugging Face token to the workers via the runtime env).
runtime_env = {"env_vars": {"HF_TOKEN": "__YOUR_HF_TOKEN__"}}
ray.init(runtime_env=runtime_env)

# Data context: raise the actor-pool start timeout before executing the dataset.
# tensor_parallel_size was left undefined in the original snippet; define it to
# match the value passed to vLLM (8 in the logs above).
tensor_parallel_size = 8
ctx = DataContext.get_current()
ctx.wait_for_min_actors_s = 60 * 10 * tensor_parallel_size
The wait_for_min_actors_s value is set to 60 * 10 seconds and multiplied by tensor_parallel_size. With 8 GPUs, that gives you up to 80 minutes for downloading the model and starting vLLM.
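As a usage sketch of where this fits in a driver script (the LLMPredictor stand-in, the prompts.csv input, and the batch size below are illustrative assumptions, not taken from the original run_inference_quantized.py), the key point is that the DataContext change must be made before the dataset is executed, e.g. before ds.take_all():

import ray
from ray.data import DataContext

tensor_parallel_size = 8  # illustrative; match your vLLM setting

ray.init()

# Must run before any Ray Data execution is triggered.
ctx = DataContext.get_current()
ctx.wait_for_min_actors_s = 60 * 10 * tensor_parallel_size

class LLMPredictor:
    # Stand-in for the vLLM-backed predictor class; the real one loads the model.
    def __call__(self, batch):
        batch["generated_text"] = ["<output>" for _ in batch["prompt"]]
        return batch

ds = ray.data.read_csv("prompts.csv")  # assumed input with a "prompt" column
ds = ds.map_batches(LLMPredictor, concurrency=1, batch_size=16)
outputs = ds.take_all()  # execution (and the actor-start timeout) begins here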
cc @hxue3 as FYI - kudos @ggamsso!
@hxue3 lmk if this was fixed!
I'm not the original poster, but this worked for me as I was loading a model from S3; it was taking more than 10 minutes for 70B+ models, and increasing the timeout fixed the issue. Thanks!
Hi @rkooo567!
Do you know if @ggamsso's solution will work if I set this parameter in the driver code after ray.init when my cluster is created using KubeRay? In general, if a cluster is created with KubeRay, should I set this parameter in the Kubernetes config that creates the pod, or would setting it in the driver code work?
Please let me know.
What happened + What you expected to happen
I am trying to load a quantized large model with vLLM. It is able to start the model loading, but it sometimes will stop loading the model and returns the error message
ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.
Versions / Dependencies
ray: 2.34.0
Python: 3.10
OS: Ubuntu 22
Reproduction script
Issue Severity
High: It blocks me from completing my task.