ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready. #47183

Open hxue3 opened 3 months ago

hxue3 commented 3 months ago

What happened + What you expected to happen

I am trying to load a large quantized model with vLLM. The model starts loading, but sometimes loading stops partway through and the job fails with ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.

Versions / Dependencies

ray: 2.34.0, Python: 3.10, OS: Ubuntu 22

Reproduction script

# Imports and run parameters (not shown in the original snippet). The values
# below are assumptions based on the log further down, which shows
# tensor_parallel_size=8 and a single vLLM engine instance.
from typing import Any, Dict, List

import numpy as np
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from vllm import LLM, SamplingParams

tensor_parallel_size = 8                 # GPUs per vLLM engine (8 in the log)
num_instances = 1                        # number of LLMPredictor actors
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)  # example values

# Create a class to do batch inference.
class LLMPredictor:

    def __init__(self):
        # Create an LLM (tensor_parallel_size is defined above).
        self.llm = LLM(
            model="/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8",
            tensor_parallel_size=tensor_parallel_size,
            gpu_memory_utilization=0.95,
            max_model_len=32768,
            max_num_batched_tokens=32768,
        )

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        # Generate texts from the prompts.
        # The output is a list of RequestOutput objects that contain the prompt,
        # generated text, and other information.
        outputs = self.llm.generate(batch["prompts"], sampling_params)
        prompt: List[str] = []
        generated_text: List[str] = []
        for output in outputs:
            prompt.append(output.prompt)
            generated_text.append(' '.join([o.text for o in output.outputs]))
        return {
            "prompt": prompt,
            "generated_text": generated_text,
        }

# parse_options (a pyarrow.csv.ParseOptions) is defined elsewhere in the original script.
ds = ray.data.read_csv("sample_prompts.csv", parse_options=parse_options)

# For tensor_parallel_size > 1, we need to create placement groups for vLLM
# to use. Every actor has to have its own placement group.
def scheduling_strategy_fn():
    # One bundle per tensor parallel worker
    pg = ray.util.placement_group(
        [{
            "GPU": 1,
            "CPU": 1
        }] * tensor_parallel_size,
        strategy="STRICT_PACK",
    )
    return dict(scheduling_strategy=PlacementGroupSchedulingStrategy(
        pg, placement_group_capture_child_tasks=True))

resources_kwarg: Dict[str, Any] = {}
if tensor_parallel_size == 1:
    # For tensor_parallel_size == 1, we simply set num_gpus=1.
    resources_kwarg["num_gpus"] = 1
else:
    # Otherwise, we have to set num_gpus=0 and provide
    # a function that will create a placement group for
    # each instance.
    resources_kwarg["num_gpus"] = 0
    resources_kwarg["ray_remote_args_fn"] = scheduling_strategy_fn

# Apply batch inference for all input data.
ds = ds.map_batches(
    LLMPredictor,
    # Set the concurrency to the number of LLM instances.
    concurrency=num_instances,
    # Specify the batch size for inference.
    batch_size=2,
    **resources_kwarg,
)
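
The traceback later in the thread shows the pipeline being consumed with ds.take_all(), so the script presumably ends with something like:

# Trigger execution and collect all generated rows (as seen in the traceback below).
outputs = ds.take_all()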

Issue Severity

High: It blocks me from completing my task.

rkooo567 commented 3 months ago

Can you provide a full stack trace for when you see:

I am trying to load a large quantized model with vLLM. The model starts loading, but sometimes loading stops partway through and the job fails with ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.
hxue3 commented 3 months ago

Can you provide a full stack trace for when you see:

I am trying to load a large quantized model with vLLM. The model starts loading, but sometimes loading stops partway through and the job fails with ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.
ray, version 2.34.0
/usr/bin/ssh ray hpcm04r08n03.hpc.ford.com 'ray start --address='19.62.140.84:6379''
ssh-keysign: no matching hostkey found for key ED25519 SHA256:vWBZ07sgIiNdjV05CoWXNtpeMzvzg/mHz2QSX4QWJSw
ssh_keysign: no reply
sign using hostkey ssh-ed25519 SHA256:vWBZ07sgIiNdjV05CoWXNtpeMzvzg/mHz2QSX4QWJSw failed
Permission denied, please try again.
Permission denied, please try again.
hxue3@hpcm04r08n03.hpc.ford.com: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,keyboard-interactive,hostbased).
2024-08-19 17:51:36,012 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/pbs.8474282.hpcq/ray/session_2024-08-19_17-51-20_670517_323/logs/ray-data
2024-08-19 17:51:36,012 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCSV] -> ActorPoolMapOperator[MapBatches(LLMPredictor)]
(_MapWorker pid=7279) INFO 08-19 17:51:40 config.py:484] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(_MapWorker pid=7279) INFO 08-19 17:51:40 config.py:729] Defaulting to use ray for distributed inference
(_MapWorker pid=7279) INFO 08-19 17:51:40 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8', speculative_config=None, tokenizer='/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=25000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fbgemm_fp8, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
(_MapWorker pid=7279) INFO 08-19 17:51:40 ray_gpu_executor.py:117] use_ray_spmd_worker: False
(_MapWorker pid=7279) INFO 08-19 17:51:40 ray_gpu_executor.py:120] driver_ip: 19.62.140.84
(_MapWorker pid=7279) INFO 08-19 17:52:13 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
(_MapWorker pid=7279) INFO 08-19 17:52:13 selector.py:54] Using XFormers backend.
(_MapWorker pid=7279) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(_MapWorker pid=7279)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(_MapWorker pid=7279)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(_MapWorker pid=7279) INFO 08-19 17:52:22 utils.py:841] Found nccl from library libnccl.so.2
(_MapWorker pid=7279) INFO 08-19 17:52:22 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:13 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache. [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:13 selector.py:54] Using XFormers backend. [repeated 7x across cluster]
(_MapWorker pid=7279) INFO 08-19 17:52:29 custom_all_reduce_utils.py:234] reading GPU P2P access cache from /s/hxue3/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:22 utils.py:841] Found nccl from library libnccl.so.2 [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:22 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 7x across cluster]
(_MapWorker pid=7279) INFO 08-19 17:52:29 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fa97c98c880>, local_subscribe_port=60379, remote_subscribe_port=None)
(_MapWorker pid=7279) INFO 08-19 17:52:29 model_runner.py:720] Starting to load model /s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8...
(_MapWorker pid=7279) INFO 08-19 17:52:29 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache.
(_MapWorker pid=7279) INFO 08-19 17:52:29 selector.py:54] Using XFormers backend.
(_MapWorker pid=7279) 
Loading safetensors checkpoint shards:   0% Completed | 0/109 [00:00<?, ?it/s]
(RayWorkerWrapper pid=8266) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch. [repeated 15x across cluster]
(RayWorkerWrapper pid=8266)   @torch.library.impl_abstract("xformers_flash::flash_fwd") [repeated 7x across cluster]
(RayWorkerWrapper pid=8266)   @torch.library.impl_abstract("xformers_flash::flash_bwd") [repeated 7x across cluster]
(_MapWorker pid=7279) 
Loading safetensors checkpoint shards:   1% Completed | 1/109 [00:04<07:43,  4.29s/it]
[... "Loading safetensors checkpoint shards" progress lines for shards 2/109 through 87/109 omitted for brevity; loading continued steadily until the timeout below ...]
(_MapWorker pid=7279) 
Loading safetensors checkpoint shards:  81% Completed | 88/109 [08:55<03:51, 11.02s/it]
2024-08-19 18:01:36,067 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2024-08-19 18:01:36,067 ERROR exceptions.py:81 -- Full stack trace:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 135, in start
    ray.get(refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2659, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 848, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 3511, in ray._raylet.CoreWorker.get_objects
  File "python/ray/includes/common.pxi", line 81, in ray._raylet.check_status
ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/data/exceptions.py", line 49, in handle_trace
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/plan.py", line 423, in execute_to_iterator
    bundle_iter = execute_to_legacy_bundle_iterator(executor, self)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/legacy_compat.py", line 51, in execute_to_legacy_bundle_iterator
    bundle_iter = executor.execute(dag, initial_stats=stats)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/streaming_executor.py", line 114, in execute
    self._topology, _ = build_streaming_topology(dag, self._options)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/streaming_executor_state.py", line 354, in build_streaming_topology
    setup_state(dag)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/streaming_executor_state.py", line 351, in setup_state
    op.start(options)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 137, in start
    raise ray.exceptions.GetTimeoutError(
ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool.
ray.data.exceptions.SystemException

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/s/mlsc/hxue3/llama3-vllm/run_inference_quantized.py", line 99, in <module>
    outputs = ds.take_all()
  File "/usr/local/lib/python3.10/dist-packages/ray/data/dataset.py", line 2464, in take_all
    for row in self.iter_rows():
  File "/usr/local/lib/python3.10/dist-packages/ray/data/iterator.py", line 238, in _wrapped_iterator
    for batch in batch_iterable:
  File "/usr/local/lib/python3.10/dist-packages/ray/data/iterator.py", line 155, in _create_iterator
    ) = self._to_ref_bundle_iterator()
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/iterator/iterator_impl.py", line 28, in _to_ref_bundle_iterator
    ref_bundles_iterator, stats, executor = ds._plan.execute_to_iterator()
  File "/usr/local/lib/python3.10/dist-packages/ray/data/exceptions.py", line 89, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool.
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 custom_all_reduce_utils.py:234] reading GPU P2P access cache from /s/hxue3/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 model_runner.py:720] Starting to load model /s/hpc-datasets/models/llama-3/Meta-Llama-3.1-405B-Instruct-FP8... [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 selector.py:161] Cannot use FlashAttention-2 backend for FP8 KV cache. [repeated 7x across cluster]
(RayWorkerWrapper pid=8266) INFO 08-19 17:52:29 selector.py:54] Using XFormers backend. [repeated 7x across cluster]
anyscalesam commented 3 months ago

@hxue3 this line looks suspect: ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool. If you try to repro this with a brand new Ray cluster, what does ray status return when it times out?
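
As a quick way to check the same resource picture from the driver, something like the following sketch should work (ray.cluster_resources() and ray.available_resources() are standard Ray APIs; ray status on the head node reports the same information from the CLI):

import ray

ray.init(address="auto")  # attach to the already-running cluster

# Total resources registered with the cluster vs. what is currently unclaimed.
print(ray.cluster_resources())
print(ray.available_resources())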

ggamsso commented 2 months ago

This error is caused by the actor startup timeout. As the model size increases, the time required to download the model from Hugging Face and load it into vLLM also increases. You can avoid the GetTimeoutError by increasing the wait_for_min_actors_s value in the DataContext.

import ray
from ray.data import DataContext

# ray init
runtime_env = {"env_vars": {"HF_TOKEN": "__YOUR_HF_TOKEN__"}}
ray.init(runtime_env=runtime_env)

# data context
ctx = DataContext.get_current()
ctx.wait_for_min_actors_s = 60 * 10 * tensor_parallel_size

Here wait_for_min_actors_s is set to 60 * 10 = 600 seconds and multiplied by tensor_parallel_size. With 8 GPUs that gives 600 * 8 = 4800 seconds, so the actors have up to 80 minutes to download the model and start vLLM.
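
Note that the DataContext tweak has to run in the driver before the dataset pipeline is executed; a minimal sketch (assuming tensor_parallel_size = 8, as in the log above):

import ray
from ray.data import DataContext

tensor_parallel_size = 8  # GPUs per vLLM engine, matching the log above

ray.init()

# Allow roughly 10 minutes per tensor-parallel worker (80 minutes total here)
# for the actor pool to start before Ray Data raises GetTimeoutError.
ctx = DataContext.get_current()
ctx.wait_for_min_actors_s = 60 * 10 * tensor_parallel_size

# ...then build the dataset and run map_batches / take_all exactly as in the
# reproduction script above.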

anyscalesam commented 2 months ago

cc @hxue3 as FYI - kudos @ggamsso !

rkooo567 commented 2 months ago

@hxue3 lmk if this was fixed!

nivibilla commented 1 month ago

I'm not the original poster, but this worked for me when loading a model from S3. It was taking more than 10 minutes for 70B+ models, and increasing the timeout fixed the issue. Thanks!

dhananjaisharma10 commented 2 weeks ago

Hi @rkooo567 !

Do you know if @ggamsso's solution will work if I set this parameter in the driver code after ray.init when my cluster is created with KubeRay? In general, if a cluster is created with KubeRay, should I set this parameter in the Kubernetes config used to create the pod, or would setting it in the driver code work?

Please let me know.