vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Error with Multi Node llama 405B inference #6938

Closed · nivibilla closed this 2 weeks ago

nivibilla commented 1 month ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1064-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             48
On-line CPU(s) list:                0-47
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R32
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           5599.74
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          768 KiB (24 instances)
L1i cache:                          768 KiB (24 instances)
L2 cache:                           12 MiB (24 instances)
L3 cache:                           96 MiB (6 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.12.1
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.3.1+cu121
[pip3] torcheval==0.0.7
[pip3] torchvision==0.18.1+cu121
[pip3] transformers==4.43.3
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  PHB PHB PHB 0-47    0       N/A
GPU1    PHB  X  PHB PHB 0-47    0       N/A
GPU2    PHB PHB  X  PHB 0-47    0       N/A
GPU3    PHB PHB PHB  X  0-47    0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

I am trying to run a multi-node server for Llama 3.1 405B across 16 nodes, each with 4 A10 GPUs, on Databricks. I first start a Ray cluster and then try to launch the server with a tensor parallel size of 64.

The driver node is also of the same type, g5.12xlarge (4x A10G).
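
For context, the setup is roughly this (a minimal sketch; `<head-node-ip>` and `<model-path>` are placeholders for the actual Ray head address and the DBFS model path, not the exact values used here):

```bash
# On the Databricks driver node: start the Ray head process
ray start --head --port=6379

# On each of the 16 worker nodes: join the Ray cluster
ray start --address=<head-node-ip>:6379

# Back on the driver: launch the OpenAI-compatible server with TP=64
vllm serve <model-path> \
  --tensor-parallel-size 64 \
  --max-model-len 8096 \
  --served-model-name llama-3.1-405b-instruct
```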

Getting this error:

INFO 07-30 09:23:58 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 07-30 09:23:58 api_server.py:220] args: Namespace(model_tag='/dbfs/mnt/dna_pai_tvc/nbilla/llm_model_dump/meta-llama/Meta-Llama-3.1-405B-Instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/dbfs/mnt/dna_pai_tvc/nbilla/llm_model_dump/meta-llama/Meta-Llama-3.1-405B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=64, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=1, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3.1-405b-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f499e698ea0>)
INFO 07-30 09:23:58 config.py:715] Defaulting to use ray for distributed inference
2024-07-30 09:23:58,467 INFO worker.py:1429 -- Using address 100.70.57.94:9322 set in the environment variable RAY_ADDRESS
2024-07-30 09:23:58,468 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 100.70.57.94:9322...
2024-07-30 09:23:58,475 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at 100.70.57.94:9106 
INFO 07-30 09:23:59 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/dbfs/mnt/dna_pai_tvc/nbilla/llm_model_dump/meta-llama/Meta-Llama-3.1-405B-Instruct', speculative_config=None, tokenizer='/dbfs/mnt/dna_pai_tvc/nbilla/llm_model_dump/meta-llama/Meta-Llama-3.1-405B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=64, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=llama-3.1-405b-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/scripts.py", line 148, in main
    args.dispatch_function(args)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/scripts.py", line 28, in serve
    run_server(args)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
    if llm_engine is not None else AsyncLLMEngine.from_engine_args(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
    engine = cls(
             ^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 406, in __init__
    super().__init__(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 61, in _init_executor
    self._init_workers_ray(placement_group)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 148, in _init_workers_ray
    raise ValueError(
ValueError: Ray does not allocate any GPUs on the driver node. Consider adjusting the Ray placement group or running the driver on a GPU node.
Exception ignored in: <function RayGPUExecutorAsync.__del__ at 0x7f4987fb7ec0>
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 473, in __del__
    if self.forward_dag is not None:
       ^^^^^^^^^^^^^^^^
AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'
zJuuu commented 1 month ago

Hello, please try to set --tensor-parallel-size 4 --pipeline-parallel-size 16
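
For reference, that would look roughly like this (same `<model-path>` placeholder as above; with pipeline parallelism the engine runs through the Ray backend):

```bash
vllm serve <model-path> \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 16 \
  --distributed-executor-backend ray
```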

youkaichao commented 1 month ago

There is a limitation of Ray: you must use all GPUs in the cluster.

the driver is also of the same type g5.12x (4xA10)

The problem is that you also have a driver node, so you have 68 GPUs in total (16 worker nodes × 4 GPUs = 64, plus the driver's 4), while tensor parallel size 64 only requests 64 of them.

nivibilla commented 1 month ago

Ah I see. Thanks @youkaichao I will try using the exact number of GPUs.

Dineshkumar-Anandan-ZS0367 commented 3 weeks ago

Did you fix this issue?

nivibilla commented 2 weeks ago

Not this one specifically, but I did manage to run Llama 3 70B on 8 nodes, using exactly 64 GPUs.

youkaichao commented 2 weeks ago

@nivibilla just curious, why did you use 64 GPUs to run Llama 3 70B? It's not resource-efficient at all. I don't think you can get any benefit.

nivibilla commented 2 weeks ago

@youkaichao with pipeline parallelism you get pretty good speedups.

I used tp=8 and pp=8.

The most efficient way would be to stand up 8 standalone clusters and load-balance across them, but TP plus PP is good enough for me since it's offline inference.

I get about 6x the throughput for 8x the compute, so not too bad.
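
For reference, the launch was roughly the following (model path is a placeholder and the exact flags may have differed slightly):

```bash
vllm serve <llama-3-70b-path> \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 8 \
  --distributed-executor-backend ray
```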

youkaichao commented 2 weeks ago

Good to know. It works if you just want throughput and have enough requests to saturate the pipeline.

nivibilla commented 2 weeks ago

INFO 08-22 17:52:17 model_runner.py:720] Starting to load model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8/...
(RayWorkerWrapper pid=364392, ip=10.168.80.7) INFO 08-22 17:52:17 model_runner.py:720] Starting to load model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8/...
Loading safetensors checkpoint shards:   0% Completed | 0/86 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   1% Completed | 1/86 [00:00<00:09,  8.91it/s]
Loading safetensors checkpoint shards:   2% Completed | 2/86 [00:00<00:11,  7.01it/s]
Loading safetensors checkpoint shards:   3% Completed | 3/86 [00:00<00:12,  6.42it/s]
Loading safetensors checkpoint shards:   5% Completed | 4/86 [00:00<00:13,  6.29it/s]
Loading safetensors checkpoint shards:   6% Completed | 5/86 [00:59<28:59, 21.48s/it]
Loading safetensors checkpoint shards:   7% Completed | 6/86 [02:27<58:37, 43.97s/it]
Loading safetensors checkpoint shards:   8% Completed | 7/86 [05:06<1:47:12, 81.42s/it]
Loading safetensors checkpoint shards:   9% Completed | 8/86 [06:32<1:47:53, 83.00s/it]
Loading safetensors checkpoint shards:  10% Completed | 9/86 [09:28<2:23:41, 111.97s/it]
Loading safetensors checkpoint shards:  12% Completed | 10/86 [11:08<2:17:15, 108.36s/it]
Loading safetensors checkpoint shards:  13% Completed | 11/86 [14:11<2:43:56, 131.15s/it]
Loading safetensors checkpoint shards:  14% Completed | 12/86 [15:41<2:26:26, 118.73s/it]
Loading safetensors checkpoint shards:  15% Completed | 13/86 [18:39<2:46:08, 136.56s/it]
Loading safetensors checkpoint shards:  16% Completed | 14/86 [20:17<2:29:59, 124.99s/it]
Loading safetensors checkpoint shards:  17% Completed | 15/86 [23:18<2:48:00, 141.98s/it]
Loading safetensors checkpoint shards:  19% Completed | 16/86 [24:50<2:27:56, 126.81s/it]
Loading safetensors checkpoint shards:  20% Completed | 17/86 [27:58<2:47:09, 145.35s/it]
Loading safetensors checkpoint shards:  21% Completed | 18/86 [29:25<2:24:51, 127.82s/it]
Loading safetensors checkpoint shards:  22% Completed | 19/86 [32:21<2:38:56, 142.33s/it]
Loading safetensors checkpoint shards:  23% Completed | 20/86 [33:50<2:18:51, 126.24s/it]
Loading safetensors checkpoint shards:  24% Completed | 21/86 [36:34<2:29:07, 137.65s/it]
Loading safetensors checkpoint shards:  26% Completed | 22/86 [37:57<2:09:07, 121.06s/it]
Loading safetensors checkpoint shards:  27% Completed | 23/86 [41:01<2:27:01, 140.03s/it]
Loading safetensors checkpoint shards:  28% Completed | 24/86 [42:37<2:10:53, 126.66s/it]
(RayWorkerWrapper pid=382632, ip=10.168.85.146) INFO 08-22 18:37:41 model_runner.py:732] Loading model weights took 6.4997 GB
(RayWorkerWrapper pid=406657, ip=10.168.90.165) INFO 08-22 17:52:16 utils.py:841] Found nccl from library libnccl.so.2 [repeated 62x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=406657, ip=10.168.90.165) INFO 08-22 17:52:16 pynccl.py:63] vLLM is using nccl==2.22.3 [repeated 62x across cluster]
(RayWorkerWrapper pid=406657, ip=10.168.90.165) WARNING 08-22 17:52:17 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes. [repeated 62x across cluster]
(RayWorkerWrapper pid=408527, ip=10.168.88.194) INFO 08-22 17:52:17 model_runner.py:720] Starting to load model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8/... [repeated 62x across cluster]
(RayWorkerWrapper pid=404707, ip=10.168.90.165) INFO 08-22 18:37:46 model_runner.py:732] Loading model weights took 6.4997 GB [repeated 21x across cluster]
Loading safetensors checkpoint shards:  29% Completed | 25/86 [45:28<2:22:33, 140.22s/it]
(RayWorkerWrapper pid=368011, ip=10.168.94.39) INFO 08-22 18:37:55 model_runner.py:732] Loading model weights took 6.4997 GB [repeated 27x across cluster]
Loading safetensors checkpoint shards:  30% Completed | 26/86 [46:58<2:04:58, 124.98s/it]
Loading safetensors checkpoint shards:  31% Completed | 27/86 [49:58<2:19:05, 141.45s/it]
Loading safetensors checkpoint shards:  33% Completed | 28/86 [51:20<1:59:32, 123.67s/it]
Loading safetensors checkpoint shards:  34% Completed | 29/86 [54:10<2:10:46, 137.66s/it]
Loading safetensors checkpoint shards:  35% Completed | 30/86 [55:39<1:54:49, 123.03s/it]
Loading safetensors checkpoint shards:  36% Completed | 31/86 [58:35<2:07:18, 138.88s/it]
Loading safetensors checkpoint shards:  37% Completed | 32/86 [59:57<1:49:31, 121.70s/it]
Loading safetensors checkpoint shards:  38% Completed | 33/86 [1:02:48<2:00:48, 136.76s/it]
Loading safetensors checkpoint shards:  40% Completed | 34/86 [1:04:11<1:44:23, 120.46s/it]
Loading safetensors checkpoint shards:  41% Completed | 35/86 [1:06:54<1:53:18, 133.30s/it]
Loading safetensors checkpoint shards:  42% Completed | 36/86 [1:08:29<1:41:23, 121.67s/it]
Loading safetensors checkpoint shards:  43% Completed | 37/86 [1:11:35<1:55:20, 141.23s/it]
Loading safetensors checkpoint shards:  44% Completed | 38/86 [1:12:55<1:38:04, 122.60s/it]
Loading safetensors checkpoint shards:  45% Completed | 39/86 [1:15:52<1:48:50, 138.94s/it]
Loading safetensors checkpoint shards:  47% Completed | 40/86 [1:17:16<1:34:03, 122.68s/it]
Loading safetensors checkpoint shards:  48% Completed | 41/86 [1:20:18<1:45:10, 140.23s/it]
Loading safetensors checkpoint shards:  49% Completed | 42/86 [1:21:49<1:32:06, 125.61s/it]
Loading safetensors checkpoint shards:  50% Completed | 43/86 [1:24:59<1:43:50, 144.89s/it]
Loading safetensors checkpoint shards:  51% Completed | 44/86 [1:26:26<1:29:14, 127.49s/it]
Loading safetensors checkpoint shards:  52% Completed | 45/86 [1:29:19<1:36:29, 141.22s/it]
Loading safetensors checkpoint shards:  53% Completed | 46/86 [1:31:04<1:26:49, 130.23s/it]
Loading safetensors checkpoint shards:  55% Completed | 47/86 [1:33:55<1:32:36, 142.47s/it]
Loading safetensors checkpoint shards:  56% Completed | 48/86 [1:35:34<1:22:01, 129.51s/it]
Loading safetensors checkpoint shards:  57% Completed | 49/86 [1:38:19<1:26:28, 140.24s/it]
Loading safetensors checkpoint shards:  58% Completed | 50/86 [1:39:48<1:14:54, 124.84s/it]
Loading safetensors checkpoint shards:  59% Completed | 51/86 [1:42:40<1:21:07, 139.07s/it]
Loading safetensors checkpoint shards:  60% Completed | 52/86 [1:44:03<1:09:11, 122.11s/it]
Loading safetensors checkpoint shards:  62% Completed | 53/86 [1:47:11<1:18:01, 141.87s/it]
Loading safetensors checkpoint shards:  63% Completed | 54/86 [1:48:32<1:05:51, 123.49s/it]
Loading safetensors checkpoint shards:  64% Completed | 55/86 [1:51:26<1:11:38, 138.66s/it]
Loading safetensors checkpoint shards:  65% Completed | 56/86 [1:52:47<1:00:42, 121.42s/it]
Loading safetensors checkpoint shards:  66% Completed | 57/86 [1:55:38<1:05:49, 136.20s/it]
Loading safetensors checkpoint shards:  67% Completed | 58/86 [1:57:08<57:05, 122.34s/it]
Loading safetensors checkpoint shards:  69% Completed | 59/86 [2:00:04<1:02:18, 138.48s/it]
Loading safetensors checkpoint shards:  70% Completed | 60/86 [2:01:49<55:38, 128.40s/it]
Loading safetensors checkpoint shards:  71% Completed | 61/86 [2:04:18<56:07, 134.70s/it]
Loading safetensors checkpoint shards:  72% Completed | 62/86 [2:05:22<45:26, 113.59s/it]
Loading safetensors checkpoint shards:  73% Completed | 63/86 [2:08:11<49:55, 130.25s/it]
Loading safetensors checkpoint shards:  74% Completed | 64/86 [2:09:45<43:42, 119.21s/it]
Loading safetensors checkpoint shards:  76% Completed | 65/86 [2:12:26<46:06, 131.74s/it]
Loading safetensors checkpoint shards:  77% Completed | 66/86 [2:13:59<40:02, 120.11s/it]
Loading safetensors checkpoint shards:  78% Completed | 67/86 [2:16:57<43:33, 137.53s/it]
Loading safetensors checkpoint shards:  79% Completed | 68/86 [2:18:27<36:56, 123.14s/it]
Loading safetensors checkpoint shards:  80% Completed | 69/86 [2:21:34<40:21, 142.44s/it]
Loading safetensors checkpoint shards:  81% Completed | 70/86 [2:23:11<34:21, 128.86s/it]
Loading safetensors checkpoint shards:  83% Completed | 71/86 [2:26:09<35:52, 143.48s/it]
Loading safetensors checkpoint shards:  84% Completed | 72/86 [2:27:31<29:09, 124.98s/it]
Loading safetensors checkpoint shards:  85% Completed | 73/86 [2:30:34<30:54, 142.62s/it]
Loading safetensors checkpoint shards:  86% Completed | 74/86 [2:32:04<25:22, 126.86s/it]
Loading safetensors checkpoint shards:  87% Completed | 75/86 [2:35:03<26:04, 142.22s/it]
Loading safetensors checkpoint shards:  88% Completed | 76/86 [2:36:31<21:00, 126.08s/it]
Loading safetensors checkpoint shards:  90% Completed | 77/86 [2:39:46<22:00, 146.74s/it]
Loading safetensors checkpoint shards:  91% Completed | 78/86 [2:41:18<17:22, 130.32s/it]
Loading safetensors checkpoint shards:  92% Completed | 79/86 [2:44:13<16:46, 143.80s/it]
Loading safetensors checkpoint shards:  93% Completed | 80/86 [2:45:47<12:53, 128.90s/it]
Loading safetensors checkpoint shards:  94% Completed | 81/86 [2:49:09<12:33, 150.63s/it]
Loading safetensors checkpoint shards:  95% Completed | 82/86 [2:50:35<08:45, 131.49s/it]
Loading safetensors checkpoint shards:  97% Completed | 83/86 [2:53:34<07:16, 145.59s/it]
Loading safetensors checkpoint shards:  98% Completed | 84/86 [2:55:11<04:22, 131.04s/it]
Loading safetensors checkpoint shards:  99% Completed | 85/86 [2:57:57<02:21, 141.57s/it]
Loading safetensors checkpoint shards: 100% Completed | 86/86 [2:58:00<00:00, 99.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 86/86 [2:58:00<00:00, 124.19s/it]

(RayWorkerWrapper pid=886068) INFO 08-22 20:50:18 model_runner.py:732] Loading model weights took 6.4997 GB [repeated 8x across cluster]
INFO 08-22 20:50:18 model_runner.py:732] Loading model weights took 6.4997 GB
INFO 08-22 20:52:08 distributed_gpu_executor.py:56] # GPU blocks: 11593, # CPU blocks: 4161
WARNING 08-22 20:52:16 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-22 20:52:16 launcher.py:14] Available routes are:
INFO 08-22 20:52:16 launcher.py:22] Route: /openapi.json, Methods: GET, HEAD
INFO 08-22 20:52:16 launcher.py:22] Route: /docs, Methods: GET, HEAD
INFO 08-22 20:52:16 launcher.py:22] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-22 20:52:16 launcher.py:22] Route: /redoc, Methods: GET, HEAD
INFO 08-22 20:52:16 launcher.py:22] Route: /health, Methods: GET
INFO 08-22 20:52:16 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-22 20:52:16 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-22 20:52:16 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-22 20:52:16 launcher.py:22] Route: /version, Methods: GET
INFO 08-22 20:52:16 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-22 20:52:16 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-22 20:52:16 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [883391]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1234/ (Press CTRL+C to quit)
INFO 08-22 20:52:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:52:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:52:46 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:52:56 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:53:06 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:53:16 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:53:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:53:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:53:46 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:53:56 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:54:06 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:54:16 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:54:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:54:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:54:46 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:54:56 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:55:06 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:55:16 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:55:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:55:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:55:46 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:55:56 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:56:06 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:56:16 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:56:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:56:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:56:46 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:56:56 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:57:06 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:57:16 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:57:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:57:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:57:46 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:57:56 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:58:06 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-22 20:58:13 logger.py:36] Received request chat-0ebfc35455b54c299145ac43d8e221c6: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_

*** WARNING: max output size exceeded, skipping output. ***

INFO:     10.168.82.253:36612 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/local_disk0/.ephemeral_nfs/envs/pyth```
nivibilla commented 2 weeks ago

Took 3 hours to load and then failed at the first request :(

youkaichao commented 2 weeks ago

@nivibilla I think you should open a new issue for the fp8 model.

youkaichao commented 2 weeks ago

I'm closing this issue because this bug:

ValueError: Ray does not allocate any GPUs on the driver node. Consider adjusting the Ray placement group or running the driver on a GPU node.

is fixed by https://github.com/vllm-project/vllm/pull/7584.

Your cluster can have, say, 10 nodes, and you can use only 2 nodes for vllm now.
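
For example (hypothetical numbers): on a 10-node cluster with 4 GPUs per node, a server occupying only 2 of those nodes could be launched with something like the sketch below.

```bash
# Uses 4 (TP) x 2 (PP) = 8 GPUs out of the cluster's 40
vllm serve <model-path> \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray
```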