vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: I want to run Llama 3.1 405B using speculative decoding. Can you give me a guide? #7456

Open Archmilio opened 1 month ago

Archmilio commented 1 month ago

Anything you want to discuss about vllm.

I am trying to run a serving performance test using pipeline parallelism, with the Llama 3.1 405B model as the target and the 8B model as the draft model for speculative decoding, but the server crashes right after the model is loaded. Could you please guide me on this issue?

$ vllm serve /models/Meta-Llama-3.1-405B-Instruct -tp 8 -pp 2 --speculative-model /models/Meta-Llama-3.1-8B-Instruct --use-v2-block-manager --num-speculative-tokens 5

Logs:

mngc-001:7118:7118 [0] NCCL INFO comm 0x134fcb60 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1b000 commId 0x2d7e114b994d6072 - Init COMPLETE
rank0: Traceback (most recent call last):
rank0:   File "/usr/local/bin/vllm", line 8, in <module>
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 149, in main
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 29, in serve
rank0:   File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
rank0:     return loop.run_until_complete(main)
rank0:   File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
rank0:     return future.result()
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 289, in run_server
rank0:     app = await init_app(args, llm_engine)
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 229, in init_app
rank0:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 464, in from_engine_args
rank0:     engine = cls(
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
rank0:     self.engine = self._init_engine(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 545, in _init_engine
rank0:     return engine_class(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
rank0:     self.model_executor = executor_class(
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 405, in __init__
rank0:     super().__init__(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
rank0:     super().__init__(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 62, in _init_executor
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 232, in _init_workers_ray
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 349, in _run_workers
rank0:     self.driver_worker.execute_method(method, *driver_args,
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 383, in execute_method
rank0:     raise e
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
rank0:     return executor(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 261, in init_device
rank0:   File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 285, in _configure_model_sampler_for_spec_decode
rank0:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1709, in __getattr__
rank0:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
rank0: AttributeError: 'LlamaForCausalLM' object has no attribute 'sampler'. Did you mean: 'sample'?

[followed by repeated RayWorkerWrapper NCCL INFO lines from all ranks across the cluster (bootstrap, IB/NVLS setup, P2P/IPC channel connections, Init COMPLETE, "vLLM is using nccl==2.20.5"), then:]

(RayWorkerWrapper pid=1853, ip=70.227.56.3) INFO 08-02 13:06:08 model_runner.py:720] Starting to load model /models/Meta-Llama-3.1-405B-Instruct...
[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

cadedaniel commented 1 month ago

If you can run with the FP8 weights instead, it will work. Otherwise, https://github.com/vllm-project/vllm/issues/6911 needs to be completed first.
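For illustration, a sketch of what that could look like (the FP8 checkpoint path below is an assumption, not a verified path; the point is that with FP8 weights the 405B target can run with tensor parallelism alone, so the -pp flag, which is what speculative decoding cannot handle yet, is dropped):

$ vllm serve /models/Meta-Llama-3.1-405B-Instruct-FP8 -tp 8 \
    --speculative-model /models/Meta-Llama-3.1-8B-Instruct \
    --num-speculative-tokens 5 --use-v2-block-manager

The draft model and speculative-decoding flags are unchanged from the original command; only the target model and the parallelism layout differ.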