vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How do I enable concurrent inference in vLLM? #7578

Open backtime1 opened 2 months ago

backtime1 commented 2 months ago

Your current environment

PyTorch version: 2.4.0 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: version 3.29.5 Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 12.1.105 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA RTX A6000 GPU 1: NVIDIA RTX A6000 GPU 2: NVIDIA RTX A6000 GPU 3: NVIDIA RTX A6000

Nvidia driver version: 560.28.03 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 43 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 256 On-line CPU(s) list: 0-255 Vendor ID: AuthenticAMD Model name: AMD EPYC 7H12 64-Core Processor CPU family: 23 Model: 49 Thread(s) per core: 2 Core(s) per socket: 64 Socket(s): 2 Stepping: 0 Frequency boost: enabled CPU max MHz: 2600.0000 CPU min MHz: 1500.0000 BogoMIPS: 5190.46 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es Virtualization: AMD-V L1d cache: 4 MiB (128 instances) L1i cache: 4 MiB (128 instances) L2 cache: 64 MiB (128 instances) L3 cache: 512 MiB (32 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-63,128-191 NUMA node1 CPU(s): 64-127,192-255 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection Vulnerability Spec rstack overflow: Mitigation; Safe RET Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected

Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] nvidia-nccl-cu11==2.20.5 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] onnxruntime-gpu==1.16.0 [pip3] pytorch-lightning==2.3.3 [pip3] pyzmq==26.0.3 [pip3] torch==2.4.0 [pip3] torchaudio==2.4.0 [pip3] torchmetrics==1.4.0.post0 [pip3] torchvision==0.19.0 [pip3] transformers==4.43.2 [pip3] triton==2.3.1 [conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch [conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch [conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py311h5eee18b_1
[conda] mkl_fft 1.3.8 py311h5eee18b_0
[conda] mkl_random 1.2.4 py311hdb19cb5_0
[conda] numpy 1.26.4 py311h08b1b3b_0
[conda] numpy-base 1.26.4 py311hf175353_0
[conda] nvidia-nccl-cu11 2.20.5 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] pytorch 2.4.0 py3.11_cuda12.1_cudnn9.1.0_0 pytorch [conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch [conda] pytorch-lightning 2.3.3 pypi_0 pypi [conda] pytorch-mutex 1.0 cuda pytorch [conda] pyzmq 26.0.3 pypi_0 pypi [conda] torch 2.3.1 pypi_0 pypi [conda] torchaudio 2.4.0 py311_cu121 pytorch [conda] torchmetrics 1.4.0.post0 pypi_0 pypi [conda] torchtriton 3.0.0 py311 pytorch [conda] torchvision 0.18.1+cu118 pypi_0 pypi [conda] transformers 4.43.2 pypi_0 pypi [conda] triton 2.3.1 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.5.4@4db5176d9758b720b05460c50ace3c01026eb158 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X NV4 SYS SYS 0-63,128-191 0 N/A GPU1 NV4 X SYS SYS 0-63,128-191 0 N/A GPU2 SYS SYS X NV4 64-127,192-255 1 N/A GPU3 SYS SYS NV4 X 64-127,192-255 1 N/A

Legend:

X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

How do I enable concurrent inference in vLLM?

jeejeelee commented 2 months ago

See: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
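For context: vLLM's engine performs continuous batching, so once the OpenAI-compatible API server is running, concurrent requests from clients are batched together automatically; nothing special has to be "enabled" on the server side. A minimal client-side sketch of issuing requests in parallel is shown below. `fake_request` is a hypothetical placeholder for a real HTTP POST to a locally running server (e.g. `http://localhost:8000/v1/completions`); swap in your own request function.

```python
# Sketch: vLLM batches concurrent requests automatically (continuous
# batching), so client-side concurrency is just many simultaneous calls.
# fan_out() spreads any request function over a thread pool.
from concurrent.futures import ThreadPoolExecutor

def fan_out(request_fn, prompts, max_workers=8):
    """Send prompts concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(request_fn, prompts))

def fake_request(prompt):
    # Hypothetical stand-in for a real call, e.g.:
    # requests.post("http://localhost:8000/v1/completions",
    #               json={"model": "...", "prompt": prompt})
    return f"completion for: {prompt}"

results = fan_out(fake_request, ["hello", "world"])
print(results)
```

For offline (non-server) use, `LLM.generate` accepts a list of prompts directly and batches them internally, which is usually simpler than managing threads yourself.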