vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Low throughput on AMD MI250 using llama 3.1 (6 toks/s) #8698

Closed huberemanuel closed 3 weeks ago

huberemanuel commented 1 month ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
WARNING 09-21 15:25:16 rocm.py:14] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
2024-09-21 15:25:17.467502: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-09-21 15:25:17.520263: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
PyTorch version: 2.6.0.dev20240915+rocm6.1
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.1.40091-a8dbc0c19

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.35

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-121-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI210 (gfx90a:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.1.40091
MIOpen runtime version: 3.1.0
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9124 16-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
Stepping: 1
BogoMIPS: 5999.65
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 32 MiB (32 instances)
L3 cache: 128 MiB (8 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.19.2
[pip3] pytorch-triton-rocm==3.1.0+5fe38ffd73
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==3.1.0
[pip3] torch==2.6.0.dev20240915+rocm6.1
[pip3] torchaudio==2.5.0.dev20240916+rocm6.1
[pip3] torchvision==0.20.0.dev20240916+rocm6.1
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] pytorch-triton-rocm 3.1.0+5fe38ffd73 pypi_0 pypi
[conda] pyzmq 26.2.0 py310h71f11fc_1 conda-forge
[conda] sentence-transformers 3.1.0 pypi_0 pypi
[conda] torch 2.6.0.dev20240915+rocm6.1 pypi_0 pypi
[conda] torchaudio 2.5.0.dev20240916+rocm6.1 pypi_0 pypi
[conda] torchvision 0.20.0.dev20240916+rocm6.1 pypi_0 pypi
[conda] transformers 4.44.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: 6.1.40093-bd86f1708
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@1c1bb388e0d35a2d10da5c5cda2edac57bf62591
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0  GPU1
GPU0   0     72
GPU1   72    0
================================= Hops between two GPUs ==================================
       GPU0  GPU1
GPU0   0     3
GPU1   3     0
=============================== Link Type between two GPUs ===============================
       GPU0  GPU1
GPU0   0     PCIE
GPU1   PCIE  0
======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 1
GPU[1] : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================
```

Model Input Dumps

No response

🐛 Describe the bug

I'm running vllm serve with `vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct --max-num-seqs 1024 --max-num-batched-tokens 1024`. When I send requests to the vLLM API and monitor GPU usage, utilization reaches 100% and memory usage stays above 90%, so the GPU is clearly being used, yet throughput is only 3-6 tok/s. I'm using ROCm 6.1.2 and installed vLLM from source (main, 1c1bb388e0d35a2d10da5c5cda2edac57bf62591). For comparison, ollama with llama3.1 gets 85 toks/s with the same ROCm version.
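For context, the requests are plain chat-completion calls against vLLM's OpenAI-compatible endpoint; a minimal illustrative request (default host and port assumed, not the exact client used here) would look like:

```bash
# Illustrative request against the server started above; adjust host/port to your deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 128
      }'
```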

hongxiayang commented 1 month ago

@huberemanuel There are two things you can try to see whether performance improves: (1) In the default mode, Triton flash attention is used. For benchmarking purposes, it is recommended to run a warm-up step before collecting perf numbers. (2) Alternatively, try CK flash attention: set `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off Triton flash attention, and then compare the numbers.
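For example, a CK flash-attention run could look like this (a sketch that reuses the serve command from the report above; the flag only needs to be set in the server's environment):

```bash
# Turn off the Triton flash-attention path so vLLM falls back to CK flash attention.
export VLLM_USE_TRITON_FLASH_ATTN=0

# Same serve command as in the original report.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --trust-remote-code \
    --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max-num-seqs 1024 --max-num-batched-tokens 1024
```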

In addition, you can turn on TunableOp to improve performance. The steps are below: (1) Enable TunableOp and tuning, and optionally enable verbose mode: `PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 your_command`

(2) Enable TunableOp, disable tuning, and measure: `PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 your_command`
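Concretely, applied to the same serve command, the two phases would look roughly like this (a sketch; TunableOp records the tuned GEMM solutions to a results file that later runs pick up):

```bash
# Phase 1: enable TunableOp with tuning (and verbose output). GEMMs encountered
# during this warm-up run are benchmarked and the best solutions are recorded.
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 \
    vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
        --trust-remote-code --max-num-seqs 1024 --max-num-batched-tokens 1024

# Phase 2: keep TunableOp enabled but disable tuning, then measure throughput.
# The previously recorded solutions are reused without the tuning overhead.
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 \
    vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
        --trust-remote-code --max-num-seqs 1024 --max-num-batched-tokens 1024
```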

Thanks, and please let me know.

hongxiayang commented 3 weeks ago

@huberemanuel Any update on this?

huberemanuel commented 3 weeks ago

Hi! I had some help from @daviswer and I was able to achieve better results. In summary, I used the following versions:

Also, I used the following script:

```bash
export HIP_FORCE_DEV_KERNARG=1              # ROCm kernel-argument optimization
export VLLM_INSTALL_PUNICA_KERNELS=1        # build-time flag for the punica (LoRA) kernels
export VLLM_USE_ROCM_CUSTOM_PAGED_ATTN=1    # use vLLM's custom ROCm paged-attention kernel
export VLLM_USE_TRITON_FLASH_ATTN=0         # disable Triton flash attention (use CK flash attention)
export PYTHONPATH=$PYTHONPATH:/opt/rocm/share/amd_smi   # make the amd_smi Python bindings importable

python3 ./benchmarks/benchmark_throughput.py \
    --dataset=ShareGPT_V3_unfiltered_cleaned_split.json \
    --model=meta-llama/Meta-Llama-3.1-8B-Instruct \
    -tp=1 --dtype=float16
```

With those versions, I was able to achieve the following result:

```text
Processed prompts: 100% 1000/1000 [03:07<00:00,  5.32it/s, est. speed input: 1145.65 toks/s, output: 1055.93 toks/s]
Throughput: 5.31 requests/s, 2196.35 tokens/s
```

hongxiayang commented 3 weeks ago

ok. Can this issue be closed now? @huberemanuel