vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Loading LoRA is super slow when using tensor parallel #6072

Closed: markovalexander closed this issue 3 months ago

markovalexander commented 3 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.17-2136.321.4.el7uek.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 525.125.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          256
On-line CPU(s) list:             0-254
Off-line CPU(s) list:            255
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 7J13 64-Core Processor
CPU family:                      25
Model:                           1
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
Stepping:                        1
Frequency boost:                 enabled
CPU max MHz:                     2550.0000
CPU min MHz:                     0.0000
BogoMIPS:                        4899.55
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local nt_good clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization:                  AMD-V
L1d cache:                       4 MiB (128 instances)
L1i cache:                       4 MiB (128 instances)
L2 cache:                        64 MiB (128 instances)
L3 cache:                        512 MiB (16 instances)
NUMA node(s):                    8
NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-254
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   NIC12   NIC13   NIC14   NIC15   NIC16   NIC17   CPU Affinity    NUMA Affinity
GPU0     X  NV12    NV12    NV12    SYS SYS SYS SYS SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191   3
GPU1    NV12     X  NV12    NV12    SYS SYS SYS SYS SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191   3
GPU2    NV12    NV12     X  NV12    SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 16-31,144-159   1
GPU3    NV12    NV12    NV12     X  SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS 80-95,208-223   5
NIC0    SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC1    SYS SYS PXB SYS SYS  X  PIX PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC2    SYS SYS PXB SYS SYS PIX  X  PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC3    SYS SYS PXB SYS SYS PXB PXB  X  PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC4    SYS SYS PXB SYS SYS PXB PXB PIX  X  SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC5    PXB PXB SYS SYS SYS SYS SYS SYS SYS  X  PIX PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC6    PXB PXB SYS SYS SYS SYS SYS SYS SYS PIX  X  PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC7    PXB PXB SYS SYS SYS SYS SYS SYS SYS PXB PXB  X  PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC8    PXB PXB SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX  X  SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC9    SYS SYS SYS PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  PIX PXB PXB SYS SYS SYS SYS SYS
NIC10   SYS SYS SYS PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX  X  PXB PXB SYS SYS SYS SYS SYS
NIC11   SYS SYS SYS PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB  X  PIX SYS SYS SYS SYS SYS
NIC12   SYS SYS SYS PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX  X  SYS SYS SYS SYS SYS
NIC13   SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS
NIC14   SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  PIX PXB PXB
NIC15   SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX  X  PXB PXB
NIC16   SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB  X  PIX
NIC17   SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12
  NIC13: mlx5_13
  NIC14: mlx5_14
  NIC15: mlx5_15
  NIC16: mlx5_16
  NIC17: mlx5_17

🐛 Describe the bug

Built the OpenAI Docker image from https://github.com/vllm-project/vllm/releases/tag/v0.5.0.post1

Started vLLM for the llama-2-70b model with LoRA support and tensor-parallel=4. The first LoRA request takes more than a minute. The problem is in this function: it is very fast on the first process and very slow (>40 seconds) on all other processes. Here is the log output:

INFO 07-02 14:44:43 async_llm_engine.py:564] Received request cmpl-007f58184aa54c3fb54155c3ad325b95-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 3087, 8970, 338, 263], lora_request: LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None).
INFO:     172.17.0.1:56466 - "POST /v1/completions HTTP/1.1" 200 OK
DEBUG 07-02 14:44:43 async_llm_engine.py:525] Waiting for new requests...
DEBUG 07-02 14:44:43 async_llm_engine.py:527] Got new requests!
WARNING 07-02 14:44:43 tokenizer.py:142] No tokenizer found in /adapters/2, using base model tokenizer instead. (Exception: /adapters/2 does not appear to have a file named config.json. Checkout 'https://huggingface.co//adapters/2/tree/None' for available files.)
INFO 07-02 14:44:43 worker_manager.py:165] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None)
(VllmWorkerProcess pid=14156) INFO 07-02 14:44:43 worker_manager.py:165] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None)
(VllmWorkerProcess pid=14157) INFO 07-02 14:44:43 worker_manager.py:165] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None)
(VllmWorkerProcess pid=14155) INFO 07-02 14:44:43 worker_manager.py:165] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None)
INFO 07-02 14:44:43 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-02 14:44:46 worker_manager.py:188] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None) finished!
DEBUG 07-02 14:44:46 models.py:749] Adding lora. Model id: 1, int id: 1, scaling factor: None
DEBUG 07-02 14:44:46 models.py:459] Activating LoRA. int id: 1, slot index: 0
INFO 07-02 14:44:53 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-02 14:45:03 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-02 14:45:13 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=14155) INFO 07-02 14:45:23 worker_manager.py:188] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None) finished!
(VllmWorkerProcess pid=14157) INFO 07-02 14:45:23 worker_manager.py:188] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None) finished!
(VllmWorkerProcess pid=14155) DEBUG 07-02 14:45:23 models.py:749] Adding lora. Model id: 1, int id: 1, scaling factor: None
(VllmWorkerProcess pid=14156) INFO 07-02 14:45:23 worker_manager.py:188] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None) finished!
(VllmWorkerProcess pid=14157) DEBUG 07-02 14:45:23 models.py:749] Adding lora. Model id: 1, int id: 1, scaling factor: None
(VllmWorkerProcess pid=14156) DEBUG 07-02 14:45:23 models.py:749] Adding lora. Model id: 1, int id: 1, scaling factor: None
(VllmWorkerProcess pid=14155) DEBUG 07-02 14:45:23 models.py:459] Activating LoRA. int id: 1, slot index: 0
(VllmWorkerProcess pid=14157) DEBUG 07-02 14:45:23 models.py:459] Activating LoRA. int id: 1, slot index: 0
(VllmWorkerProcess pid=14156) DEBUG 07-02 14:45:23 models.py:459] Activating LoRA. int id: 1, slot index: 0
INFO 07-02 14:45:23 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-02 14:45:25 async_llm_engine.py:133] Finished request cmpl-007f58184aa54c3fb54155c3ad325b95-0.
DEBUG 07-02 14:45:25 async_llm_engine.py:525] Waiting for new requests...

As you can see, all four processes start loading the LoRA at 14:44:43, but only the first one finishes at 14:44:46; the other three finish only at 14:45:23. What is the problem?

DarkLight1337 commented 3 months ago

You may want to refer to this troubleshooting guide.

sampritipanda commented 3 months ago

I think you may be able to fix this by changing the following line to device=self.device:

https://github.com/vllm-project/vllm/blob/7c008c51a9aa4c0d53d09ab3a1ba61ecec354565/vllm/lora/worker_manager.py#L180
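For illustration only, a minimal sketch of the kind of change being suggested, assuming the linked line passes device="cpu" into from_local_checkpoint; argument names are approximated from context, not copied verbatim from the repo:

# vllm/lora/worker_manager.py -- sketch of the suggested edit, not verbatim source
lora = self._lora_model_cls.from_local_checkpoint(
    lora_request.lora_local_path,
    expected_lora_modules,
    lora_model_id=lora_request.lora_int_id,
    # device="cpu",        # original: stage the adapter tensors on CPU first
    device=self.device,    # suggestion: load directly onto this worker's GPU
    dtype=self.lora_config.lora_dtype,
)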

Be aware, though, that there is some instability. I was facing crashes in the custom all-reduce CUDA kernels when trying to use LoRAs of rank 2, but it worked fine with a LoRA of rank 8.

Another option is to disable memory pinning here:

https://github.com/vllm-project/vllm/blob/main/vllm/lora/models.py#L219

Just change this line to pin_memory = False.
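Again only as a rough sketch, assuming the linked line derives pin_memory from the target device; not verbatim source:

# vllm/lora/models.py -- sketch, not verbatim source
# pin_memory = str(device) == "cpu" and is_pin_memory_available()  # original behaviour
pin_memory = False  # suggestion: skip pinning the LoRA tensors in host memory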

markovalexander commented 3 months ago

Thanks for your suggestions @sampritipanda! I tried them, but none of them helped in my case.

What did help, though, was adding device='cuda:0' in this line. I understand that this could cause a slight memory overhead on the first process, but I am not sure that is actually the case, since self.device is just cuda rather than cuda:PROCESS_IDX, meaning that later in the code the LoRA is still moved to device 0 and then sharded across all processes.

I also set the OMP_NUM_THREADS env variable to 15 in all processes (an arbitrary number); that also slightly improved model and LoRA loading speed.

tjohnson31415 commented 3 months ago

Hello. I have been investigating this behavior as well. I believe the root cause of the slowdown is CPU contention and throttling, particularly in an environment like Kubernetes where containers have CPU limits.

Setting OMP_NUM_THREADS to a low number (like 1 or 2) was the solution I found: it reduced the load time for the LoRA adapter I was testing with from 2 minutes to 3.5 seconds with TP=4.

My test environment was Kubernetes on a node with 80 total cores, with CPU requests set to 8 and CPU limits set to 16. Without OMP_NUM_THREADS, I think each shard would spawn 80 threads and they would fight for the CPU. Watching the load with top, I noticed CPU time exceeding 16, which would result in throttling on top of context-switching costs.
My test env was in Kubernetes on a node with 80 total cores, cpu requests set to 8 and cpu limits set to 16. Without OMP_NUM_THREADS, I think each shard would spawn 80 threads and they'd fight for the CPU. Watching the load with top I noticed CPU time exceeding 16, which would result in throttling on top of context switching costs.