Closed Stealthwriter closed 5 months ago
Should be fixed in the next release, by #4609.
Thanks, do you know when? or do you have an idea how I can go around this? It's happening when I run lora on a 70b model which is running on 2 GPU, I'm trying to load llama3 70b on a single GPU (a100) but it doesn't seem to work, I run our of vram...
The next release is just around the corner. If you can't wait for that, you can install vLLM from main
branch directly.
Fixed by #4609, which has been released in v0.4.3.
Edit: Technically it is still in pre-release but should be out very soon.
Your current environment
Collecting environment information... PyTorch version: 2.3.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: version 3.29.3 Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime) Python platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 12.1.105 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe GPU 1: NVIDIA A100 80GB PCIe
Nvidia driver version: 545.23.08 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True
CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 252 On-line CPU(s) list: 0-251 Vendor ID: AuthenticAMD Model name: AMD EPYC 7763 64-Core Processor CPU family: 25 Model: 1 Thread(s) per core: 1 Core(s) per socket: 126 Socket(s): 2 Stepping: 1 BogoMIPS: 4890.81 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities Virtualization: AMD-V Hypervisor vendor: KVM Virtualization type: full L1d cache: 15.8 MiB (252 instances) L1i cache: 15.8 MiB (252 instances) L2 cache: 126 MiB (252 instances) L3 cache: 3.9 GiB (252 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-251 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected
Versions of relevant libraries: [pip3] numpy==1.26.3 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] torch==2.3.0 [pip3] torchaudio==2.2.0 [pip3] torchvision==0.17.0 [pip3] triton==2.3.0 [pip3] vllm_nccl_cu12==2.18.1.0.4.0 [conda] Could not collectROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.4.2 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X PHB 0-251 0 N/A GPU1 PHB X 0-251 0 N/A
Legend:
X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
I'm trying to run llama 3 70 b instruct with lora,
this is my start command:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --enable-lora --lora-modules stealthlora=./lora --max-lora-rank 64 --tensor-parallel-size 2
But I'm getting this error on inference:
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 146, in execute_method raise e File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 137, in execute_method return executor(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in execute_model output = self.model_runner.execute_model(seq_group_metadata_list, File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 790, in execute_model self.set_active_loras(lora_requests, lora_mapping) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 901, in set_active_loras self.lora_manager.set_active_loras(lora_requests, lora_mapping) File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 113, in set_active_loras self._apply_loras(lora_requests) File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 235, in _apply_loras self.add_lora(lora) File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 250, in add_lora self._lora_manager.activate_lora(lora_request.lora_int_id) File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 615, in activate_lora result = super().activate_lora(lora_id) File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 355, in activate_lora module.set_lora(index, module_lora.lora_a, module_lora.lora_b, File "/usr/local/lib/python3.10/dist-packages/vllm/lora/layers.py", line 800, in set_lora lora_b = self.slice_lora_b(lora_b) File "/usr/local/lib/python3.10/dist-packages/vllm/lora/layers.py", line 786, in slice_lora_b lora_b = [lora_b_q, lora_b_k, lora_b_v] UnboundLocalError: local variable 'lora_b_k' referenced before assignment
What could be the issue? I tried with mistral 7b lora, its working fine, but with llama3 70 b I get this error.