vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Cannot load lora adapters in WSL 2 #3891

Closed: invokeinnovation closed this issue 4 months ago

invokeinnovation commented 6 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2070 SUPER
Nvidia driver version: 531.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 5600X 6-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 0
BogoMIPS: 7386.11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat umip vaes vpclmulqdq rdpid
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 3 MiB (6 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] torchaudio==2.1.2
[pip3] torchvision==0.16.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity
GPU0   X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 157, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
    engine = cls(
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 422, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 40, in __init__
    self._init_cache()
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 80, in _init_cache
    self.driver_worker.profile_num_available_blocks(
  File "/home/tony/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 131, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/home/tony/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 742, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/home/tony/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 663, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/model_executor/models/gemma.py", line 332, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/model_executor/models/gemma.py", line 275, in forward
    hidden_states, residual = layer(
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/model_executor/models/gemma.py", line 221, in forward
    hidden_states = self.self_attn(
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/model_executor/models/gemma.py", line 168, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/lora/layers.py", line 395, in forward
    output_parallel = self.apply_weights(input_, bias)
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/lora/layers.py", line 750, in apply_weights
    _apply_lora_packed_nslice(
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/lora/layers.py", line 97, in _apply_lora_packed_nslice
    add_lora_slice(output, x, lora_a_stacked[slice_idx],
  File "/home/tony/.local/lib/python3.10/site-packages/vllm/lora/punica.py", line 146, in add_lora_slice
    buffer = torch.zeros((x.size(0), r),
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

invokeinnovation commented 6 months ago

I took out "--enable-lora" and the base model and LoRA adapters loaded, but once I query the model I get a 400 Bad Request:

openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "Got lora_request LoRARequest(lora_name='lora_model', lora_int_id=1, lora_local_path='/mnt/c/Users/tony/Downloads/AI_research/offline/Lora_Adapter') but LoRA is not enabled!", 'type': 'BadRequestError', 'param': None, 'code': 400}
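For reference, "LoRA is not enabled!" means the engine itself was created without LoRA support, so any LoRARequest is rejected at request time. A minimal sketch of the same constraint using vLLM's offline API (the base model name is an assumption, only the adapter name, ID, and path are taken from the error above):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora must be set when the engine is built; it cannot be turned on per request.
llm = LLM(
    model="google/gemma-7b",  # placeholder base model; the traceback suggests a Gemma model
    enable_lora=True,
    dtype="half",
)

outputs = llm.generate(
    ["Hello"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest(
        "lora_model", 1,
        "/mnt/c/Users/tony/Downloads/AI_research/offline/Lora_Adapter"),
)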

kratorado commented 6 months ago

I'm running into the same problem.

invokeinnovation commented 6 months ago

I had to merge the adapters into the base model, creating new model safetensors, and that worked. I think the issue may be with loading the adapters onto the base model while using dtype=half. Are you using the dtype=half flag?
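For reference, a rough sketch of that merge workaround, assuming a PEFT-format adapter; the model name and paths are placeholders, not taken from the issue:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, attach the LoRA adapter, and bake the weights in.
base = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "/path/to/Lora_Adapter").merge_and_unload()

# Save the merged model as safetensors plus the tokenizer, then serve it with vLLM
# as a plain model (no --enable-lora needed).
merged.save_pretrained("gemma-7b-merged", safe_serialization=True)
AutoTokenizer.from_pretrained("google/gemma-7b").save_pretrained("gemma-7b-merged")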

kratorado commented 6 months ago

> I had to merge the adapters into the base model, creating new model safetensors, and that worked. I think the issue may be with loading the adapters onto the base model while using dtype=half. Are you using the dtype=half flag?

Merging the LoRA into a new model works, but it is not the same as serving LoRA adapters. Maybe you can have a look at https://github.com/vllm-project/vllm/issues/3826, "punica.py with lora requires sm 8.0".

Check whether your GPU is SM 8.0 or higher.
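A quick way to check from Python; per the issue linked above, the Punica LoRA kernels in this vLLM version need compute capability (SM) 8.0 or newer:

import torch

# Compute capability of the first GPU, e.g. (7, 5) for an RTX 2070 SUPER.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")
if (major, minor) < (8, 0):
    print("Below SM 8.0 -- the Punica LoRA kernels will not run on this GPU.")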

invokeinnovation commented 6 months ago

Makes sense now. My RTX 2070 SUPER is only SM 7.5, which is below the 8.0 requirement.