vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How to use LoRARequest with AsyncLLMEngine? #4203

Closed · Rares9999 closed this issue 6 months ago

Rares9999 commented 6 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          20
On-line CPU(s) list:             0-19
Thread(s) per core:              2
Core(s) per socket:              10
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz
Stepping:                        5
CPU MHz:                         3700.000
CPU max MHz:                     5300.0000
CPU min MHz:                     800.0000
BogoMIPS:                        7399.70
Virtualization:                  VT-x
L1d cache:                       320 KiB
L1i cache:                       320 KiB
L2 cache:                        2.5 MiB
L3 cache:                        20 MiB
NUMA node0 CPU(s):               0-19
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:          Mitigation; Enhanced IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Mitigation; Microcode
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.1.2                    pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  0-19    0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

We want to use LoRA with AsyncLLMEngine, but there are no examples. We mimicked the official tutorial and added a LoRARequest to the generate method, but got this error message:

TypeError: object of type 'LoRARequest' has no len()

Is it possible to use a LoRA adapter with AsyncLLMEngine?

jeejeelee commented 6 months ago

AsyncLLMEngine supports LoRA; refer to https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L558
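For reference, a minimal sketch of driving that API end to end, assuming vLLM 0.4.x (the base model name and adapter path below are placeholders):

import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.lora.request import LoRARequest

# Build an async engine with LoRA support enabled.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="meta-llama/Llama-2-7b-hf",  # placeholder base model
        enable_lora=True,
        max_loras=1,
        max_lora_rank=8,
    )
)

async def main() -> None:
    # generate() returns an async generator of RequestOutput objects.
    # lora_request must be passed by keyword: in 0.4.x the fourth
    # positional parameter is prompt_token_ids, not the LoRA request.
    stream = engine.generate(
        "hello, how do you do? ",
        SamplingParams(max_tokens=16),
        request_id=str(uuid.uuid4()),
        lora_request=LoRARequest("my-adapter", 1, "/path/to/lora_folder"),
    )
    async for request_output in stream:
        final_output = request_output
    print(final_output.outputs[0].text)

asyncio.run(main())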

Rares9999 commented 6 months ago

@jeejeelee Yes, vLLM supports it: enable LoRA and create a LoRARequest when calling generate. We used the FastAPI server example and modified main with:

args.enable_lora=True
args.max_loras=1
args.max_lora_rank=8
args.max_cpu_loras=2
args.max_num_seqs=256

and modified the generate call with:

results_generator = engine.generate(prompt, sampling_params, request_id, LoRARequest("jho8useyrjbhkwuyu", 1, path_to_lora_folder))

We ran api_server.py and the service started; then we ran this command in a terminal:

curl -X POST http://127.0.0.1:8000/generate -H 'Content-Type: application/json' -d '{"prompt":"hello, how do you do? "}'

service log:

Received request 54c05950315b48f48de87e45088ab2f3: prompt: 'hello, how do you do? ', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: LoRARequest(lora_name='jho8useyrjbhkwuyu', lora_int_id=1, lora_local_path=path_to_lora_folder), lora_request: None.

and then dumps this error log:

Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f990a1a0dc0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f990837a680>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f990a1a0dc0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f990837a680>>)>
Traceback (most recent call last):
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 480, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 439, in engine_step
    await self.engine.add_request_async(**new_request)
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 258, in add_request_async
    return self.add_request(request_id,
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in add_request
    seq = Sequence(seq_id, prompt, prompt_token_ids, block_size,
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/sequence.py", line 208, in __init__
    self._append_tokens_to_blocks(prompt_token_ids)
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/sequence.py", line 248, in _append_tokens_to_blocks
    while cursor < len(token_ids):
TypeError: object of type 'LoRARequest' has no len()

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
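Judging from the request log above, which prints prompt_token_ids: LoRARequest(...) and lora_request: None, the adapter object was passed positionally and landed in the prompt_token_ids slot, which the engine then tries to take len() of. Assuming the 0.4.x signature generate(prompt, sampling_params, request_id, prompt_token_ids=None, lora_request=None), passing the adapter by keyword should avoid the TypeError:

# Pass the LoRARequest to the lora_request keyword rather than the fourth
# positional slot (prompt_token_ids); path_to_lora_folder is as above.
results_generator = engine.generate(
    prompt,
    sampling_params,
    request_id,
    lora_request=LoRARequest("jho8useyrjbhkwuyu", 1, path_to_lora_folder),
)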
wangzhonghai commented 3 months ago

@Rares9999

rank0: Traceback (most recent call last):
rank0:   File "/home/data/app/vllm/vllm_api_Qwen.py", line 196, in <module>
rank0:     generation_config, tokenizer, stop_words_ids, engine, lora_request = load_vllm(args_parameter)
rank0:   File "/home/data/app/vllm/vllm_api_Qwen.py", line 64, in load_vllm
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
rank0:     engine = cls(
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
rank0:     self.engine = self._init_engine(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
rank0:     return engine_class(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 236, in __init__
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 75, in determine_num_available_blocks
rank0:     return self.driver_worker.determine_num_available_blocks()
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
rank0:     self.execute_model(seqs, kv_caches)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 749, in execute_model
rank0:     hidden_states = model_executable(
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
rank0:     hidden_states = self.model(input_ids, positions, kv_caches,
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
rank0:     hidden_states, residual = layer(
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 216, in forward
rank0:     hidden_states = self.mlp(hidden_states)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 75, in forward
rank0:     gate_up, _ = self.gate_up_proj(x)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/lora/layers.py", line 470, in forward
rank0:     output_parallel = self.apply(input_, bias)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/lora/layers.py", line 600, in apply
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/lora/layers.py", line 129, in _apply_lora_packed_nslice
rank0:     add_lora_slice(output, x, lora_a_stacked[slice_idx],
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/lora/punica.py", line 196, in add_lora_slice
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/_custom_ops.py", line 34, in wrapper
rank0:     return fn(*args, **kwargs)
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/_custom_ops.py", line 472, in dispatch_bgmv_low_level
rank0:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 854, in __call__
rank0:     return self._op(*args, **(kwargs or {}))
rank0: RuntimeError: No suitable kernel. h_in=8 h_out=18944 dtype=Float out_dtype=BFloat16
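This later failure looks like a dtype mismatch rather than the original argument-order problem: dispatch_bgmv_low_level reports float32 LoRA weights (dtype=Float) against a bfloat16 model (out_dtype=BFloat16), a combination the Punica kernels are apparently not built for. One plausible workaround, assuming the adapter was saved in float32 as adapter_model.safetensors, is to cast its tensors to the model's dtype before loading:

# Hypothetical repair sketch: cast a float32 LoRA adapter to the base
# model's dtype (bfloat16 here) so the Punica dispatch can find a matching
# kernel. The adapter path is a placeholder.
import torch
from safetensors.torch import load_file, save_file

adapter_path = "/path/to/lora_folder/adapter_model.safetensors"
tensors = load_file(adapter_path)
tensors = {name: t.to(torch.bfloat16) for name, t in tensors.items()}
save_file(tensors, adapter_path)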