AsyncLLMEngine supports LoRA; see https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L558
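For illustration, a minimal sketch of what that signature implies (the adapter name, integer ID, and folder path below are placeholders):

```python
from vllm.lora.request import LoRARequest

# A LoRARequest bundles an adapter name, an integer ID, and the local
# path of the adapter folder; it is passed via the lora_request keyword.
lora = LoRARequest("my-adapter", 1, "/path/to/lora_folder")
results_generator = engine.generate(prompt, sampling_params, request_id,
                                    lora_request=lora)
```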
@jeejeelee Yes, vLLM supports it: enable LoRA and create a LoRARequest when calling generate. We used the FastAPI example and modified main with:
```python
args.enable_lora = True
args.max_loras = 1
args.max_lora_rank = 8
args.max_cpu_loras = 2
args.max_num_seqs = 256
```
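For reference, the same settings expressed as command-line flags when launching the demo server (a hedged equivalent; the model name is a placeholder, and the flag names follow vLLM's usual dash-for-underscore argparse convention):

```bash
python -m vllm.entrypoints.api_server \
    --model Qwen/Qwen2-7B-Instruct \
    --enable-lora \
    --max-loras 1 \
    --max-lora-rank 8 \
    --max-cpu-loras 2 \
    --max-num-seqs 256
```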
and modified the generate call with:

```python
results_generator = engine.generate(prompt, sampling_params, request_id, LoRARequest("jho8useyrjbhkwuyu", 1, path_to_lora_folder))
```
We ran api_server.py and the service started; then we ran this command in a terminal:
```bash
curl -X POST http://127.0.0.1:8000/generate -H 'Content-Type: application/json' -d '{"prompt":"hello, how do you do? "}'
```
Service log:

```
Received request 54c05950315b48f48de87e45088ab2f3: prompt: 'hello, how do you do? ', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: LoRARequest(lora_name='jho8useyrjbhkwuyu', lora_int_id=1, lora_local_path=path_to_lora_folder), lora_request: None.
```
followed by this error log:
```
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f990a1a0dc0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f990837a680>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f990a1a0dc0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f990837a680>>)>
Traceback (most recent call last):
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 480, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 439, in engine_step
    await self.engine.add_request_async(**new_request)
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 258, in add_request_async
    return self.add_request(request_id,
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in add_request
    seq = Sequence(seq_id, prompt, prompt_token_ids, block_size,
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/sequence.py", line 208, in __init__
    self._append_tokens_to_blocks(prompt_token_ids)
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/sequence.py", line 248, in _append_tokens_to_blocks
    while cursor < len(token_ids):
TypeError: object of type 'LoRARequest' has no len()

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/wonder/anaconda3/envs/vllm-310/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
```
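Note the telltale fields in the service log above: prompt_token_ids: LoRARequest(...) together with lora_request: None. The LoRARequest was passed positionally, so it landed in generate's prompt_token_ids parameter, and the engine then tried to take len() of it. Passing the adapter through the lora_request keyword should avoid the TypeError (a sketch, not verified here):

```python
# Same call as above, but binding the adapter to the lora_request keyword
# instead of the positional prompt_token_ids slot.
results_generator = engine.generate(
    prompt, sampling_params, request_id,
    lora_request=LoRARequest("jho8useyrjbhkwuyu", 1, path_to_lora_folder))
```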
@Rares9999
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/data/app/vllm/vllm_api_Qwen.py", line 196, in <module>
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 236, in __init__
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 75, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
[rank0]:     self.execute_model(seqs, kv_caches)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 749, in execute_model
[rank0]:     hidden_states = model_executable(
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 216, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 75, in forward
[rank0]:     gate_up, _ = self.gate_up_proj(x)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/lora/layers.py", line 470, in forward
[rank0]:     output_parallel = self.apply(input_, bias)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/lora/layers.py", line 600, in apply
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/lora/layers.py", line 129, in _apply_lora_packed_nslice
[rank0]:     add_lora_slice(output, x, lora_a_stacked[slice_idx],
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/lora/punica.py", line 196, in add_lora_slice
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/_custom_ops.py", line 34, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/_custom_ops.py", line 472, in dispatch_bgmv_low_level
[rank0]:   File "/home/fist/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 854, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: No suitable kernel. h_in=8 h_out=18944 dtype=Float out_dtype=BFloat16
```
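This second traceback looks unrelated to the first one: the base model runs in bfloat16, but the error reports dtype=Float for the LoRA weights (h_in=8 is just the adapter rank), and the Punica bgmv kernels apparently have no variant for a float32 adapter against a bfloat16 model. One plausible workaround, assuming the adapter was saved in float32 as adapter_model.safetensors, is to cast it to the model's dtype before loading (an untested sketch; the path is a placeholder):

```python
import torch
from safetensors.torch import load_file, save_file

# Cast a LoRA adapter stored in float32 down to bfloat16 so its dtype
# matches the base model's. The path is a placeholder.
path = "path_to_lora_folder/adapter_model.safetensors"
weights = load_file(path)
weights = {name: tensor.to(torch.bfloat16) for name, tensor in weights.items()}
save_file(weights, path)
```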
Your current environment
How would you like to use vllm
We want to use LoRA with AsyncLLMEngine, but there are no examples. We mimicked the official tutorial and added a LoRARequest to the generate method, but got the error message shown above.
Is it possible to use a LoRA adapter with AsyncLLMEngine?
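For completeness, here is a minimal end-to-end sketch of the intended usage (untested; the base model name, adapter name/ID, and adapter path are placeholders, and lora_request is passed as a keyword argument as discussed above):

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.lora.request import LoRARequest

# Build the engine with LoRA enabled (mirrors the args set earlier).
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
    model="Qwen/Qwen2-7B-Instruct",  # placeholder base model
    enable_lora=True,
    max_loras=1,
    max_lora_rank=8,
    max_cpu_loras=2,
    max_num_seqs=256,
))

async def main() -> None:
    results_generator = engine.generate(
        "hello, how do you do? ",
        SamplingParams(max_tokens=64),
        "request-0",
        lora_request=LoRARequest("my-adapter", 1, "/path/to/lora_folder"),
    )
    final_output = None
    async for request_output in results_generator:  # streams partial outputs
        final_output = request_output
    print(final_output.outputs[0].text)

asyncio.run(main())
```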