vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Chunked prefill + lora #4995

Open rkooo567 opened 4 months ago

rkooo567 commented 4 months ago

🚀 The feature, motivation and pitch

Currently LoRA doesn't work with chunked prefill because some of the LoRA index logic doesn't cover the case where sampling is not required. This also means LoRA is not working with sampling_params do_sample=True.

We need to add test cases for these. WIP https://github.com/vllm-project/vllm/pull/4994
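
For context, here is a minimal sketch of the configuration in question (the model name, adapter path, and prompt are placeholders, not from a real test):

```python
# Minimal sketch: LoRA together with chunked prefill, the combination this
# issue is about. Model name, adapter path, and prompt are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    enable_chunked_prefill=True,  # the combination reported as broken here
)

outputs = llm.generate(
    ["Give me a one-line summary of LoRA."],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```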

Alternatives

No response

Additional context

No response

rohithkrn commented 3 months ago

@rkooo567 can you share an example to reproduce this issue?

rkooo567 commented 3 months ago

I think you can simply create a test case by adding chunked prefill to any lora correctness test!

rkooo567 commented 3 months ago

https://github.com/vllm-project/vllm/tree/main/tests/lora
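
For what it's worth, a rough sketch of what such a test could look like (the base model, adapter path, and prompts below are hypothetical placeholders, not taken from the existing tests):

```python
# Hypothetical sketch of a LoRA correctness test extended with chunked prefill.
# BASE_MODEL, LORA_PATH, and PROMPTS are placeholders, not from tests/lora.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

BASE_MODEL = "meta-llama/Llama-2-7b-hf"
LORA_PATH = "/path/to/lora_adapter"
PROMPTS = ["Translate to French: good morning"]


def generate_with_lora(enable_chunked_prefill):
    # Note: building two engines in one process needs enough GPU memory;
    # each configuration could also be run in a separate process instead.
    llm = LLM(
        model=BASE_MODEL,
        enable_lora=True,
        enable_chunked_prefill=enable_chunked_prefill,
    )
    params = SamplingParams(temperature=0.0, max_tokens=32)
    lora = LoRARequest("test-adapter", 1, LORA_PATH)
    outputs = llm.generate(PROMPTS, params, lora_request=lora)
    return [o.outputs[0].text for o in outputs]


def test_lora_with_chunked_prefill_matches_baseline():
    # Greedy sampling, so both configurations should produce identical text.
    assert generate_with_lora(True) == generate_with_lora(False)
```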

rohithkrn commented 3 months ago

@rkooo567 actually, when I run tests/lora/test_llama.py it passes. However, when I run examples/multilora_inference.py with chunked prefill, the results do not match the results without chunked prefill. So I want to make sure we are talking about the same issue; I am trying to look into this on my side as well.

rohithkrn commented 3 months ago

@rkooo567 also are you seeing garbage output or an error?

sfc-gh-zhwang commented 2 weeks ago

You mean "This also means lora is not working with sampling_params do_sample=False"? @rkooo567

rkooo567 commented 2 weeks ago

> @rkooo567 actually, when I run tests/lora/test_llama.py it passes. However, when I run examples/multilora_inference.py with chunked prefill, the results do not match the results without chunked prefill. So I want to make sure we are talking about the same issue; I am trying to look into this on my side as well.

Hi, I just saw this. I think LoRA + chunked prefill is basically broken right now because LoRA assumes an index mapping that only works with the default scheduling policy. I think the side effect could be wrong output or a crash.
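
To illustrate the kind of mismatch I mean (this is purely a conceptual sketch, not vLLM's actual code; the buggy mapping logic below is hypothetical):

```python
# Illustrative only, NOT vLLM's implementation: why an index mapping built
# around "sequences that sample this step" breaks once prefills are chunked.

# Each entry: (lora_id, tokens_this_step, samples_this_step)
default_batch = [
    (0, 5, True),   # full prefill: the last token is sampled
    (1, 1, True),   # decode: every step samples
]
chunked_batch = [
    (0, 3, False),  # partial prefill chunk: no sampling yet
    (1, 1, True),   # decode
]


def hypothetical_buggy_mapping(batch):
    # Only sequences that sample this step get per-token LoRA indices.
    indices = []
    for lora_id, tokens, samples in batch:
        if samples:
            indices.extend([lora_id] * tokens)
    return indices


def total_tokens(batch):
    return sum(tokens for _, tokens, _ in batch)


# Default policy: the mapping covers every token in the batch.
assert len(hypothetical_buggy_mapping(default_batch)) == total_tokens(default_batch)

# Chunked prefill: the 3 tokens of the partial chunk get no LoRA index, so the
# kernels either read the wrong adapter (garbage output) or the shapes
# mismatch (crash).
print(len(hypothetical_buggy_mapping(chunked_batch)), total_tokens(chunked_batch))  # 1 4
```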

rkooo567 commented 2 weeks ago

> You mean "This also means lora is not working with sampling_params do_sample=False"? @rkooo567

Yes! That's right.