Open · rkooo567 opened 4 months ago
@rkooo567 can you share an example to reproduce this issue?
I think you can simply create a test case by adding chunked prefill to any lora correctness test!
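For reference, a minimal sketch of such a test, assuming a locally available LoRA adapter; the base model name, adapter path, and prompt below are placeholders, not taken from this thread. It runs the same LoRA request with chunked prefill off and on and compares greedy outputs.

```python
# Minimal repro sketch; model and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest


def greedy_lora_outputs(enable_chunked_prefill: bool) -> list[str]:
    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",  # placeholder base model
        enable_lora=True,
        enable_chunked_prefill=enable_chunked_prefill,
    )
    params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy decoding
    lora = LoRARequest("my-adapter", 1, "/path/to/adapter")  # placeholder path
    outputs = llm.generate(["Write a haiku about the ocean."],
                           params, lora_request=lora)
    return [out.outputs[0].text for out in outputs]


# Greedy decoding is deterministic, so enabling chunked prefill should not
# change the generated text; per this issue, it currently does.
assert greedy_lora_outputs(False) == greedy_lora_outputs(True)
```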
@rkooo567 actually, when I run tests/lora/test_llama.py it passes. However, when I run examples/multilora_inference.py with chunked prefill, the results do not match the results without chunked prefill. So I want to make sure we are talking about the same issue; I am trying to look into this on my side as well.
@rkooo567 also are you seeing garbage output or an error?
you mean "This also means lora is not working with sampling_params do_sample=False."? @rkooo567
Hi, I just saw this. I think lora + chunked prefill is basically broken right now, because lora assumes an index mapping that only works with the default scheduling policy. The side effect could be wrong output or a crash.
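To illustrate the kind of assumption being described, here is a toy sketch (not vLLM's actual indexing code): if the sampler-side index mapping assumes every scheduled sequence produces a token to sample, intermediate prefill chunks break the alignment between sampler rows and adapter ids.

```python
# Toy sketch (not vLLM's actual code) of an index mapping that assumes every
# scheduled sequence reaches the sampler.
lora_ids = [0, 1, 2]  # adapter id for each sequence in the batch

# With chunked prefill, sequence 1 is an intermediate prefill chunk and
# produces no token to sample; only sequences 0 and 2 reach the sampler.
samples_token = [True, False, True]

# Buggy mapping: assumes sampler row i belongs to sequence i, so it pairs
# the sampler rows with the first len(rows) adapter ids.
buggy = lora_ids[: sum(samples_token)]                           # [0, 1]

# Correct mapping: keep only the sequences that actually sample.
correct = [lid for lid, s in zip(lora_ids, samples_token) if s]  # [0, 2]

# Sampler row 1 gets adapter 1's transform instead of adapter 2's, which
# surfaces as wrong output rather than a crash.
assert buggy != correct
```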
Yes! that's right
🚀 The feature, motivation and pitch
Currently lora doesn't work with chunked prefill because some of the lora index logic doesn't cover the case where sampling is not required. This also means lora is not working with sampling_params do_sample=False.
We need to add test cases for these. WIP https://github.com/vllm-project/vllm/pull/4994