[Open] sleepwalker2017 opened this issue 8 months ago
This is in line with what we expect. There are two optimizations that can be applied (help wanted!):

1. Only enable the Punica kernels when LoRA is requested. From discussion with @Yard1 offline: currently we do all of the LoRA work even if no LoRAs are loaded, and it can be further optimized (orthogonally to getting better kernels). So if anyone can help us conditionally enable the Punica kernels for multi-LoRA only when LoRA is actually used, that would be wonderful; a sketch of the idea follows this list.
2. Optimize the Punica kernels for smaller, non-A100/H100 devices, even down to compute capability sm_75.
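A minimal sketch of point 1, with hypothetical names (`linear_forward` and `active_lora_ids` are illustrations, not vLLM's actual internals): when the current batch contains no LoRA requests, the layer returns the plain base GEMM result and skips the LoRA/Punica work entirely.

```python
import torch

def linear_forward(x: torch.Tensor,
                   base_weight: torch.Tensor,
                   lora_a: torch.Tensor,
                   lora_b: torch.Tensor,
                   active_lora_ids: list[int]) -> torch.Tensor:
    """Hypothetical sketch: pay the LoRA cost only when an adapter is active.

    x:           (tokens, in_features)
    base_weight: (out_features, in_features)
    lora_a:      (rank, in_features)
    lora_b:      (out_features, rank)
    active_lora_ids: adapters used by the current batch; empty when every
                     request targets the base model.
    """
    out = x @ base_weight.t()      # base path, always executed
    if not active_lora_ids:        # no adapter in this batch:
        return out                 # skip the LoRA / Punica work entirely
    # With adapters active, add the low-rank update x @ A^T @ B^T.
    # A real multi-LoRA batch would dispatch a Punica SGMV/BGMV kernel here.
    return out + (x @ lora_a.t()) @ lora_b.t()
```

In vLLM itself the equivalent check would live wherever the LoRA layers are invoked for a batch; the point is simply that it is a cheap per-batch branch, so a server launched with LoRA enabled but serving only base-model requests pays nothing extra.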
OK, I'm reading through the multi-LoRA code in vLLM. I'll look into this, and if I'm able to do it, I'll send a PR.
Hello, still some points to confirm. I measured three cases:

1. LoRA not enabled: 311.5 token/sec
2. LoRA enabled but not used by the requests: 254.9 token/sec
3. LoRA actually used: 196 token/sec

When the optimization is done, the performance of case 2 should match case 1, but case 3 will stay roughly where it is, is that right? The overhead of multi-LoRA seems fairly large.
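For concreteness, cases 2 and 3 differ only in which model name the request targets. Assuming the adapter was registered at startup with `--lora-modules` under a made-up name `my-lora`, a case-3 request would look roughly like this (the adapter name and path are illustrative):

```bash
# Case 3 (sketch): route the request to a LoRA adapter that was
# registered at launch, e.g. --lora-modules my-lora=/path/to/adapter
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-lora", "prompt": "Hello", "max_tokens": 64}'
```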
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
I compared two ways to launch the server. The model is vicuna-7b and the GPUs are 2 * A30. The first way is:
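Something along these lines, without LoRA enabled (a sketch; the exact entrypoint and flags are assumed, not the original command):

```bash
# Way 1 (assumed): plain server, no LoRA support
python -m vllm.entrypoints.openai.api_server \
  --model /data/models/vicuna-7b-v1.5/ \
  --tensor-parallel-size 2
```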
The second way is:
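Identical except that multi-LoRA support is switched on (again a sketch; the `--enable-lora` flag is the intended difference):

```bash
# Way 2 (assumed): same launch, but with LoRA support enabled
python -m vllm.entrypoints.openai.api_server \
  --model /data/models/vicuna-7b-v1.5/ \
  --tensor-parallel-size 2 \
  --enable-lora
```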
In both tests I send the same request, which sets the model to /data/models/vicuna-7b-v1.5/, but the performance differs a lot.
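The request is an ordinary OpenAI-compatible completions call, roughly like this (prompt and token count are placeholders):

```bash
# Same request against both servers; only the launch flags differ
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/data/models/vicuna-7b-v1.5/", "prompt": "Hello", "max_tokens": 64}'
```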