Observations and Questions
Using LoRA adapters led to a notable degradation in throughput and latency compared to the base model: we observed up to a 50% drop in maximum throughput with LoRA.
Is this performance degradation expected with LoRA adapters?
Are there parameters or tuning options that could improve LoRA performance?
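For reference, here is a minimal sketch of the kind of fixed-QPS load loop behind these numbers (not our exact harness): it sends requests at a constant rate for roughly 100 seconds per QPS value and logs per-request latency. The endpoint and payload mirror the sample query under Reproduction; the IP/PORT environment variables, the single placeholder prompt (standing in for the ShareGPT prompts), the CSV file name, and the QPS sweep values are illustrative.

```python
import asyncio
import csv
import os
import time

import aiohttp

LORAX_URL = f"http://{os.environ['IP']}:{os.environ['PORT']}/generate"
# Set to None to benchmark the base model instead of a LoRA adapter.
ADAPTER_ID = "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"
# Placeholder prompt; the real runs used ShareGPT prompts.
PROMPTS = ["[INST] Natalia sold clips to 48 of her friends in April, and then she "
           "sold half as many clips in May. How many clips did Natalia sell "
           "altogether in April and May? [/INST]"]


async def one_request(session: aiohttp.ClientSession, prompt: str, results: list) -> None:
    """Send a single /generate request and record (latency_seconds, status)."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 10}}
    if ADAPTER_ID:
        payload["parameters"]["adapter_ids"] = ADAPTER_ID  # same field as the curl example
    start = time.perf_counter()
    async with session.post(LORAX_URL, json=payload) as resp:
        await resp.text()
        results.append((time.perf_counter() - start, resp.status))


async def run(qps: float, duration_s: float = 100.0) -> None:
    """Hold a constant request rate for `duration_s` seconds, then dump latencies to CSV."""
    results, tasks = [], []
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        sent = 0
        while time.perf_counter() - start < duration_s:
            prompt = PROMPTS[sent % len(PROMPTS)]
            tasks.append(asyncio.create_task(one_request(session, prompt, results)))
            sent += 1
            # Sleep until the next request's scheduled send time to keep the rate at `qps`.
            await asyncio.sleep(max(0.0, start + sent / qps - time.perf_counter()))
        await asyncio.gather(*tasks)
    with open(f"latencies_qps{qps}.csv", "w", newline="") as f:
        csv.writer(f).writerows(results)


if __name__ == "__main__":
    for qps in (1, 2, 4, 8):  # illustrative sweep, not the exact QPS values we ran
        asyncio.run(run(qps))
```

Running it once with ADAPTER_ID set and once with it as None produces the LoRA and base-model measurements being compared above.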
Information
[X] Docker
[ ] The CLI directly
Tasks
[X] An officially supported command
[ ] My own modifications
Reproduction
Sample Query:
curl -i ${IP}:${PORT}/generate -X POST -d '{
  "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
  "parameters": {
    "max_new_tokens": 10,
    "adapter_ids": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"
  }
}' -H 'Content-Type: application/json'
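The base-model runs use the same request with adapter_ids simply omitted. A small illustrative snippet (Python with the requests library; IP/PORT as above, field names taken from the sample query) that times both variants back to back:

```python
import os
import time

import requests

url = f"http://{os.environ['IP']}:{os.environ['PORT']}/generate"
prompt = ("[INST] Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did Natalia sell altogether in "
          "April and May? [/INST]")

# Same request twice: once against the base model, once routed through the adapter.
for label, params in [
    ("base model", {"max_new_tokens": 10}),
    ("LoRA adapter", {"max_new_tokens": 10,
                      "adapter_ids": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"}),
]:
    start = time.perf_counter()
    resp = requests.post(url, json={"inputs": prompt, "parameters": params})
    print(f"{label}: HTTP {resp.status_code} in {time.perf_counter() - start:.3f}s")
```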
Expected Behavior
After reviewing the LoRAX blog, particularly the statement:
"Processing 1M tokens spread evenly across 32 different fine-tuned models takes just about as much time as processing the same number of tokens for 1 fine-tuned model due to the near-optimal multi-adapter batching throughput associated with LoRAX."
I anticipated a smaller impact on latency and throughput when using LoRA adapters. Could you please clarify how the savings in Figure 1 were calculated? It would be helpful to understand whether this level of performance degradation is typical and whether there are specific tuning options that might help mitigate it. Thank you for your guidance.
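For what it's worth, to get closer to the blog's "1M tokens spread evenly across 32 different fine-tuned models" scenario, one would rotate the adapter id across requests rather than pinning a single adapter. A hypothetical helper along these lines (our experiments only exercised the two adapters listed under System Info, versus 32 in the blog):

```python
from itertools import cycle

# Rotate across the adapters from our experiments; the blog's claim assumes
# many more adapters sharing one base model.
ADAPTERS = cycle([
    "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm",
    "xtuner/Llama-2-7b-qlora-moss-003-sft",
])


def next_payload(prompt: str) -> dict:
    """Build a /generate payload that round-robins the adapter id per request."""
    return {"inputs": prompt,
            "parameters": {"max_new_tokens": 10, "adapter_ids": next(ADAPTERS)}}
```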
System Info
Setup Summary for LoRAX Benchmarking with Llama-2 Model:
Model: meta-llama/Llama-2-7b-hf
Experiments:
1. meta-llama/Llama-2-7b-hf (base model, no adapter)
2. vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm (size 160 MB)
3. xtuner/Llama-2-7b-qlora-moss-003-sft (size 640 MB)
Each experiment ran for about 100 seconds at each QPS value. For all three experiments, we used the same input prompts (ShareGPT) and observed similar output lengths.
Benchmark Metrics: We measured throughput and latency at each QPS value.
You can view detailed results in the benchmark document: Benchmark 1 server - LoRAX.pdf
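For context on how the raw timings are aggregated, here is a sketch of the summary statistics, assuming per-request latencies are logged to a CSV with one latency/status row per request as in the load-loop sketch above; the file name, the 100-second window, and the use of request-level rather than token-level throughput are illustrative.

```python
import csv
import statistics


def summarize(path: str, duration_s: float = 100.0) -> dict:
    """Aggregate per-request latencies into throughput and latency statistics."""
    with open(path, newline="") as f:
        latencies = [float(row[0]) for row in csv.reader(f) if row]
    return {
        "requests": len(latencies),
        "throughput_req_per_s": len(latencies) / duration_s,
        "mean_latency_s": statistics.mean(latencies),
        "p99_latency_s": statistics.quantiles(latencies, n=100)[98],
    }


print(summarize("latencies_qps4.csv"))
```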