rrichajalota opened this issue 8 months ago
I am using a vLLM endpoint with the OpenAI API to send concurrent requests to a Llama2-7B model deployed on a single A100 GPU. Regardless of the values I set for `--block-size`, `--swap-space`, `--max-num-seqs`, or `--max-num-batched-tokens`, GPU utilization always fluctuates between 65% and 75% (momentarily it also dips below or spikes above this range). Is there a way to optimize GPU utilization and consequently enhance throughput?
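For context, the server is launched roughly as shown below; the model path and flag values here are illustrative, not my exact configuration:

```bash
# Illustrative launch command; flag values are examples, not a tuned config.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --block-size 16 \
    --swap-space 4 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 4096 \
    --gpu-memory-utilization 0.9
```

(Note that `--gpu-memory-utilization` controls the fraction of GPU memory vLLM reserves for weights and KV cache; it is not the same metric as the SM utilization percentage reported by `nvidia-smi`.)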
I am testing with 400 prompts (i.e., 400 concurrent requests) and have also scaled up to 1,200 requests, but to no effect; a sketch of the client side follows.
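This is a minimal sketch of how the concurrent requests are issued, assuming the server runs at `localhost:8000` and using the `openai` Python client; the model name, prompt contents, and `max_tokens` value are placeholders:

```python
import asyncio
from openai import AsyncOpenAI

# Assumed local vLLM OpenAI-compatible endpoint; the API key is unused but required.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def complete(prompt: str) -> str:
    # One completion request; max_tokens is an illustrative value.
    resp = await client.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        prompt=prompt,
        max_tokens=128,
    )
    return resp.choices[0].text

async def main() -> None:
    prompts = [f"Prompt {i}" for i in range(400)]  # stand-in for the 400 test prompts
    # Fire all requests concurrently; vLLM schedules them via continuous batching.
    outputs = await asyncio.gather(*(complete(p) for p in prompts))
    print(len(outputs), "completions received")

asyncio.run(main())
```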
Configuration:
Any help would be appreciated!