sathyanarays opened 5 months ago
Might be related to `request_with_evicted_tokens` and `total_evicted_tokens` in https://github.com/vllm-project/vllm/issues/5041.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
🚀 The feature, motivation and pitch
There are metrics that give an idea of how many requests are currently running and waiting: `num_requests_running` and `num_requests_waiting`. But these metrics alone do not show whether requests are being thrashed (repeatedly preempted) and the GPUs are therefore underutilized. The proposed metric `num_requests_preempted`, reflecting the number of requests that have been preempted and are waiting for execution, would give a direct signal of request thrashing. This would allow higher-level schedulers to avoid adding new requests to GPUs that are already thrashing; a rough sketch of the idea follows the Alternatives section below.

Alternatives
No response
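To make the pitch above concrete, here is a rough sketch of what the proposed gauge could look like if it were exposed with prometheus_client next to the existing gauges. The metric name, the `vllm:` prefix, and the `record_scheduler_state()` hook are illustrative assumptions, not vLLM's actual API.

```python
# Hypothetical sketch only: the gauge name, the "vllm:" prefix, and the
# record_scheduler_state() hook are illustrative, not vLLM's actual API.
from prometheus_client import Gauge

num_requests_preempted = Gauge(
    "vllm:num_requests_preempted",
    "Number of requests that have been preempted and are waiting to resume.",
)

def record_scheduler_state(preempted_count: int) -> None:
    # Imagined per-scheduler-step hook that receives the current number of
    # preempted-and-waiting requests and publishes it as a gauge.
    num_requests_preempted.set(preempted_count)
```

A higher-level scheduler could then treat a sustained non-zero value of this gauge on a replica as a signal to stop routing new requests to it.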
Additional context
I ran the above script against a vLLM server started with the command `python -m vllm.entrypoints.openai.api_server` on a machine with one NVIDIA T400; below are my observations.

Observation 1
As soon as the query is executed, `num_requests_running` reaches 256. At this point, `gpu_cache_usage_perc` is 6 percent.

Observation 2
After some time, `gpu_cache_usage_perc` shoots up to 99 percent.

Observation 3
Gradually, `num_requests_running` comes down to 100 while `gpu_cache_usage_perc` remains around 99 percent.

Observation 4
Gradually, `num_requests_running` goes back up to 256 while `gpu_cache_usage_perc` remains around 99 percent.

Observations 3 and 4 repeat until all the requests are completed. The metrics `num_requests_running` and `gpu_cache_usage_perc` had to be correlated to conclude that the requests were being thrashed. It would be great if we could expose `num_requests_preempted`, as it would give a direct measure of thrashing.
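For reference, the snippet below sketches how that correlation has to be done today by polling the metrics endpoint. It is only a sketch under stated assumptions: the server listens on localhost:8000, exposes Prometheus-format metrics at /metrics, prefixes metric names with `vllm:`, and reports `gpu_cache_usage_perc` as a 0-1 fraction; adjust these if your deployment differs.

```python
# Sketch of the workaround: poll /metrics and correlate running-request count
# with KV-cache usage to guess at thrashing, since no preemption gauge exists.
# Assumes localhost:8000, a "vllm:" metric prefix, and a 0-1 cache fraction.
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"
WATCHED = ("vllm:num_requests_running", "vllm:gpu_cache_usage_perc")

def scrape(url: str) -> dict:
    """Fetch the Prometheus exposition page and return values for the watched gauges."""
    body = urllib.request.urlopen(url, timeout=5).read().decode()
    values = {}
    for line in body.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        for name in WATCHED:
            # Matches exposition lines such as: vllm:num_requests_running{...} 256.0
            if line.startswith(name):
                values[name] = float(line.rsplit(" ", 1)[-1])
    return values

if __name__ == "__main__":
    prev_running = None
    while True:
        m = scrape(METRICS_URL)
        running = m.get("vllm:num_requests_running", 0.0)
        cache = m.get("vllm:gpu_cache_usage_perc", 0.0)
        # Heuristic stand-in for the missing metric: the KV cache is saturated
        # and the running count dropped since the last poll (Observation 3 above).
        dropped = prev_running is not None and running < prev_running
        print(f"running={running:.0f} cache_usage={cache:.2%} "
              f"thrashing_suspected={cache > 0.95 and dropped}")
        prev_running = running
        time.sleep(5)
```

A direct `num_requests_preempted` gauge would replace this guesswork with a single value that can be scraped and alerted on.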