🚀 The feature, motivation and pitch
The metrics num_requests_running and num_requests_waiting give an idea of how many requests are currently running and waiting. But these metrics alone do not indicate whether requests are getting thrashed (repeatedly preempted) and thus underutilizing the GPU.
The proposed new metric num_requests_preempted, reflecting the number of requests that have been preempted and are waiting to resume execution, would give a direct view of request thrashing. This would allow higher-level schedulers to avoid adding new requests to thrashing GPUs.
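For illustration, here is a minimal sketch of how a higher-level scheduler could consume such a metric, assuming it is exposed on the Prometheus /metrics endpoint under a name like vllm:num_requests_preempted (the metric name and helper functions here are hypothetical, not existing API):

import requests

def num_preempted(base_url: str) -> float:
    # Parse the proposed gauge out of the Prometheus text exposition format,
    # e.g. a line like: vllm:num_requests_preempted{model_name="..."} 42.0
    body = requests.get(f"{base_url}/metrics", timeout=5).text
    for line in body.splitlines():
        if line.startswith("vllm:num_requests_preempted"):
            return float(line.rsplit(" ", 1)[-1])
    return 0.0

def pick_backend(backends: list[str]) -> str:
    # Route new requests to the instance that is thrashing the least.
    return min(backends, key=num_preempted)

# Example: backend = pick_backend(["http://gpu-a:8000", "http://gpu-b:8000"])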
Alternatives
No response
Additional context
from openai import OpenAI
import threading

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

def query():
    # Each thread issues one long completion request (max_tokens=2040),
    # which quickly fills the KV cache when run many times concurrently.
    client.completions.create(
        model="facebook/opt-125m",
        prompt="Sachin Tendulkar is",
        max_tokens=2040,
        n=1,
    )

# Fire 1000 concurrent requests to overload the server.
threads = []
for i in range(1000):
    thread = threading.Thread(target=query)
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
I ran the above script against a vLLM server started with the command python -m vllm.entrypoints.openai.api_server on a machine with one NVIDIA T400; below are my observations:
Observation 1
As soon as the queries start executing, the num_requests_running reaches 256. At this point, the gpu_cache_usage_perc is 6 percent.
Observation 2
After some time, the gpu_cache_usage_perc shoots up to 99 percent.
Observation 3
Gradually, the num_requests_running comes down to 100 while the gpu_cache_usage_perc remains in the 99 percent range.
Observation 4
Gradually, the num_requests_running goes up to 256 while the gpu_cache_usage_perc remains in the 99 percent range.
Observations 3 and 4 repeat until all the requests are completed. The metrics num_requests_running and gpu_cache_usage_perc had to be correlated to infer that the requests were getting thrashed. It would be great if we could expose num_requests_preempted, as this would give a direct measure of thrashing.
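The correlation described above can be approximated by polling the existing metrics; below is a minimal sketch, assuming the default /metrics endpoint and the vllm: metric prefix, with a hypothetical thrashing heuristic:

import time
import requests

def scrape(base_url="http://localhost:8000"):
    metrics = {}
    for line in requests.get(f"{base_url}/metrics", timeout=5).text.splitlines():
        if line.startswith("vllm:num_requests_running"):
            metrics["running"] = float(line.rsplit(" ", 1)[-1])
        elif line.startswith("vllm:gpu_cache_usage_perc"):
            metrics["cache"] = float(line.rsplit(" ", 1)[-1])
    return metrics

prev_running = None
while True:
    m = scrape()
    # Heuristic: a saturated KV cache plus a falling running count suggests
    # preemption. The 0.95 threshold assumes the gauge reports a 0-1 fraction;
    # adjust if your deployment reports 0-100.
    if m.get("cache", 0) > 0.95 and prev_running and m.get("running", 0) < prev_running:
        print("likely thrashing:", m)
    prev_running = m.get("running")
    time.sleep(1)

A dedicated num_requests_preempted gauge would make this heuristic unnecessary.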