vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Add num_requests_preempted metric #5051

Open sathyanarays opened 1 month ago

sathyanarays commented 1 month ago

🚀 The feature, motivation and pitch

The metrics num_requests_running and num_requests_waiting report how many requests are currently running and waiting. However, these metrics alone do not show whether requests are being thrashed (repeatedly preempted and rescheduled), which leaves the GPUs underutilized.

The proposed metric num_requests_preempted, which reflects the number of requests that have been preempted and are waiting to run again, would give a direct signal of request thrashing. Higher-level schedulers could use it to avoid routing new requests to GPUs that are already thrashing.
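For illustration, here is a minimal sketch of what such a gauge could look like, assuming metrics keep being exported via prometheus_client. The gauge name, label, and update hook below are assumptions made for the example, not the actual vLLM implementation:

# Hypothetical sketch -- not vLLM's actual metrics code.
# Shows how a num_requests_preempted gauge could sit alongside the
# existing num_requests_running / num_requests_waiting gauges.
from prometheus_client import Gauge

num_requests_preempted = Gauge(
    "vllm:num_requests_preempted",  # name chosen for illustration
    "Number of requests preempted and waiting to be rescheduled.",
    labelnames=["model_name"],
)

def record_scheduler_stats(model_name: str, preempted: int) -> None:
    # A scheduler hook would call this every iteration with the current
    # count of preempted sequence groups.
    num_requests_preempted.labels(model_name=model_name).set(preempted)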

Alternatives

No response

Additional context

from openai import OpenAI
import threading

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

def query():
    # Each thread issues one long completion request so that many
    # sequences stay active at the same time.
    client.completions.create(
        model="facebook/opt-125m",
        prompt="Sachin Tendulkar is",
        max_tokens=2040,
        n=1,
    )

# Fire 1000 concurrent requests at the server.
threads = []
for i in range(1000):
    thread = threading.Thread(target=query)
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

I ran the above script against a vLLM server started with the command python -m vllm.entrypoints.openai.api_server on a machine with a single NVIDIA T400 GPU; below are my observations:

Observation 1

As soon as the script starts issuing requests, num_requests_running reaches 256. At this point, gpu_cache_usage_perc is at 6 percent.

Observation 2

After some time, gpu_cache_usage_perc shoots up to 99 percent.

Observation 3

Gradually, num_requests_running comes down to 100 while gpu_cache_usage_perc remains in the 99 percent range.

Observation 4

Gradually, num_requests_running goes back up to 256 while gpu_cache_usage_perc remains in the 99 percent range.

Observations 3 and 4 repeat until all the requests have completed. The metrics num_requests_running and gpu_cache_usage_perc had to be correlated to infer that requests were being thrashed. It would be great if we could provide num_requests_preempted, as it would give a direct measure of thrashing.
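Until such a metric exists, that correlation can be automated by polling the server's /metrics endpoint. The sketch below leans on assumptions: the metric names are guessed to carry a vllm: prefix, and the 0.95 threshold treats gpu_cache_usage_perc as a fraction in [0, 1]; adjust both to match what the endpoint actually reports:

# Rough sketch: poll /metrics and flag likely thrashing when the KV cache
# is nearly full while the number of running requests drops.
# Metric names and the 0.95 threshold are assumptions for illustration.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

def read_gauge(text: str, name: str) -> float:
    # Matches Prometheus exposition lines such as:
    #   vllm:num_requests_running{model_name="facebook/opt-125m"} 256.0
    pattern = rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)\s*$"
    match = re.search(pattern, text, re.MULTILINE)
    return float(match.group(1)) if match else 0.0

prev_running = None
while True:
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    running = read_gauge(body, "vllm:num_requests_running")
    cache_usage = read_gauge(body, "vllm:gpu_cache_usage_perc")
    # Heuristic: the KV cache is nearly full and the number of running
    # requests just dropped, which is what preemption looks like from outside.
    if prev_running is not None and cache_usage > 0.95 and running < prev_running:
        print(f"possible thrashing: running {prev_running} -> {running}, "
              f"cache usage {cache_usage:.2f}")
    prev_running = running
    time.sleep(5)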

sathyanarays commented 1 month ago

Might be related to request_with_evicted_tokens and total_evicted_tokens in https://github.com/vllm-project/vllm/issues/5041