vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Add num_requests_preempted metric #5051

Open sathyanarays opened 5 months ago

sathyanarays commented 5 months ago

🚀 The feature, motivation and pitch

There are metrics that report how many requests are currently running and waiting: num_requests_running and num_requests_waiting. However, these metrics alone do not show whether requests are being thrashed (repeatedly preempted and rescheduled) and thus underutilizing the GPUs.

The proposed metric num_requests_preempted, reflecting the number of requests that have been preempted and are waiting to resume execution, would give a direct signal of request thrashing. Higher-level schedulers could use it to avoid routing new requests to GPUs that are already thrashing.
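
A minimal sketch of what such a gauge could look like, assuming vLLM's existing prometheus_client-based metrics and the vllm: naming convention; the exact metric name, label, and the place where the scheduler would update it are illustrative assumptions, not the actual implementation:

from prometheus_client import Gauge

# Hypothetical gauge following the naming pattern of existing metrics such as
# vllm:num_requests_running; the name and label here are assumptions.
num_requests_preempted = Gauge(
    "vllm:num_requests_preempted",
    "Number of requests preempted and waiting to resume execution.",
    labelnames=["model_name"],
)

# Illustrative update, e.g. from the scheduler's stat-logging step (assumed):
# num_requests_preempted.labels(model_name="facebook/opt-125m").set(
#     number_of_preempted_requests  # however the scheduler tracks them
# )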

Alternatives

No response

Additional context

from openai import OpenAI
import threading

# Point the OpenAI client at the local vLLM OpenAI-compatible server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

def query():
    # Each thread issues one long completion request to keep the GPU busy.
    client.completions.create(
        model="facebook/opt-125m",
        prompt="Sachin Tendulkar is",
        max_tokens=2040,
        n=1,
    )

# Fire 1000 concurrent requests, then wait for all of them to finish.
threads = []
for i in range(1000):
    thread = threading.Thread(target=query)
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

I ran the above script against a vLLM server started with python -m vllm.entrypoints.openai.api_server on a machine with a single NVIDIA T400; below are my observations:

Observation 1

As soon as the script starts sending requests, num_requests_running reaches 256. At this point, gpu_cache_usage_perc is at 6 percent.

Observation 2

After some time, gpu_cache_usage_perc shoots up to 99 percent.

Observation 3

Gradually, the num_requests_running comes down to 100 while the gpu_cache_usage_perc remains in the 99 percent range.

Observation 4

Gradually, the num_requests_running goes up to 256 while the gpu_cache_usage_perc remains in the 99 percent range.

Observations 3 and 4 repeat until all the requests complete. The metrics num_requests_running and gpu_cache_usage_perc had to be correlated to infer that requests were being thrashed. It would be great if we could expose num_requests_preempted, as it would be a direct measure of thrashing.
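
For reference, the manual correlation described above can be done with a small polling sketch like the one below; it assumes the server exposes Prometheus metrics at http://localhost:8000/metrics under the vllm: prefix (vllm:num_requests_running and vllm:gpu_cache_usage_perc), and it ignores metric labels for brevity:

import re
import time
import urllib.request

def read_metric(text, name):
    # Take the first sample of the metric and ignore its labels.
    match = re.search(rf"^{re.escape(name)}\{{[^}}]*\}} ([0-9.eE+\-]+)$", text, re.MULTILINE)
    return float(match.group(1)) if match else None

while True:
    body = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
    running = read_metric(body, "vllm:num_requests_running")
    cache = read_metric(body, "vllm:gpu_cache_usage_perc")
    print(f"num_requests_running={running} gpu_cache_usage_perc={cache}")
    time.sleep(5)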

sathyanarays commented 5 months ago

Might be related to request_with_evicted_tokens and total_evicted_tokens in https://github.com/vllm-project/vllm/issues/5041

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!