ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Serve] Expose internal VLLM metrics #46360

Open · gilljon opened 4 months ago

gilljon commented 4 months ago

Description

vLLM exposes a number of important metrics, and we should make these accessible to downstream users. Currently, these metrics are not forwarded to the /metrics endpoint.

Use case

vLLM's metrics include LLM-specific series such as time to first token (TTFT), e2e_request_latency_seconds, and avg_prompt_throughput_toks_per_s. It would be fantastic to have these accessible in Ray Dashboards/Grafana.
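Once forwarded, these could be checked with a quick scrape of the /metrics endpoint. A minimal sketch, assuming the endpoint is reachable at http://localhost:8080/metrics (the actual host/port depends on your Ray metrics configuration) and that vLLM-originated series carry the vllm: name prefix:

import requests

# NOTE: placeholder URL; substitute the metrics address exported by your
# Ray cluster / Serve deployment.
METRICS_URL = "http://localhost:8080/metrics"

def print_vllm_metrics(url: str = METRICS_URL) -> None:
    # Fetch the raw Prometheus exposition text and print vLLM-related lines.
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    for line in response.text.splitlines():
        # Filtering on the "vllm:" prefix is an assumption about the naming convention.
        if line.startswith("vllm:"):
            print(line)

if __name__ == "__main__":
    print_vllm_metrics()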

mcd01 commented 1 day ago

Not sure if this is exactly what you were asking for, but we had a similar requirement and found an existing metrics test in the vllm-project that apparently makes this possible already by attaching an additional stat logger. We tested it successfully: the metrics are now also exposed via the /metrics endpoint and can be queried with, e.g., Prometheus.

In short:

from typing import Optional, List, Union

from fastapi import FastAPI
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.metrics import RayPrometheusStatLogger
from vllm.entrypoints.openai.serving_engine import LoRAModulePath
# ... make other necessary imports

app = FastAPI()

def get_served_model_names(engine_args: AsyncEngineArgs) -> List[str]:
    # served_model_name may be a single name or a list of names; normalize to a
    # list, falling back to the model path when no serving name was given.
    if engine_args.served_model_name is not None:
        served_model_names: Union[str, List[str]] = engine_args.served_model_name
        if isinstance(served_model_names, str):
            served_model_names = [served_model_names]
    else:
        served_model_names = [engine_args.model]
    return served_model_names

@serve.deployment(name="VLLMDeployment")
@serve.ingress(app)
class VLLMDeployment:
    def __init__(
            self,
            engine_args: AsyncEngineArgs,
            response_role: str,
            lora_modules: Optional[List[LoRAModulePath]] = None,
            chat_template: Optional[str] = None,
    ):
        self.response_role = response_role
        self.lora_modules = lora_modules
        self.chat_template = chat_template
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        self.engine_args = engine_args
        served_model_names: List[str] = get_served_model_names(self.engine_args)
        # Attach an extra Ray-aware Prometheus stat logger so that vLLM's
        # internal metrics are reported through Ray and appear on /metrics.
        additional_metrics_logger: RayPrometheusStatLogger = RayPrometheusStatLogger(
            local_interval=0.5,
            labels=dict(model_name=served_model_names[0]),
            max_model_len=self.engine_args.max_model_len,
        )
        self.engine.add_logger("ray", additional_metrics_logger)

Make sure that log stats are not disabled, i.e., do not pass --disable-log-stats (equivalently, keep disable_log_stats=False on the engine args); otherwise the stat logger will not receive any data.
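For completeness, a minimal sketch of how the deployment might be bound and run; the model name and serving options below are placeholders, and the only metrics-relevant detail is leaving disable_log_stats at False:

engine_args = AsyncEngineArgs(
    model="facebook/opt-125m",   # placeholder model; substitute your own
    disable_log_stats=False,     # keep stat logging enabled so metrics flow to the logger
)

deployment = VLLMDeployment.bind(
    engine_args=engine_args,
    response_role="assistant",
)

serve.run(deployment)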