gilljon opened 4 months ago
Not sure if this is exactly what you were asking for, but we had a similar requirement and found an existing metrics test in the vllm-project repository which shows that this is already possible by attaching an additional stat logger. We tested it successfully: the metrics are now also exposed via the `/metrics` endpoint and can be scraped by, e.g., Prometheus.
In short:
```python
from typing import Optional, List, Union

from fastapi import FastAPI
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.metrics import RayPrometheusStatLogger
from vllm.entrypoints.openai.serving_engine import LoRAModulePath
# ... make other necessary imports

app = FastAPI()


def get_served_model_names(engine_args: AsyncEngineArgs) -> List[str]:
    """Resolve the served model name(s) from the engine args."""
    if engine_args.served_model_name is not None:
        served_model_names: Union[str, List[str]] = engine_args.served_model_name
        # The typing allows either a single string or a list of strings.
        if isinstance(served_model_names, str):
            served_model_names = [served_model_names]
    else:
        served_model_names = [engine_args.model]
    return served_model_names


@serve.deployment(name="VLLMDeployment")
@serve.ingress(app)
class VLLMDeployment:
    def __init__(
        self,
        engine_args: AsyncEngineArgs,
        response_role: str,
        lora_modules: Optional[List[LoRAModulePath]] = None,
        chat_template: Optional[str] = None,
    ):
        self.response_role = response_role
        self.lora_modules = lora_modules
        self.chat_template = chat_template
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        self.engine_args = engine_args

        # Attach an additional stat logger so vLLM's metrics are exported
        # through Ray and end up on the /metrics endpoint.
        served_model_names: List[str] = get_served_model_names(self.engine_args)
        additional_metrics_logger: RayPrometheusStatLogger = RayPrometheusStatLogger(
            local_interval=0.5,
            labels=dict(model_name=served_model_names[0]),
            max_model_len=self.engine_args.max_model_len,
        )
        self.engine.add_logger("ray", additional_metrics_logger)
```
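For completeness, here is a rough usage sketch of how such a deployment might be bound and launched with the standard Ray Serve APIs (`bind` / `serve.run`). The model name and `response_role` value below are placeholders, not values from this issue:

```python
# Hedged usage sketch: placeholder arguments, adapt to your own setup.
engine_args = AsyncEngineArgs(
    model="facebook/opt-125m",  # placeholder model
    disable_log_stats=False,    # stats must stay enabled (see note below)
)

deployment = VLLMDeployment.bind(
    engine_args=engine_args,
    response_role="assistant",  # placeholder role
)

# Start the Serve application; metrics should then flow through the Ray
# Prometheus exporter once requests are served.
serve.run(deployment)
```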
Make sure that the engine argument `disable_log_stats` is set to `False` (i.e., do not pass `--disable-log-stats`), otherwise no stats are emitted to the attached loggers.
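To sanity-check that the logger is wired up, you can fetch the metrics endpoint and look for vLLM-related series. A minimal sketch, assuming the metrics are reachable at a URL like the one below (the host and port are placeholders; adjust them to wherever `/metrics` is exposed in your deployment):

```python
import urllib.request

# Placeholder URL: point this at wherever /metrics is exposed in your setup
# (e.g. Ray's metrics export port or the Serve HTTP proxy, depending on config).
METRICS_URL = "http://localhost:8080/metrics"

with urllib.request.urlopen(METRICS_URL) as resp:
    body = resp.read().decode("utf-8")

# Print only the vLLM-related series to confirm they are being exported.
for line in body.splitlines():
    if "vllm" in line.lower():
        print(line)
```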
Description
vLLM exposes a number of important metrics, and we should make these accessible to downstream users. Currently, these metrics are not forwarded to the `/metrics` endpoint.
Use case
vLLM's metrics include LLM-specific ones such as TTFT, `e2e_request_latency_seconds`, and `avg_prompt_throughput_toks_per_s`. It would be fantastic to have these accessible in Ray Dashboards/Grafana.