Closed by robertgshaw2-neuralmagic 2 weeks ago
Hi @robertgshaw2-neuralmagic I think the right place to ask this is the KServe community. In the meantime, here is my understanding. When qpext gets a request on 9088, it combines metrics from 9091 and the app port (the vLLM runtime port in this case) and returns the aggregated metrics.
Now, you could create your own K8s Service pointing at the aggregated port. The service you are referring to above is an internal Knative service and only exposes port 9091. This does not stop you from exposing metrics on some other port and scraping it independently with a ServiceMonitor. You could do the same with port 9088.
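For example, a standalone Service pointing at the aggregated port could look roughly like this. This is a sketch, not a tested manifest; the Service name and the selector label value are placeholders that would need to match your InferenceService's actual pod labels:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-aggregated-metrics        # hypothetical name
  labels:
    app: vllm-aggregated-metrics
spec:
  selector:
    # Placeholder: KServe labels predictor pods with the InferenceService name;
    # adjust this to match the labels on your pods.
    serving.kserve.io/inferenceservice: my-vllm-isvc
  ports:
    - name: http-agg-metrics
      port: 9088
      targetPort: 9088
```

A ServiceMonitor could then select this Service by its labels and scrape the `http-agg-metrics` port.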
Btw, 9088 is the qpext aggregation port; what is the vLLM runtime port from which qpext will get the metrics (is it the default 8080)? Have you tested whether those ports work within the container, i.e. do you get any metrics back? Another question is whether you are using Istio, since Istio also provides metrics aggregation and that affects the setup.
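To check whether those ports respond from inside the pod, something like the following should return Prometheus-format text if the endpoints are up (the pod name, container name, and runtime port are placeholders for your deployment):

```shell
# Placeholder pod name; find yours with: kubectl get pods
POD=my-vllm-isvc-predictor-xxxxx

# Runtime metrics (assumed default port 8080; adjust to your runtime's port)
kubectl exec "$POD" -c kserve-container -- curl -s localhost:8080/metrics | head

# queue-proxy metrics
kubectl exec "$POD" -- curl -s localhost:9091/metrics | head

# qpext aggregated metrics
kubectl exec "$POD" -- curl -s localhost:9088/metrics | head
```

If the 9088 endpoint returns both sets of metric families while 8080 and 9091 each return only one, aggregation is working inside the container and the remaining problem is purely one of exposure.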
Thanks @skonto - this is very helpful. I am somewhat new to Knative/KServe, so I am trying to learn the best practices around creating additional services versus updating Knative/KServe configs.
The vLLM runtime uses port 8000 for both the metrics endpoint and the user-facing API. I am going to change this.
Right now I have this set up using Istio for client connections from outside the cluster. Since the Prometheus server is running inside my cluster, I was not going through Istio for metrics aggregation.
Would you suggest I use Istio for scraping the prom metrics as well?
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with `/reopen`. Mark the issue as fresh by adding the comment `/remove-lifecycle stale`.
Ask your question here:
Hello! I am working on an integration between KServe/Knative and vLLM for deploying LLMs. vLLM is a production inference server for LLMs, and I have instrumented it with Prometheus metrics that are specific to LLM serving. For instance, the key items include TTFT (time-to-first-token) and TPOT (time-per-output-token). I want to use these metrics in addition to the generic metrics exposed by the `queue-proxy` container.

KServe has a feature called `qpext`, which enables aggregation of the `queue-proxy` container metrics with the `vllm` container metrics. `qpext` exposes the aggregated metrics on port 9088 and exposes the `queue-proxy` metrics on port 9091. The issue I am running into is that when I create my `InferenceService` (which uses Knative Serving), only port 9091 is exposed (this port is named `http-usermetric`).

As a result, when I create a `ServiceMonitor` to monitor my `InferenceService`, I am unable to query port `9088`, where the vLLM metrics are aggregated with the `queue-proxy` metrics.

I am going to proceed by using a `PodMonitor` for the time being, but I would prefer to use a `ServiceMonitor`, as this seems like best practice after my review of the Prometheus Operator documentation.

So my questions are:
- Is there a way to expose port 9088 in addition to the `http-usermetric` port that is exposed by the Knative services?
- Is using a `PodMonitor` consistent with best practices for monitoring user-defined metrics from applications inside Knative?

Apologies if this is the wrong place to ask this. I was not quite sure whether this made more sense to ask in the KServe or Knative forums.
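For reference, the PodMonitor workaround mentioned above can target the qpext aggregation port directly, since PodMonitor selects pods by label rather than going through a Service port. A rough sketch, with placeholder names and labels that would need to match the actual InferenceService pods:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-qpext-metrics           # hypothetical name
  labels:
    release: prometheus              # placeholder: match your Prometheus Operator's podMonitorSelector
spec:
  selector:
    matchLabels:
      # Placeholder: adjust to the labels KServe puts on your predictor pods.
      serving.kserve.io/inferenceservice: my-vllm-isvc
  podMetricsEndpoints:
    - targetPort: 9088               # qpext aggregated metrics port
      path: /metrics
      interval: 30s
```

This sidesteps the problem that the Knative-managed Service only declares the `http-usermetric` (9091) port, at the cost of coupling the monitor to pod labels instead of a Service.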