Metric Name | Type | Unit | Implemented by TGI Already
---|---|---|---
model_load_time | Counter | Seconds |
time_per_output_token_per_batch_size | Histogram | Milliseconds | Not implemented directly; tgi_request_mean_time_per_token_duration_bucket covers this. Dividing per batch size is counterproductive for understanding, in our experience (batch_size also exists)
request_wait_time (total time - time spent on inference) | Histogram | Milliseconds | What's the difference with queue time? A request is either in queue or in inference, no?
request_queue_time | Histogram | Milliseconds | ? (tgi_request_queue_duration)
max_token_capacity | Counter | Tokens |
time_per_prefill_token | Histogram | Milliseconds | tgi_batch_inference_duration_bucket
total_tokens_in_current_batch | Gauge | Tokens | tgi_batch_current_max_tokens
time_to_first_token | Histogram | Milliseconds | tgi_request_queue_duration_bucket + tgi_batch_inference_duration_bucket{method="prefill"}
estimated_max_prefill_tokens_per_second | Gauge | Tokens | (Already derivable from previous metrics)
estimated_max_batch_before_compute_saturation | Gauge | Tokens | (How do you derive this automatically?)
request_input_length | Histogram | Tokens | $\checkmark$ (tgi_request_input_length)
request_output_length | Histogram | Tokens | $\checkmark$ (tgi_request_generated_tokens)
request_with_evicted_tokens | Counter | Count | What are evicted tokens?
total_evicted_tokens | Counter | Tokens |
I filled in a few of these and added some questions about what some of them are.
We provide a premade Grafana dashboard which should include everything needed to monitor LLM deployments: https://github.com/huggingface/text-generation-inference/blob/main/assets/tgi_grafana.json Docs: https://huggingface.co/docs/text-generation-inference/basic_tutorials/monitoring
By the way, on the topic of monitoring, we're slowly but surely moving to a different scheduling mechanism whose goal is to maximize compute occupancy.
https://github.com/huggingface/text-generation-inference/pull/1940
Basically we might not wait to pass new requests, but instead estimate the theoretical compute capacity and admit requests (or parts of requests) as they come in, packing as many tokens as possible within compute and VRAM usage bounds.
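Roughly, a toy sketch of that admission step (illustrative only, not the actual implementation in #1940; the field names and the per-step budget are made up):

```python
# Toy sketch: fill each forward pass with as many prefill tokens as a per-step
# budget allows, splitting a request's prompt across steps when it is too large.
def admit_prefill_chunks(waiting: list[dict], token_budget: int) -> list[tuple[dict, int]]:
    """Return (request, tokens_to_prefill) pairs that fit within this step's budget."""
    admitted: list[tuple[dict, int]] = []
    budget = token_budget
    for req in waiting:
        if budget <= 0:
            break
        take = min(req["remaining_prompt_tokens"], budget)
        if take > 0:
            admitted.append((req, take))  # may cover only part of this request's prompt
            budget -= take
    return admitted
```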
@Narsil Thanks for the prompt response. I can respond here to some of your questions. However, I encourage you to add commentary to the working group document if you have time: https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/edit?usp=sharing&resourcekey=0-ob5dR-AJxLQ5SvPlA4rdsg. The document might better answer your questions around the motivation and estimation methods for some of these metrics. And we've been hoping to get direct feedback from a TGI representative, given its popularity.
What's the difference with queue time ? a request is either in queue or in inference, no?
I suppose the answer to this depends on where you're doing your queuing and how you measure queue time. The spirit of the metric we proposed is to essentially capture model server overhead, which, again, depending on implementation could be a superset of queue time.
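As a toy illustration of how I think about the relationship (names are made up and not tied to any particular server):

```python
# Illustrative only: how the proposed request_wait_time relates to queue time.
def request_wait_time(total_e2e_s: float, inference_s: float) -> float:
    # Everything that is not forward-pass time: queueing, (de)serialization,
    # scheduling, any re-queueing delays, etc.
    return total_e2e_s - inference_s

def non_queue_overhead(total_e2e_s: float, inference_s: float, queue_s: float) -> float:
    # The part of the wait that plain queue time does not explain.
    return request_wait_time(total_e2e_s, inference_s) - queue_s
```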
What are evicted tokens?
My understanding is this refers to tokens evicted from KV cache during decoding which, according to the doc, "gives us an idea of the current load / capacity of the model server. This also helps understand any latency regression and difference in TPOT between the one model server reports versus the one we observe." This needs more discussion IMO.
Admittedly, evicted tokens are a function of the attention algorithm, which we intentionally choose to optimize for certain desired characteristics (your comment above about #1940). So I'm not sure how much the evicted-token metrics tell us that we aren't intuitively aware of, hence why we've categorized them in the lowest priority tier.
By the way, on the topic of monitoring, we're slowly but surely moving to a different scheduling mechanism whose goal is to maximize compute occupancy.
Correct me if I'm wrong (I need to read more about FlashDecoding), but wouldn't this optimize throughput at the expense of latency? What if users don't want to optimize for throughput but rather TTFT?
Thanks for taking a look @Narsil and the pointers to existing metrics and dashboards. And thanks @EandrewJones for driving this. To respond to some of the open questions:
Not implemented directly; tgi_request_mean_time_per_token_duration_bucket covers this. Dividing per batch size is counterproductive for understanding in our experience (batch_size also exists)
Batch size and TPOT are available separately. But they are bucketized individually which makes deriving TPOT per batch size infeasible. The reason to have this info is to understand how batch size is affecting TPOT so we can configure it appropriately.
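For illustration, roughly what such an instrument could look like with the Python prometheus_client (TGI itself is Rust; the metric and label names here are just placeholders):

```python
# Hypothetical sketch, not TGI code: a per-output-token latency histogram labeled by
# batch size, so TPOT can be sliced per batch size instead of being bucketized separately.
from prometheus_client import Histogram

TPOT_PER_BATCH_SIZE = Histogram(
    "time_per_output_token_seconds",  # placeholder metric name
    "Per-output-token latency, labeled by the batch size it was observed under",
    ["batch_size"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

def record_decode_step(step_duration_s: float, tokens_generated: int, batch_size: int) -> None:
    """Record one decode step; bucketing batch_size (e.g. 1-4, 5-16, ...) would cap label cardinality."""
    if tokens_generated > 0:
        TPOT_PER_BATCH_SIZE.labels(batch_size=str(batch_size)).observe(
            step_duration_s / tokens_generated
        )
```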
What's the difference with queue time ? a request is either in queue or in inference, no?
+1 to what @EandrewJones mentioned. There can be other overheads in the system, like serializing/deserializing requests and responses. I'm not sure if TGI preempts requests to re-queue them, but that is another case where we can see additional delays that are not captured by queue time alone.
What are evicted tokens?
Evicted tokens are cases where we have to preempt/evict requests or some tokens of a request to accommodate others. I'm not sure if TGI does preemption at all. If that is not the case, this might not be applicable.
ttft -> tgi_request_queue_duration_bucket + tgi_batch_inference_duration_bucket{method="prefill"}
Since both queue duration and inference duration are histograms with different buckets, I'm not sure if TTFT can be inferred from this accurately. It would be good to have a new metric for that.
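To illustrate the limitation: from the existing series you can only recover an approximate mean, never percentiles (toy sketch; the inputs are the usual Prometheus `_sum`/`_count` values):

```python
# Rough sketch: adding the two existing histograms only yields an approximate mean TTFT.
# Bucket boundaries differ, so p50/p95/p99 of the sum cannot be reconstructed, and this
# also assumes every request goes through exactly one prefill batch.
def approx_mean_ttft(queue_sum_s: float, queue_count: float,
                     prefill_sum_s: float, prefill_count: float) -> float:
    mean_queue = queue_sum_s / max(queue_count, 1.0)
    mean_prefill = prefill_sum_s / max(prefill_count, 1.0)
    return mean_queue + mean_prefill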
estimated_max_batch_before_compute_saturation -> How do you derive this automatically?
Agreed, I think this has to come from outside the model server itself: we observe throughput at different batch sizes and see where it starts plateauing while latency continues to increase.
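Something like this rough offline sweep is what I have in mind (thresholds are arbitrary placeholders):

```python
# Offline sketch (outside the model server): find the batch size where throughput
# stops scaling while per-token latency keeps rising.
def estimate_saturation_batch(samples: list[tuple[int, float, float]],
                              min_gain_ratio: float = 0.05) -> int:
    """samples: (batch_size, tokens_per_sec, per_token_latency_s), sorted by batch_size."""
    best = samples[0][0]
    for (b0, tps0, lat0), (b1, tps1, lat1) in zip(samples, samples[1:]):
        marginal_gain = (tps1 - tps0) / max(tps0, 1e-9)
        if marginal_gain < min_gain_ratio and lat1 > lat0:
            return best  # throughput plateaued while latency kept climbing
        best = b1
    return best
```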
depending on implementation could be a superset of queue time.
Makes sense.
tokens evicted from KV cache during decoding
Okay, this doesn't happen in TGI. Essentially vLLM does CPU swapping, and that's what you're referring to, I guess. We're trying to avoid adding swap mechanisms as much as possible; swapping tends to cause a lot of issues whenever you actually hit it. But that's a clear metric (I wanted to make sure it wasn't linked to truncation, or windowing in Mistral for instance, which are other ways tokens get ignored).
Correct me if I'm wrong (I need to read more about FlashDecoding), but wouldn't this optimize throughput at the expense of latency? What if users don't want to optimize for throughput but rather TTFT?
It shouldn't. There are 3 regimes an LLM inference server can operate in (to simplify slightly): memory bound, compute bound, and the boundary between the two.
The sweet spot in terms of compute is to be working right at the boundary of memory bound and compute bound. In that spot, you're at your best latency (mostly similar to a single user), and since you're compute bound you won't win anything by adding more load on your machine.
The idea we're looking into is that the scheduler would estimate how much "free compute" is still left on the GPU, and use any number of new requests (or parts of new requests if a request has too many tokens) to fill that "unused compute".
The priority space we're in is this: as a server admin, I'd like to operate at the compute/memory bound frontier, with either a slight preference for latency (my users' UX) or for throughput (my costs).
Of course nothing is as clear cut as this, and "compute bound" usually means highly diminishing returns on throughput rather than no returns at all, but the point stands: adding more users at some point is detrimental to most users. (Very similar idea to backpressure for standard web servers.)
Batch size and TPOT are available separately. But they are bucketized individually which makes deriving TPOT per batch size infeasible. The reason to have this info is to understand how batch size is affecting TPOT so we can configure it appropriately.
Interesting point. But then the metric is still lacking the dependency on the size of the request, right?
Running BS=2 (so 2 requests in flight) with past sequences of 32k tokens is likely to see a different TPOT than BS=2 with past lengths of 10, no? And what do you make of it when there's 1 request with past=10 and the other with past=32k?
There's a case where the attention layer over all the past starts to dominate, and even then measuring TPOT like you suggest is going to be quite noisy, no?
What you're suggesting definitely won't be affected by the bucketing noise, but I'm not sure it'll capture the full picture either.
I'm mentioning this because it does seem like long-context models are a real trend, and a lot of assumptions that held for context lengths <8k are not valid anymore (even at 32k we really seem to see weird side effects).
Thanks for the added context. Tracking 100%. To restate: it sounds like your aim is to estimate the theoretical inflection point in your memory-bound/compute-bound roofline model and schedule to keep yourself within some epsilon of that. That makes sense for achieving Pareto-optimal UX under fixed compute resources.
The idea we're looking into is that the scheduler would estimate how much "free compute" is still left on the GPU, and use any number of new requests (or parts of new requests if a request has too many tokens) to fill that "unused compute".
How do you intend to measure free compute?
How do you intend to measure free compute?
Well, the scheduler knows everything (past tokens for each query, number of running queries, available VRAM). The theoretical max is known from the hardware. The trick will be in how well it can be applied to various models (GQA, MoE, speculation, etc.), and how far from the theoretical max we're operating.
But for sure, with the added control in the kernels, we should take advantage of it.
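As a very rough illustration of the kind of estimate involved (heavily simplified; it ignores GQA/MoE/speculation and prefill/decode overlap, which is exactly the hard part, and all names and numbers are placeholders):

```python
from dataclasses import dataclass

@dataclass
class HardwareCaps:
    peak_tflops: float  # published dense TFLOP/s for the dtype in use

@dataclass
class ModelCosts:
    flops_per_token: float  # roughly 2 * active parameters for a dense decoder

def spare_prefill_tokens_per_sec(hw: HardwareCaps, m: ModelCosts,
                                 current_tokens_per_sec: float) -> float:
    """Tokens/s of additional prefill work that would fit before hitting the compute roof."""
    used_tflops = current_tokens_per_sec * m.flops_per_token / 1e12
    spare_tflops = max(hw.peak_tflops - used_tflops, 0.0)
    return spare_tflops * 1e12 / m.flops_per_token
```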
Here's how we've been thinking about the challenges of auto-scaling a similar workload aimed at "best cost-effective latency on given hardware", which is essentially what you're proposing.
**Choosing a metric and target**
To achieve the optimal concurrency, the scaling target metric needs to be guided by the derivative of throughput in tokens/sec over per-token latency as the completed request rate increases. When it is zero, the workload is at the optimal target. If negative, the workload should be scaled down. If positive, the workload should be scaled up. We have experimentally observed this relationship while benchmarking vLLM, and customers should be able to calculate it for a static traffic distribution. The resulting custom metric would be based on total requests to the server.
If a model server or intervening proxy can provide sufficiently accurate information, the derivative and the optimal request rate for the current traffic can be calculated directly as a metric that then becomes the HPA scale objective. A rolling window for the metrics, proportional to the traffic rate, would be necessary to ensure the target adapts to changing traffic conditions, which may also lead to oscillation.
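A sketch of that signal calculation (window size and tolerance are placeholders, not tuned values):

```python
# Rolling estimate of d(throughput)/d(per-token latency); near zero means the
# workload is at the optimal target, per the rule described above.
from collections import deque

class ScaleSignal:
    def __init__(self, window: int = 20, tol: float = 1e-3):
        self.samples: deque[tuple[float, float]] = deque(maxlen=window)  # (latency_s, tokens_per_s)
        self.tol = tol

    def observe(self, per_token_latency_s: float, tokens_per_sec: float) -> str:
        self.samples.append((per_token_latency_s, tokens_per_sec))
        if len(self.samples) < 2:
            return "hold"
        (l0, t0), (l1, t1) = self.samples[0], self.samples[-1]
        if abs(l1 - l0) < 1e-9:
            return "hold"
        d = (t1 - t0) / (l1 - l0)
        if d > self.tol:
            return "scale_up"    # throughput still rising with latency
        if d < -self.tol:
            return "scale_down"  # throughput falling as latency rises
        return "hold"            # approximately at the optimum
```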
**Discussion of challenges**
Without a dynamic metric, this target is likely to drift from optimal as traffic changes (quickly) or when model server versions are rolled out and performance changes (slowly). If the drift is significant, the result will be higher latency or an underutilized accelerator. The high compressibility of traffic would likely not result in a total failure, but the workload author would likely want to configure alerts on queueing.
An analysis of how broad the “sweet spot” is and whether it is sufficiently stable to target should be considered.
The above is informed heavily by vLLM, but your proposal is slightly different. My understanding is you'd expose a simpler set of metrics to scale on. Have you thought about how your approach translates into autoscaling strategy?
Here's a strategy based on my understanding of the implementation. Let's treat it as a straw man:
Using the aforementioned data from the scheduler, we can compute the delta between the theoretical optimum and our actual usage. If the server is above the optimum, it has hit an "excess load" state and starts queuing requests. Then we have to choose which metrics we want to autoscale on, assuming we haven't already hit a pre-defined max cluster size.
Excess Load Strategy | Admissions Setting | Metric to Autoscale on |
---|---|---|
Queue | Max requests set to target | queue depth > 0 |
Queue | Max requests set above target | total requests (queue + active) > target |
`queue depth > 0` is simplest to reason about and act on since you don't have to set a target. Scaling up is straightforward using this metric. Scaling down, however, is less clear cut.
How do we use the optimal compute usage per node information to derive custom metrics for a scale down policy? I have a few thoughts but am curious about yours.
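For concreteness, here is a straw-man of the scale-up side plus one possible (debatable) scale-down rule; thresholds and names are purely illustrative:

```python
# Straw-man reconcile step for the table above; not an agreed-upon policy.
def autoscale_decision(queue_depth: int, active_requests: int,
                       per_replica_target: int, replicas: int,
                       max_replicas: int, min_replicas: int = 1) -> int:
    """Return the desired replica count for one reconcile step."""
    total = queue_depth + active_requests
    if queue_depth > 0 and replicas < max_replicas:
        return replicas + 1  # excess load: the queue is non-empty
    # Scale down only when the fleet stays below target even with one fewer
    # replica, to avoid oscillating around the optimum.
    if replicas > min_replicas and total < per_replica_target * (replicas - 1):
        return replicas - 1
    return replicas
```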
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Commenting to keep it open. I see two metrics that are not available now that we should be able to add from the list here - https://github.com/huggingface/text-generation-inference/issues/1977#issuecomment-2144536581. One is model load time and the other is max token capacity. cc @Edwinhr716 since you were interested in this.
Feature request
TGI provides some valuable metrics on model performance and load today. However, there are still a number of missing metrics, the absence of which poses a challenge for orchestration and autoscaling in Kubernetes.
Additional Context
I believe TGI already uses OTel. OTel is in the process of adding support for LLM metrics which TGI may be able to piggyback off for some of the above. For reference, see OTel's LLM Semantic Convention WG (Please request access if you are not able to view it).
cc @Narsil @drbh
Motivation
If added, these metrics make it easier for orchestrators like Kubernetes to provide better support for autoscaling TGI servers or distributing load more efficiently. We have a proposal in the Kubernetes Serving WG to add these additional metrics to popular model servers. We want to add these to TGI as well.
Google doc link to the proposal which has the set of metrics we want to add and the reasoning behind it - https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/edit?usp=sharing&resourcekey=0-ob5dR-AJxLQ5SvPlA4rdsg. (Please request access if you are not able to view it)
Your contribution
I am happy to shepherd this work from the K8s WG side. I can contribute code as my bandwidth permits and where it makes sense. That said, I am not yet super familiar with the TGI code base. It would be great to have one or more champions from the TGI contributor side as well.