Metric Name | Type | Unit | Implemented by TGI Already
---|---|---|---
model_load_time | Counter | Seconds |
time_per_output_token_per_batch_size | Histogram | Milliseconds | Not implemented directly; tgi_request_mean_time_per_token_duration_bucket covers this. Dividing per batch size is counterproductive for understanding, in our experience (batch_size also exists)
request_wait_time (total time - time spent on inference) | Histogram | Milliseconds | What's the difference with queue time? A request is either in queue or in inference, no?
request_queue_time | Histogram | Milliseconds | ? (tgi_request_queue_duration)
max_token_capacity | Counter | Tokens |
time_per_prefill_token | Histogram | Milliseconds | tgi_batch_inference_duration_bucket
total_tokens_in_current_batch | Gauge | Tokens | tgi_batch_current_max_tokens
time_to_first_token | Histogram | Milliseconds | tgi_request_queue_duration_bucket + tgi_batch_inference_duration_bucket{method="prefill"}
estimated_max_prefill_tokens_per_second | Gauge | Tokens | (Already derivable from previous metrics)
estimated_max_batch_before_compute_saturation | Gauge | Tokens | (How do you derive this automatically?)
request_input_length | Histogram | Tokens | $\checkmark$ (tgi_request_input_length)
request_output_length | Histogram | Tokens | $\checkmark$ (tgi_request_generated_tokens)
request_with_evicted_tokens | Counter | Count | What are evicted tokens?
total_evicted_tokens | Counter | Tokens |
I filled in a few of these and added some questions about what some of them are.
We provide a premade Grafana dashboard which should include everything needed to monitor LLM deployments: https://github.com/huggingface/text-generation-inference/blob/main/assets/tgi_grafana.json Docs: https://huggingface.co/docs/text-generation-inference/basic_tutorials/monitoring
By the way, on the topic of monitoring, we're slowly but surely moving to a different scheduling mechanism whose goal is to maximize compute occupancy.
https://github.com/huggingface/text-generation-inference/pull/1940
Basically we might not wait to pass new requests, but instead estimate the theoretical compute capacity and admit requests (or parts of requests) as they come in, packing as many tokens as possible within compute and VRAM usage bounds.
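Roughly, a toy sketch of that admission step (illustrative only, not the actual implementation in #1940; the field names and the per-step budget are made up):

```python
# Toy sketch: fill each forward pass with as many prefill tokens as a per-step
# budget allows, splitting a request's prompt across steps when it is too large.
def admit_prefill_chunks(waiting: list[dict], token_budget: int) -> list[tuple[dict, int]]:
    """Return (request, tokens_to_prefill) pairs that fit within this step's budget."""
    admitted: list[tuple[dict, int]] = []
    budget = token_budget
    for req in waiting:
        if budget <= 0:
            break
        take = min(req["remaining_prompt_tokens"], budget)
        if take > 0:
            admitted.append((req, take))  # may cover only part of this request's prompt
            budget -= take
    return admitted
```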
@Narsil Thanks for the prompt response. I can respond here to some of your questions. However, I encourage you to add commentary to the working group document if you have time: https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/edit?usp=sharing&resourcekey=0-ob5dR-AJxLQ5SvPlA4rdsg. The document might better answer your questions around the motivation and estimation methods for some of these metrics. And we've been hoping to get direct feedback from a TGI representative, given its popularity.
What's the difference with queue time ? a request is either in queue or in inference, no?
I suppose the answer to this depends on where you're doing your queuing and how you measure queue time. The spirit of the metric we proposed is to essentially capture model server overhead, which, again, depending on implementation could be a superset of queue time.
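As a toy illustration of how I think about the relationship (names are made up and not tied to any particular server):

```python
# Illustrative only: how the proposed request_wait_time relates to queue time.
def request_wait_time(total_e2e_s: float, inference_s: float) -> float:
    # Everything that is not forward-pass time: queueing, (de)serialization,
    # scheduling, any re-queueing delays, etc.
    return total_e2e_s - inference_s

def non_queue_overhead(total_e2e_s: float, inference_s: float, queue_s: float) -> float:
    # The part of the wait that plain queue time does not explain.
    return request_wait_time(total_e2e_s, inference_s) - queue_s
```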
What are evicted tokens?
My understanding is this refers to tokens evicted from KV cache during decoding which, according to the doc, "gives us an idea of the current load / capacity of the model server. This also helps understand any latency regression and difference in TPOT between the one model server reports versus the one we observe." This needs more discussion IMO.
Admittedly, evicted tokens are a function of the attention algorithm, which we intentionally choose to optimize for certain desired characteristics (your comment above about #1940). So I'm not sure how much the evicted-token metrics tell us that we aren't intuitively aware of, hence why we've categorized them in the lowest priority tier.
By the way, on the topic of monitoring, we're slowly but surely moving to a different scheduling mechanism whose goal is to maximize compute occupancy.
Correct me if I'm wrong (I need to read more about FlashDecoding), but wouldn't this optimize throughput at the expense of latency? What if users don't want to optimize for throughput but rather TTFT?
Thanks for taking a look @Narsil and the pointers to existing metrics and dashboards. And thanks @EandrewJones for driving this. To respond to some of the open questions:
Not implemented directly; tgi_request_mean_time_per_token_duration_bucket covers this. Dividing per batch size is counterproductive for understanding in our experience (batch_size also exists)
Batch size and TPOT are available separately. But they are bucketized individually which makes deriving TPOT per batch size infeasible. The reason to have this info is to understand how batch size is affecting TPOT so we can configure it appropriately.
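For illustration, roughly what such an instrument could look like with the Python prometheus_client (TGI itself is Rust; the metric and label names here are just placeholders):

```python
# Hypothetical sketch, not TGI code: a per-output-token latency histogram labeled by
# batch size, so TPOT can be sliced per batch size instead of being bucketized separately.
from prometheus_client import Histogram

TPOT_PER_BATCH_SIZE = Histogram(
    "time_per_output_token_seconds",  # placeholder metric name
    "Per-output-token latency, labeled by the batch size it was observed under",
    ["batch_size"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

def record_decode_step(step_duration_s: float, tokens_generated: int, batch_size: int) -> None:
    """Record one decode step; bucketing batch_size (e.g. 1-4, 5-16, ...) would cap label cardinality."""
    if tokens_generated > 0:
        TPOT_PER_BATCH_SIZE.labels(batch_size=str(batch_size)).observe(
            step_duration_s / tokens_generated
        )
```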
What's the difference with queue time ? a request is either in queue or in inference, no?
+1 to what @EandrewJones mentioned. There can be other overheads in the system, like serializing/deserializing requests and responses. I'm not sure if TGI preempts requests to re-queue them, but that is another case where we can see additional delays that are not captured by queue time alone.
What are evicted tokens?
Evicted tokens are cases where we have to preempt/evict requests or some tokens of a request to accommodate others. I'm not sure if TGI does preemption at all. If that is not the case, this might not be applicable.
ttft -> tgi_request_queue_duration_bucket + tgi_batch_inference_duration_bucket{method="prefill"}
Since both queue duration and inference duration are histograms with different buckets, I'm not sure if TTFT can be inferred from this accurately. It would be good to have a new metric for that.
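To illustrate the limitation: from the existing series you can only recover an approximate mean, never percentiles (toy sketch; the inputs are the usual Prometheus `_sum`/`_count` values):

```python
# Rough sketch: adding the two existing histograms only yields an approximate mean TTFT.
# Bucket boundaries differ, so p50/p95/p99 of the sum cannot be reconstructed, and this
# also assumes every request goes through exactly one prefill batch.
def approx_mean_ttft(queue_sum_s: float, queue_count: float,
                     prefill_sum_s: float, prefill_count: float) -> float:
    mean_queue = queue_sum_s / max(queue_count, 1.0)
    mean_prefill = prefill_sum_s / max(prefill_count, 1.0)
    return mean_queue + mean_prefill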
estimated_max_batch_before_compute_saturation -> How do you derive this automatically?
Agreed, I think this has to come from outside the model server itself: we observe throughput at different batch sizes and see where it starts plateauing while latency continues to increase.
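Something like this rough offline sweep is what I have in mind (thresholds are arbitrary placeholders):

```python
# Offline sketch (outside the model server): find the batch size where throughput
# stops scaling while per-token latency keeps rising.
def estimate_saturation_batch(samples: list[tuple[int, float, float]],
                              min_gain_ratio: float = 0.05) -> int:
    """samples: (batch_size, tokens_per_sec, per_token_latency_s), sorted by batch_size."""
    best = samples[0][0]
    for (b0, tps0, lat0), (b1, tps1, lat1) in zip(samples, samples[1:]):
        marginal_gain = (tps1 - tps0) / max(tps0, 1e-9)
        if marginal_gain < min_gain_ratio and lat1 > lat0:
            return best  # throughput plateaued while latency kept climbing
        best = b1
    return best
```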
depending on implementation could be a superset of queue time.
Makes sense.
tokens evicted from KV cache during decoding
Okay, this doesn't happen in TGI. Essentially vLLM does CPU swapping, and that's what you're referring to, I guess. We're trying to avoid adding swap mechanisms as much as possible; swapping tends to cause a lot of issues whenever you actually hit it. But that's a clear metric (I wanted to make sure it wasn't linked to truncation, or windowing in Mistral for instance, which are other ways tokens get ignored).
Correct me if I'm wrong (I need to read more about FlashDecoding), but wouldn't this optimize throughput at the expense of latency? What if users don't want to optimize for throughput but rather TTFT?
It shouldn't. There are 3 regimes an LLM inference server can operate in (to simplify slightly): memory bound, compute bound, and the boundary between the two.
The sweet spot in terms of compute is to be working right at the boundary of memory bound and compute bound. In that spot, you're at your best latency (mostly similar to a single user), and since you're compute bound you won't win anything by adding more load on your machine.
The idea we're looking into is that the scheduler would estimate how much "free compute" is still left on the GPU, and use any number of new requests (or parts of new requests if a request has too many tokens) to fill that "unused compute".
The priority space we're in is this: as a server admin, I'd like to operate at the compute/memory bound frontier, with either a slight preference for latency (my users' UX) or for throughput (my costs).
Of course nothing is as clear cut as this, and "compute bound" usually means highly diminishing returns on throughput rather than no returns at all, but the point stands: adding more users at some point is detrimental to most users. (Very similar idea to backpressure for standard web servers.)
Batch size and TPOT are available separately. But they are bucketized individually which makes deriving TPOT per batch size infeasible. The reason to have this info is to understand how batch size is affecting TPOT so we can configure it appropriately.
Interesting point. But then the metric is still lacking the dependency on the size of the request, right?
Running BS=2 (so 2 requests in flight) with past sequences of 32k tokens is likely to see a different TPOT than BS=2 with past lengths of 10, no? And what do you make of it when there's 1 request with past=10 and the other with past=32k?
There's a case where the attention layer over all the past starts to dominate, and even then measuring TPOT like you suggest is going to be quite noisy, no?
What you're suggesting definitely won't be affected by the bucketing noise, but I'm not sure it'll capture the full picture either.
I'm mentioning this because it does seem like long-context models are a real trend, and a lot of assumptions that held for context lengths <8k are not valid anymore (even at 32k we really seem to see weird side effects).
Thanks for the added context. Tracking 100%. To restate: it sounds like your aim is to estimate the theoretical inflection point in your memory-bound/compute-bound roofline model and schedule to keep yourself within some epsilon of that. That makes sense for achieving Pareto-optimal UX under fixed compute resources.
The idea we're looking into is that the scheduler would estimate how much "free compute" is still left on the GPU, and use any number of new requests (or parts of new requests if a request has too many tokens) to fill that "unused compute".
How do you intend to measure free compute?
How do you intend to measure free compute?
Well, the scheduler knows everything (past tokens for each query, number of running queries, available VRAM). The theoretical max is known from the hardware. The trick will be in how well it can be applied to various models (GQA, MoE, speculation, etc.), and how far from the theoretical max we're operating.
But for sure, with the added control in the kernels, we should take advantage of it.
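As a very rough illustration of the kind of estimate involved (heavily simplified; it ignores GQA/MoE/speculation and prefill/decode overlap, which is exactly the hard part, and all names and numbers are placeholders):

```python
from dataclasses import dataclass

@dataclass
class HardwareCaps:
    peak_tflops: float  # published dense TFLOP/s for the dtype in use

@dataclass
class ModelCosts:
    flops_per_token: float  # roughly 2 * active parameters for a dense decoder

def spare_prefill_tokens_per_sec(hw: HardwareCaps, m: ModelCosts,
                                 current_tokens_per_sec: float) -> float:
    """Tokens/s of additional prefill work that would fit before hitting the compute roof."""
    used_tflops = current_tokens_per_sec * m.flops_per_token / 1e12
    spare_tflops = max(hw.peak_tflops - used_tflops, 0.0)
    return spare_tflops * 1e12 / m.flops_per_token
```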
Here's how we've been thinking about the challenges of auto-scaling a similar workload aimed at "best cost-effective latency on given hardware", which is essentially what you're proposing.
**Choosing a metric and target**
To achieve the optimal concurrency, the scaling target metric needs to be guided by the derivative of throughput in tokens/sec over per-token latency as the completed request rate increases. When it is zero, the workload is at the optimal target. If negative, the workload should be scaled down. If positive, the workload should be scaled up. We have experimentally observed this relationship while benchmarking vLLM, and customers should be able to calculate it for a static traffic distribution. The resulting custom metric would be based on total requests to the server.
If a model server or intervening proxy can provide sufficiently accurate information, the derivative and the optimal request rate for the current traffic can be calculated directly as a metric that then becomes the HPA scale objective. A rolling window for the metrics, proportional to the traffic rate, would be necessary to ensure the target adapts to changing traffic conditions, which may also lead to oscillation.
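A sketch of that signal calculation (window size and tolerance are placeholders, not tuned values):

```python
# Rolling estimate of d(throughput)/d(per-token latency); near zero means the
# workload is at the optimal target, per the rule described above.
from collections import deque

class ScaleSignal:
    def __init__(self, window: int = 20, tol: float = 1e-3):
        self.samples: deque[tuple[float, float]] = deque(maxlen=window)  # (latency_s, tokens_per_s)
        self.tol = tol

    def observe(self, per_token_latency_s: float, tokens_per_sec: float) -> str:
        self.samples.append((per_token_latency_s, tokens_per_sec))
        if len(self.samples) < 2:
            return "hold"
        (l0, t0), (l1, t1) = self.samples[0], self.samples[-1]
        if abs(l1 - l0) < 1e-9:
            return "hold"
        d = (t1 - t0) / (l1 - l0)
        if d > self.tol:
            return "scale_up"    # throughput still rising with latency
        if d < -self.tol:
            return "scale_down"  # throughput falling as latency rises
        return "hold"            # approximately at the optimum
```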
**Discussion of challenges**
Without a dynamic metric, this target is likely to drift from optimal as traffic changes (quickly) or when model server versions are rolled out and performance changes (slowly). If the drift is significant, the result will be higher latency or an underutilized accelerator. The high compressibility of traffic would likely not result in a total failure, but the workload author would likely want to configure alerts on queueing.
An analysis of how broad the “sweet spot” is and whether it is sufficiently stable to target should be considered.
The above is informed heavily by vLLM, but your proposal is slightly different. My understanding is you'd expose a simpler set of metrics to scale on. Have you thought about how your approach translates into autoscaling strategy?
Here's a strategy based on my understanding of the implementation. Let's treat it as a straw man:
Using the aforementioned data from the scheduler, we can compute the delta between the theoretical optimum and our actual usage. If the server is above the optimum, it has hit an "excess load" state and starts queuing requests. Then we have to choose which metrics we want to autoscale on, assuming we haven't already hit a pre-defined max cluster size.
Excess Load Strategy | Admissions Setting | Metric to Autoscale on |
---|---|---|
Queue | Max requests set to target | queue depth > 0 |
Queue | Max requests set above target | total requests (queue + active) > target |
`queue depth > 0` is simplest to reason about and act on since you don't have to set a target. Scaling up is straightforward using this metric. Scaling down, however, is less clear cut.
How do we use the optimal compute usage per node information to derive custom metrics for a scale down policy? I have a few thoughts but am curious about yours.
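For concreteness, here is a straw-man of the scale-up side plus one possible (debatable) scale-down rule; thresholds and names are purely illustrative:

```python
# Straw-man reconcile step for the table above; not an agreed-upon policy.
def autoscale_decision(queue_depth: int, active_requests: int,
                       per_replica_target: int, replicas: int,
                       max_replicas: int, min_replicas: int = 1) -> int:
    """Return the desired replica count for one reconcile step."""
    total = queue_depth + active_requests
    if queue_depth > 0 and replicas < max_replicas:
        return replicas + 1  # excess load: the queue is non-empty
    # Scale down only when the fleet stays below target even with one fewer
    # replica, to avoid oscillating around the optimum.
    if replicas > min_replicas and total < per_replica_target * (replicas - 1):
        return replicas - 1
    return replicas
```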
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Commenting to keep it open. I see two metrics that are not available now that we should be able to add from the list here - https://github.com/huggingface/text-generation-inference/issues/1977#issuecomment-2144536581. One is model load time and the other is max token capacity. cc @Edwinhr716 since you were interested in this.
Feature request
TGI provides some valuable metrics on model performance and load today. However, there are still a number of missing metrics, the absence of which poses a challenge for orchestration and autoscaling in Kubernetes.
Additional Context
I believe TGI already uses OTel. OTel is in the process of adding support for LLM metrics which TGI may be able to piggyback off for some of the above. For reference, see OTel's LLM Semantic Convention WG (Please request access if you are not able to view it).
cc @Narsil @drbh
Motivation
If added, these metrics make it easier for orchestrators like Kubernetes to provide better support for autoscaling TGI servers or distributing load more efficiently. We have a proposal in the Kubernetes Serving WG to add these additional metrics to popular model servers. We want to add these to TGI as well.
Google doc link to the proposal which has the set of metrics we want to add and the reasoning behind it - https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/edit?usp=sharing&resourcekey=0-ob5dR-AJxLQ5SvPlA4rdsg. (Please request access if you are not able to view it)
Your contribution
I am happy to shepherd this work from the K8s WG side. I can contribute code as my bandwidth permits and where it makes sense. That said, I am not yet super familiar with the TGI code base. It would be great to have one or more champions from the TGI contributor side as well.