pentium3 / sys_reading

system paper reading notes

SpotServe: Serving Generative Large Language Models on Preemptible Instances #352

Closed pentium3 closed 3 months ago

pentium3 commented 3 months ago

https://arxiv.org/pdf/2311.15566.pdf

pentium3 commented 3 months ago

summary

key problem

workload

serving a single LLM (which may or may not fit on one instance) in the cluster

Given a batch of input requests, the corresponding execution latency is $l_{exe} \approx t_{exe}(S_{in}) + S_{out} \cdot t_{exe}(1)$. Here $S_{in}$ is the sequence length of the input tokens provided by the users, and $S_{out}$ is the sequence length of the output tokens generated by the LLM. [ch2.1]

For example, if the input prompt is "ABCD" and the LLM outputs "EFG", then $S_{in} = 4$ and $S_{out} = 3$: the prefill phase processes the 4 input tokens in one pass, and the decoding phase then runs 3 iterations, generating one token per iteration.
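
A minimal sketch of this latency model (not the paper's code; `t_exe` is assumed to be an offline-profiled function giving the time of one forward pass over $n$ tokens):

```python
def estimated_latency(s_in: int, s_out: int, t_exe) -> float:
    # one prefill pass over the S_in prompt tokens, plus S_out decoding
    # iterations that each generate a single token
    return t_exe(s_in) + s_out * t_exe(1)

# toy example with a hypothetical linear profile t_exe(n) = 5 ms + 0.5 ms * n
t_exe = lambda n: 0.005 + 0.0005 * n
print(estimated_latency(4, 3, t_exe))  # "ABCD" -> "EFG"
```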

optimization goal

optimize end-to-end inference latency, while using spot instances to reduce cost.

configurations to tune

scenario

Preemptible instances (spot instances) in the cloud

technique

graph algorithm (bipartite matching for device mapping), enumeration of parallelization configurations

dynamic workload?

yes

multi-tenant?

no. single LLM model

implementation

inference engine over FasterTransformer

controller/server in C++/Python

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

Serving LLMs over spot instances is a worthwhile attempt since it can reduce cost. However, existing techniques are designed for distributed DNN training (on spot instances) and do not apply to generative LLM serving.

comments: in my opinion, it makes more sense to train DNNs on a dedicated cluster instead of unreliable spot instances, since training jobs typically have static throughput and need reliable infrastructure to avoid interrupts/failures. However, doing inference on spot instances makes sense to me: inference workloads have more unpredictable rates and can tolerate some failures.

why is it challenging?

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?


Solution description

explain how the solution work

Parallelization Controller [ch3.2]

workload-aware adaptive configuration optimization algorithm: calculates a new parallelization plan when the input rate or the number of available instances changes

run it when the current serving capability is not compatible with $a_t$ due to changes in instances' availability or serving workload.

at time $t$:

Algorithm: given $C_t$, $N_t$, $a_t$ (the current parallel configuration, the number of available instances, and the serving workload at time $t$), decide $C_{t+1}$


The author claims this optimizer can finish within 1 s since the latency estimation of different configurations is done offline in advance.
comment: really? If all estimations are profiled in advance, it might be possible to enumerate all plans in several seconds. But I am not sure how large/accurate the profile can be.
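
A minimal sketch of how I read the configuration optimizer; the names (D, P, M) for data-, pipeline-, and tensor-parallel degrees, `latency_profile`, and `throughput_profile` are my own, and both profiles are assumed to be built offline as the paper describes:

```python
from itertools import product

def choose_config(n_available, arrival_rate, latency_profile, throughput_profile):
    """Enumerate parallel configurations (D, P, M) that fit on the available
    instances, keep those whose profiled throughput can sustain the arrival
    rate, and pick the one with the lowest profiled latency."""
    best, best_latency = None, float("inf")
    for d, p, m in product(range(1, n_available + 1), repeat=3):
        cfg = (d, p, m)
        if d * p * m > n_available or cfg not in latency_profile:
            continue
        if throughput_profile[cfg] < arrival_rate:   # cannot keep up with workload
            continue
        if latency_profile[cfg] < best_latency:
            best, best_latency = cfg, latency_profile[cfg]
    return best
```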

Device Mapper and Migration Planner

identify a plan for migrating instances to the new target parallelization plan $C_{t+1}$, aiming to reuse the model parameters and KV cache already available on existing GPU instances.

Device Mapper

find a mapping (migration plan) from GPUs to model partitions under the new parallelization plan with the lowest amount of data transmission (i.e., maximize locality)

problem formulation: construct a bipartite graph $G=(V_a, V_t, E)$ between the available GPU instances ($V_a$) and the device slots of the target configuration ($V_t$), where edge weights capture how much context a GPU could reuse for a given slot.

SpotServe then transforms the optimal device mapping problem into a bipartite graph matching task and uses the KM (Kuhn-Munkres) algorithm to find a maximum-weight match, which maximally reuses the model parameters and KV cache on available GPU instances and minimizes the total data transmission.

(Figure 4 from the paper.)

Example: in Fig. 4b, $u_i$ indicates which part of the model is on which GPU under the existing parallelization plan $C_t$, while $v_i$ indicates the new model partition plan under $C_{t+1}$. Here we prefer to match $u_1$ with $v_0$, as it has more cached context to reuse.

comment: similar idea to leverage locality of model context, as in #320
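
A minimal sketch of the device-mapping step, assuming we already know `reuse[i][j]`: the amount of parameters + KV cache that instance $u_i$ could reuse if assigned to partition $v_j$ under $C_{t+1}$. The paper uses the KM algorithm; scipy's `linear_sum_assignment` is used here as an equivalent maximum-weight bipartite matcher:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_devices(reuse: np.ndarray):
    """reuse[i, j] = reusable bytes if instance i is assigned to partition j."""
    # linear_sum_assignment minimizes cost, so negate to maximize total reuse
    rows, cols = linear_sum_assignment(-reuse)
    return list(zip(rows, cols))   # (instance, partition) pairs

# toy example: instance 1 holds most of the context needed by partition 0
reuse = np.array([[1.0, 0.2],
                  [3.0, 0.5]])
print(map_devices(reuse))  # instance 0 -> partition 1, instance 1 -> partition 0
```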

Migration Planner

how to execute the migration plan with the lowest overhead.

progressive migration schedule that utilizes the pipeline structure and prioritizes migrating the front model layers' context first. The front pipeline stages' instances can then start serving, which can potentially be overlapped with the migration of the following stages.
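
A minimal sketch of a progressive, front-first migration order (my own toy representation, not the paper's data structure); migrating the front stages' layers first lets those stages resume execution earliest, overlapping their serving with the migration of later stages:

```python
def migration_order(contexts):
    """contexts: list of (stage_index, layer_index, nbytes) items to migrate."""
    # front pipeline stages first, then layers in order within each stage
    return sorted(contexts, key=lambda c: (c[0], c[1]))

plan = migration_order([(2, 0, 512), (0, 1, 256), (0, 0, 256), (1, 0, 512)])
# -> stage 0's layers first, then stage 1, then stage 2
```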

Memory-Optimized Migration considers memory usage during the progressive migration process: it selects the layer whose context migration minimizes the maximum instance buffer memory usage.

comments: ?

Stateful Inference Recovery

decide when to terminate the inference engine and start the context migration for each GPU instance

Suppose a batch of input requests is ready to serve at time $t$. SpotServe determines the number of decoding iterations $S_t$ to serve before migrating to the new plan $C_{t+1}$.

So after the Parallelization Controller calculates a new plan (I assume periodically), and the Migration Planner calculates a migration plan and an estimated migration time, SpotServe determines the number of decoding iterations $S_t$ to continue serving before migrating to the new plan. While serving these $S_t$ tokens, SpotServe can do some preparation in the background (e.g., initialize new instances, deploy model parameters to new instances). After finishing the $S_t$ tokens, it starts the migration (copying the KV cache to the new instances, which may lead to some downtime). Stateful inference recovery ensures that the migration can finish before the grace period expires.
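
A minimal sketch of how $S_t$ could be chosen under this reading; all parameter names are my own, and `t_decode_one` / `t_migrate_est` are assumed to come from offline profiling and the Migration Planner respectively:

```python
def choose_decoding_iterations(t_now: float,
                               grace_deadline: float,
                               t_decode_one: float,    # profiled time per decoding iteration
                               t_migrate_est: float,   # Migration Planner's estimate
                               s_out_remaining: int) -> int:
    """Keep decoding only as long as the remaining iterations plus the
    estimated migration still fit inside the grace period."""
    budget = grace_deadline - t_now - t_migrate_est
    if budget <= 0:
        return 0                              # migrate immediately
    s_t = int(budget // t_decode_one)         # iterations that still fit
    return min(s_t, s_out_remaining)          # never more than needed
```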

Important results

describe the experimental setup
summarize the main results

setup

model:


spot instance:

collect a real 12-hour availability trace (e.g., how many spot instances are available over time, and when preemptions happen) from AWS g4dn spot instances, and replay the trace on AWS g4dn.12xlarge instances.

workload rate: $A_S$, $B_S$, $A_{S+O}$, $B_{S+O}$. See Fig. 5.

baseline

request rerouting: reroutes interrupted requests to other available pipelines when a preemption happens; keeps a fixed predefined optimal model-parallel configuration and drops/adds inference pipelines adaptively.

reparallelization: changes the parallel configuration like SpotServe does, but has to restart and reinitialize all instances without context migration.

result

static workload rate + spot instance changing dynamically

dynamic workload rate ($A'_{S+O}$ and $B'_{S+O}$) + spot instances changing dynamically

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

Closely related work

list of main competitors and how they differ

Serverless functions are designed to be lightweight with limited computational power, memory, and storage, and are hard to provision with GPUs [14]. Moreover, serverless functions cannot directly communicate with each other, which is also necessary to support distributed inference of LLMs.

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work
pentium3 commented 3 months ago

[ch2.1]

as the output sequence grows longer, the memory space of KV cache keeps expanding, which can be huge in real workloads (i.e., 1.7 GB per-sequence in LLaMA-13B [7], or even terabytes in OPT-175B [37]).
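
A back-of-the-envelope check of the 1.7 GB figure, assuming LLaMA-13B has 40 transformer layers with hidden size 5120, the KV cache is stored in fp16 (2 bytes per element), and the sequence length is 2048:

```python
n_layers, hidden, bytes_per_elem = 40, 5120, 2
kv_bytes_per_token = 2 * n_layers * hidden * bytes_per_elem   # one K and one V vector per layer
seq_len = 2048
print(kv_bytes_per_token * seq_len / 1e9)  # ~1.68 GB per sequence, matching the ~1.7 GB claim
```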

pentium3 commented 2 months ago

https://docs.google.com/presentation/d/1sxrYpnOIfgl5z0bVHPZDobtBqb5McV7DabBH2VFiSYI/edit?usp=sharing