pentium3 / sys_reading

system paper reading notes

SpotServe: Serving Generative Large Language Models on Preemptible Instances #352

Closed pentium3 closed 3 months ago

pentium3 commented 3 months ago

https://arxiv.org/pdf/2311.15566.pdf

pentium3 commented 3 months ago

summary

key problem

workload

serving a single LLM (which may or may not fit on one instance) in the cluster

Given a batch of input requests, the corresponding execution latency is $l_{exe} \approx t_{exe}(S_{in}) + S_{out} \cdot t_{exe}(1)$. Here $S_{in}$ is the sequence length of the input tokens provided by the users, and $S_{out}$ is the sequence length of the output tokens generated by the LLM. [ch2.1]

For example, if the input prompt is "ABCD" and the LLM outputs "EFG", then $S_{in} = 4$ and $S_{out} = 3$: the prefill phase processes the 4 input tokens in one pass, and the decoding phase then runs 3 iterations, generating one token per iteration.
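
A minimal sketch of this latency model (not the paper's code; `t_exe` is assumed to be an offline-profiled function giving the time of one forward pass over $n$ tokens):

```python
def estimated_latency(s_in: int, s_out: int, t_exe) -> float:
    # one prefill pass over the S_in prompt tokens, plus S_out decoding
    # iterations that each generate a single token
    return t_exe(s_in) + s_out * t_exe(1)

# toy example with a hypothetical linear profile t_exe(n) = 5 ms + 0.5 ms * n
t_exe = lambda n: 0.005 + 0.0005 * n
print(estimated_latency(4, 3, t_exe))  # "ABCD" -> "EFG"
```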

optimization goal

optimize end-to-end inference latency, while using spot instances to reduce cost.

configurations to tune

scenario

Preemptible instances (spot instances) in the cloud

technique

graph algorithm (bipartite matching for device mapping), enumeration of parallelization configurations

dynamic workload?

yes

multi-tenant?

no. single LLM model

implementation

inference engine over FasterTransformer

controller/server in C++/Python

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

Serving LLMs over spot instances is a worthwhile attempt since it can reduce cost. However, existing techniques are designed for distributed DNN training (on spot instances) and do not apply to generative LLM serving.

comments: in my opinion, it makes more sense to train DNNs on a dedicated cluster instead of unreliable spot instances, since training jobs typically have static throughput and need reliable infrastructure to avoid interrupts/failures. However, doing inference on spot instances makes sense to me: inference workloads have more unpredictable rates and can tolerate some failures.

why is it challenging?

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?


Solution description

explain how the solution work

Parallelization Controller [ch3.2]

workload-aware adaptive configuration optimization algorithm: calculates a new parallelization plan when the input rate or the number of available instances changes

run it when the current serving capability is not compatible with $a_t$ due to changes in instances' availability or serving workload.

at time $t$:

Algorithm: given $C_t$, $N_t$, $a_t$ (the current parallel configuration, the number of available instances, and the serving workload at time $t$), decide $C_{t+1}$


The author claims this optimizer can finish within 1 s since the latency estimation of different configurations is done offline in advance.
comment: really? If all estimations are profiled in advance, it might be possible to enumerate all plans in several seconds. But I am not sure how large/accurate the profile can be.
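
A minimal sketch of how I read the configuration optimizer; the names (D, P, M) for data-, pipeline-, and tensor-parallel degrees, `latency_profile`, and `throughput_profile` are my own, and both profiles are assumed to be built offline as the paper describes:

```python
from itertools import product

def choose_config(n_available, arrival_rate, latency_profile, throughput_profile):
    """Enumerate parallel configurations (D, P, M) that fit on the available
    instances, keep those whose profiled throughput can sustain the arrival
    rate, and pick the one with the lowest profiled latency."""
    best, best_latency = None, float("inf")
    for d, p, m in product(range(1, n_available + 1), repeat=3):
        cfg = (d, p, m)
        if d * p * m > n_available or cfg not in latency_profile:
            continue
        if throughput_profile[cfg] < arrival_rate:   # cannot keep up with workload
            continue
        if latency_profile[cfg] < best_latency:
            best, best_latency = cfg, latency_profile[cfg]
    return best
```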

Device Mapper and Migration Planner

identify a plan for migrating instances to the new target parallelization plan $C_{t+1}$, aiming to reuse the model parameters and KV cache already available on existing GPU instances.

Device Mapper

find a mapping (migration plan) from GPUs to model partitions under the new parallelization plan with the lowest amount of data transmission (i.e., maximize locality)

problem formulation: construct a bipartite graph $G=(V_a, V_t, E)$ between the available GPU instances ($V_a$) and the device slots of the target configuration ($V_t$), where edge weights capture how much context a GPU could reuse for a given slot.

SpotServe then transforms the optimal device mapping problem into a bipartite graph matching task and uses the KM (Kuhn-Munkres) algorithm to find a maximum-weight match, which maximally reuses the model parameters and KV cache on available GPU instances and minimizes the total data transmission.

(Figure 4 from the paper.)

Example: in Fig. 4b, $u_i$ indicates which part of the model is on which GPU under the existing parallelization plan $C_t$, while $v_i$ indicates the new model partition plan under $C_{t+1}$. Here we prefer to match $u_1$ with $v_0$, as it has more cached context to reuse.

comment: similar idea to leverage locality of model context, as in #320
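
A minimal sketch of the device-mapping step, assuming we already know `reuse[i][j]`: the amount of parameters + KV cache that instance $u_i$ could reuse if assigned to partition $v_j$ under $C_{t+1}$. The paper uses the KM algorithm; scipy's `linear_sum_assignment` is used here as an equivalent maximum-weight bipartite matcher:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_devices(reuse: np.ndarray):
    """reuse[i, j] = reusable bytes if instance i is assigned to partition j."""
    # linear_sum_assignment minimizes cost, so negate to maximize total reuse
    rows, cols = linear_sum_assignment(-reuse)
    return list(zip(rows, cols))   # (instance, partition) pairs

# toy example: instance 1 holds most of the context needed by partition 0
reuse = np.array([[1.0, 0.2],
                  [3.0, 0.5]])
print(map_devices(reuse))  # instance 0 -> partition 1, instance 1 -> partition 0
```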

Migration Planner

how to execute the migration plan with the lowest overhead.

progressive migration schedule that utilizes the pipeline structure and prioritizes migrating the front model layers' context first. The front pipeline stages' instances can then start serving, which can potentially be overlapped with the migration of the following stages.
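
A minimal sketch of a progressive, front-first migration order (my own toy representation, not the paper's data structure); migrating the front stages' layers first lets those stages resume execution earliest, overlapping their serving with the migration of later stages:

```python
def migration_order(contexts):
    """contexts: list of (stage_index, layer_index, nbytes) items to migrate."""
    # front pipeline stages first, then layers in order within each stage
    return sorted(contexts, key=lambda c: (c[0], c[1]))

plan = migration_order([(2, 0, 512), (0, 1, 256), (0, 0, 256), (1, 0, 512)])
# -> stage 0's layers first, then stage 1, then stage 2
```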

Memory-Optimized Migration considers memory usage during the progressive migration process: it selects the layer whose context migration minimizes the maximum instance buffer memory usage.

comments: ?

Stateful Inference Recovery

decide when to terminate the inference engine and start the context migration for each GPU instance

Suppose a batch of input requests is ready to serve at time $t$. SpotServe determines the number of decoding iterations $S_t$ to serve before migrating to the new plan $C_{t+1}$.

So after the Parallelization Controller calculates a new plan (I assume periodically), and the Migration Planner calculates a migration plan and an estimated migration time, SpotServe determines the number of decoding iterations $S_t$ to continue serving before migrating to the new plan. While serving these $S_t$ tokens, SpotServe can do some preparation in the background (e.g., initialize new instances, deploy model parameters to new instances). After finishing the $S_t$ tokens, it starts the migration (copying the KV cache to the new instances, which may lead to some downtime). Stateful inference recovery ensures that the migration can finish before the grace period expires.
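
A minimal sketch of how $S_t$ could be chosen under this reading; all parameter names are my own, and `t_decode_one` / `t_migrate_est` are assumed to come from offline profiling and the Migration Planner respectively:

```python
def choose_decoding_iterations(t_now: float,
                               grace_deadline: float,
                               t_decode_one: float,    # profiled time per decoding iteration
                               t_migrate_est: float,   # Migration Planner's estimate
                               s_out_remaining: int) -> int:
    """Keep decoding only as long as the remaining iterations plus the
    estimated migration still fit inside the grace period."""
    budget = grace_deadline - t_now - t_migrate_est
    if budget <= 0:
        return 0                              # migrate immediately
    s_t = int(budget // t_decode_one)         # iterations that still fit
    return min(s_t, s_out_remaining)          # never more than needed
```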

Important results

describe the experimental setup
summarize the main results

setup

model:


spot instance:

collect a real 12-hour availability trace (e.g., how many spot instances are available over time, and when preemptions happen) from AWS g4dn spot instances, and replay the trace on AWS g4dn.12xlarge instances.

workload rate: $A_S$, $B_S$, $A_{S+O}$, $B_{S+O}$. See Fig. 5.

baseline

request rerouting: reroutes interrupted requests to other available pipelines when a preemption happens; keeps a fixed predefined optimal model-parallel configuration and drops/adds inference pipelines adaptively.

reparallelization: changes the parallel configuration like SpotServe does, but has to restart and reinitialize all instances without context migration.

result

static workload rate + spot instance changing dynamically

dynamic workload rate ($A'_{S+O}$ and $B'_{S+O}$) + spot instances changing dynamically

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

Closely related work

list of main competitors and how they differ

Serverless functions are designed to be lightweight with limited computational power, memory, and storage, and are hard to provision with GPUs [14]. Moreover, serverless functions cannot directly communicate with each other, which is also necessary to support distributed inference of LLMs.

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work
pentium3 commented 3 months ago

[ch2.1]

as the output sequence grows longer, the memory space of KV cache keeps expanding, which can be huge in real workloads (i.e., 1.7 GB per-sequence in LLaMA-13B [7], or even terabytes in OPT-175B [37]).
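
A back-of-the-envelope check of the 1.7 GB figure, assuming LLaMA-13B has 40 transformer layers with hidden size 5120, the KV cache is stored in fp16 (2 bytes per element), and the sequence length is 2048:

```python
n_layers, hidden, bytes_per_elem = 40, 5120, 2
kv_bytes_per_token = 2 * n_layers * hidden * bytes_per_elem   # one K and one V vector per layer
seq_len = 2048
print(kv_bytes_per_token * seq_len / 1e9)  # ~1.68 GB per sequence, matching the ~1.7 GB claim
```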

pentium3 commented 2 months ago

https://docs.google.com/presentation/d/1sxrYpnOIfgl5z0bVHPZDobtBqb5McV7DabBH2VFiSYI/edit?usp=sharing