single LLM serving (the model may or may not fit in one instance) in the cluster
Given a batch of input requests, the corresponding execution latency is $l_{exe} \approx t_{exe}(S_{in}) + S_{out} \cdot t_{exe}(1)$. Here $S_{in}$ is the sequence length of the input tokens provided by the users, and $S_{out}$ is the sequence length of the output tokens generated by the LLM. [ch2.1]
For example, if the input prompt is "ABCD" and the LLM outputs "EFG", then (assuming one token per character) $S_{in} = 4$ and $S_{out} = 3$.
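A minimal sketch of this latency model, assuming a hypothetical offline profile `t_exe_profile` that maps a token count to its measured execution time (names are illustrative, not from the paper):

```python
# Minimal sketch of l_exe ~= t_exe(S_in) + S_out * t_exe(1).
# t_exe_profile is a hypothetical callable backed by an offline profile.
def estimate_latency(s_in, s_out, t_exe_profile):
    prefill = t_exe_profile(s_in)       # process all input tokens in one pass
    decode = s_out * t_exe_profile(1)   # one forward pass per generated output token
    return prefill + decode

# e.g. for the "ABCD" -> "EFG" example: estimate_latency(4, 3, t_exe_profile)
```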
optimize end-to-end inference latency, while using spot instances to reduce cost.
Preemptible Instances (spot instances) in the cloud
graph algorithm, enumerate
yes
no. single LLM model
inference engine built on top of FasterTransformer
controller/server in C++/Python
what is the problem this paper is solving?
why is it important?
why is it challenging?
Serving LLMs on spot instances could be a worthwhile attempt, since it could reduce cost. Existing techniques are designed for distributed DNN training (on spot instances) and do not apply to generative LLM serving.
comments: in my opinion, it makes more sense to train DNNs on a dedicated cluster instead of unreliable spot instances, since training jobs typically have static throughput and need reliable infrastructure to avoid interrupts/failures. However, doing inference on spot instances makes sense to me: inference workloads have a more unpredictable rate and can tolerate some failures.
why is it challenging?
describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?
explain how the solution work
workload-aware adaptive configuration optimization algorithm: calculates a new parallelization plan when the input rate or the number of available instances changes
run it when the current serving capability is not compatible with $a_t$ due to changes in instance availability or serving workload.
at time $t$:
Algorithm: given $C_t, N_t, a_t$, decide $C_{t+1}$
The author claims this optimizer can finish within 1s since the latency estimation of different configurations is done offline in advance.
comment: really?
If all estimations are profiled in advance, it might be possible to enumerate all plans in several seconds. But not sure how large/accurate the profile could be.
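A rough sketch of what such an offline-profile-driven enumeration could look like (my reconstruction, not SpotServe's actual code; the `profile` dict keyed by data/pipeline/tensor parallel degrees and the feasibility checks are assumptions):

```python
# profile[(d, p, m)] is assumed to hold an offline-measured (latency, throughput)
# pair for d-way data, p-way pipeline, and m-way tensor parallelism.
def choose_config(n_gpus, arrival_rate, profile):
    best, best_lat = None, float("inf")
    for (d, p, m), (lat, tput) in profile.items():
        if d * p * m > n_gpus:       # plan must fit on the currently available GPUs
            continue
        if d * tput < arrival_rate:  # plan must keep up with the request arrival rate
            continue
        if lat < best_lat:           # among feasible plans, prefer the lowest latency
            best, best_lat = (d, p, m), lat
    return best
```

With all latencies and throughputs profiled offline, this is a pure table scan, which is why sub-second optimization seems plausible.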
identify a plan for migrating instances to execute the new target parallelization plan $C_{t+1}$, aiming to reuse the model parameters and KV cache available on existing GPU instances.
find a mapping (migration plan) from GPUs to model partitions under the new parallelization plan with the lowest amount of data transmission (i.e., maximize locality)
problem formulation: construct a bipartite graph $G=(V_a, V_t, E)$:
Then SpotServe transforms the optimal device mapping problem into a bipartite graph matching task and uses the Kuhn-Munkres (KM) algorithm to find a maximum-weight match, which maximally reuses the model parameters and KV cache on available GPU instances and minimizes the total data transmission.
Example: in Fig. 4b, $u_i$ indicates which part of the model is on which GPU under the existing parallelization plan $C_t$, and $v_i$ indicates the new model partition under $C_{t+1}$. Here we prefer to match $u_1$ with $v_0$ since it has more cached context to reuse.
comment: similar idea to leverage locality of model context, as in #320
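A minimal sketch of the device-mapping step above, using SciPy's assignment solver as a stand-in for a hand-written KM implementation (the `weight` matrix, encoding how much context GPU $i$ could reuse if assigned to position $j$ in $C_{t+1}$, is an assumption on my part):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_devices(weight):
    # weight[i][j]: reusable bytes (parameters + KV cache) if available GPU i
    # serves model-partition position j under the new plan C_{t+1}.
    w = np.asarray(weight)
    gpus, positions = linear_sum_assignment(w, maximize=True)  # maximum-weight matching
    return dict(zip(gpus.tolist(), positions.tolist()))
```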
how to execute the migration plan with the lowest overhead?
progressive migration schedule that utilizes the pipeline structure and prioritizes the migration of the front model layers' context. Then the front pipeline stages' instances can start serving, which can potentially overlap with the following stages' migration.
Memory-optimized migration considers the memory usage during the progressive migration process and selects the layer whose context migration minimizes the maximum instance buffer memory usage.
comments: ?
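A rough sketch of the memory-optimized selection idea as I read it (not the paper's exact algorithm; `ctx_bytes`, `dst_of`, and `buffer_used` are hypothetical inputs):

```python
# Greedily pick the next layer whose context migration keeps the maximum
# per-instance buffer usage as low as possible.
def memory_optimized_order(layer_ids, ctx_bytes, dst_of, buffer_used):
    order, remaining = [], set(layer_ids)
    while remaining:
        def peak_after(layer):
            # peak buffer usage across instances if `layer` is migrated next
            usage = dict(buffer_used)
            usage[dst_of[layer]] += ctx_bytes[layer]
            return max(usage.values())
        nxt = min(remaining, key=peak_after)
        buffer_used[dst_of[nxt]] += ctx_bytes[nxt]
        order.append(nxt)
        remaining.remove(nxt)
    return order
```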
decide when to terminate the inference engine and start the context migration for each GPU instance
Suppose a batch of input requests is ready to serve at time $t$; SpotServe determines the number of decoding iterations $S_t$ to serve before migrating to the new plan $C_{t+1}$:
So after the Parallelization Controller calculates a new plan (I assume periodically) and the Migration Planner calculates a migration plan plus the estimated migration time, SpotServe determines the number of decoding iterations $S_t$ to continue serving before migrating to the new plan. While serving these $S_t$ tokens, SpotServe can do some preparation in the background (e.g., initialize new instances, deploy model parameters to them). After finishing the $S_t$ tokens, it starts the migration (copying the KV cache to new instances, which may cause some downtime). Stateful Inference Recovery ensures the migration can finish before the grace period expires.
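A minimal sketch of how $S_t$ could be chosen under the grace period, per my reading (the inputs and the simple time-budget formula are assumptions, not the paper's exact formulation):

```python
# Keep decoding only as long as the remaining grace period still leaves
# enough time for the estimated context migration.
def pick_num_iterations(grace_period, est_migration_time, t_decode, max_new_tokens):
    budget = grace_period - est_migration_time  # time still available for decoding
    if budget <= 0:
        return 0                                # migrate immediately
    return min(max_new_tokens, int(budget // t_decode))
```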
describe the experimental setup
summarize the main results
model:
spot instance:
collect a real 12-hour availability trace (e.g., how many spot instances are available over time, and when preemptions happen) with AWS g4dn spot instances, and replay the trace on AWS g4dn.12xlarge instances.
workload rate: $A_S$, $B_S$, $A_{S+O}$, $B_{S+O}$. See Fig. 5.
request rerouting: reroutes interrupted requests to other available pipelines when a preemption happens. It keeps a fixed, predefined optimal model-parallel configuration and drops/adds inference pipelines adaptively.
reparallelization: changes the parallel configuration like SpotServe, but has to restart and reinitialize all instances without context migration.
static workload rate + spot instance availability changing dynamically
dynamic workload rate ($A'_{S+O}$ and $B'_{S+O}$) + spot instance availability changing dynamically
when doesn't it work?
what assumptions does the paper make and when are they valid?
list of main competitors and how they differ
serverless functions are designed to be lightweight with limited computational power, memory, and storage, and are hard to provision with GPUs [14]. Also, serverless functions cannot directly communicate with each other, which is necessary to support distributed inference of LLMs.
If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:
Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work
[ch2.1]
as the output sequence grows longer, the memory footprint of the KV cache keeps expanding, which can be huge in real workloads (e.g., 1.7 GB per sequence in LLaMA-13B [7], or even terabytes in OPT-175B [37]).
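A back-of-the-envelope check of the 1.7 GB figure, assuming LLaMA-13B's 40 layers, hidden size 5120, fp16 KV entries, and a 2048-token sequence:

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_elem=2):
    # 2x for storing both the key and the value tensor per layer
    return 2 * num_layers * hidden_size * seq_len * bytes_per_elem

print(kv_cache_bytes(40, 5120, 2048) / 1e9)  # ~1.68 GB, in line with the ~1.7 GB cited
```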
https://arxiv.org/pdf/2311.15566.pdf