pentium3 / sys_reading

system paper reading notes

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models #320

Closed. pentium3 closed this 3 months ago

pentium3 commented 4 months ago

https://arxiv.org/pdf/2401.14351.pdf

pentium3 commented 3 months ago

summary

key problem

workload

LLM inference

special characteristic: LLM inference latency is difficult to predict because response time depends on the output length, which can vary significantly [24, 39, 77] due to iterative token generation. An inference request carries a user-specified input prompt (a list of tokens); the LLM instance then generates one token per iteration, conditioned on the prompt and all previously generated tokens, until the end-of-sentence token appears, so the total inference time is nondeterministic. In each iteration, the LLM caches intermediate computations in a KV cache to speed up generation of the next token. [ch2.1]
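A minimal sketch (mine, not the paper's code) of this decoding loop, showing why the number of iterations, and hence latency, is unknown in advance; `forward` and the EOS token id are hypothetical stand-ins for a real LLM:

```python
# Minimal sketch of autoregressive decoding with a KV cache. `forward` and
# the EOS token id are hypothetical stand-ins for a real LLM.

EOS = 0

def generate(forward, prompt_tokens, max_new_tokens=1024):
    tokens = list(prompt_tokens)
    kv_cache = None                      # per-request state, grows each step
    for _ in range(max_new_tokens):      # loop length is unknown up front
        # only the newest token is re-processed; earlier work sits in the cache
        new = tokens if kv_cache is None else tokens[-1:]
        next_token, kv_cache = forward(new, kv_cache)
        tokens.append(next_token)
        if next_token == EOS:            # stop condition decides total latency
            break
    return tokens[len(prompt_tokens):]
```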

optimization goal

xxxxx

configurations to tune

xxxxx

scenario

serverless inference: in such a setup, developers upload their LLM checkpoints, including model execution files and model parameter files, to a storage system. Upon receiving a request, the system uses a model loading scheduler to allocate available GPUs and start up the checkpoint there; a request router then directs the inference request to the selected GPUs.

[figure: serverless LLM inference workflow: checkpoint storage, model loading scheduler, request router]
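A toy sketch of this control path; class and method names below are illustrative, not ServerlessLLM's actual API:

```python
# Toy sketch of the serverless inference control path described above
# (model loading scheduler + request router). Names are illustrative only.

class LoadingScheduler:
    def __init__(self, free_gpus):
        self.free_gpus = dict(free_gpus)        # {server_id: free GPU count}

    def allocate(self, model_id):
        # pick any server with a free GPU to cold-start the checkpoint on
        for server, free in self.free_gpus.items():
            if free > 0:
                self.free_gpus[server] -= 1
                return server
        raise RuntimeError(f"no free GPU to cold-start {model_id}")

class RequestRouter:
    def __init__(self):
        self.placements = {}                    # {model_id: server_id}

    def register(self, model_id, server):
        self.placements[model_id] = server

    def route(self, model_id, prompt):
        server = self.placements[model_id]
        return f"send prompt ({len(prompt)} tokens) for {model_id} to {server}"

# cold-start a model on one of two servers, then route a request to it
scheduler = LoadingScheduler({"server-1": 1, "server-2": 0})
router = RequestRouter()
router.register("llama-7b", scheduler.allocate("llama-7b"))
print(router.route("llama-7b", [101, 2023, 2003, 1037, 3231]))
```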

technique

xxxxx

dynamic workload?

xxxxx

multi-tenant?

xxxxx

implementation

xxxxx

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

challenges [ch2.2]: when cold-starting a new instance during peak load:

opportunities:

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?

Solution description

explain how the solution work

1. designed new ckpt format [ch4.1]
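The section is only named here; as a rough sketch of what a loading-optimized checkpoint format can look like (this layout is my assumption, not the paper's exact spec): a small index of tensor metadata plus one flat binary blob, so restoring weights becomes sequential bulk reads at known offsets instead of framework-specific deserialization.

```python
# Rough sketch of a loading-optimized checkpoint layout (assumed layout, not
# the paper's exact spec): a JSON index of tensor metadata plus one flat
# binary file of raw tensor bytes at known offsets.
import json
import numpy as np

def save_checkpoint(tensors, index_path, data_path):
    index, offset = {}, 0
    with open(data_path, "wb") as data:
        for name, tensor in tensors.items():
            raw = np.ascontiguousarray(tensor).tobytes()
            index[name] = {"dtype": str(tensor.dtype),
                           "shape": list(tensor.shape),
                           "offset": offset, "size": len(raw)}
            data.write(raw)
            offset += len(raw)
    with open(index_path, "w") as f:
        json.dump(index, f)

def load_tensor(index, data_path, name):
    # sequential read at a known offset; no parsing of a pickled object graph
    meta = index[name]
    with open(data_path, "rb") as data:
        data.seek(meta["offset"])
        raw = data.read(meta["size"])
    return np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])
```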

2. Efficient Multi-Tier Checkpoint Loading [ch4.2]

when loading a model ckpt from SSD to GPU, split it into chunks and pipeline the transfers across the storage tiers (SSD → DRAM → GPU) so that reads and copies on different tiers overlap, rather than loading the whole checkpoint one tier at a time.
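A sketch of the pipelining idea, assuming chunked reads into a bounded in-DRAM buffer; the "GPU copy" is just a callback here, and the chunk size is an arbitrary choice of mine:

```python
# Sketch of pipelined, chunk-based checkpoint loading across tiers
# (SSD -> DRAM -> GPU). The GPU copy is simulated via a callback; the chunk
# size and two-stage pipeline illustrate the idea, not the exact implementation.
import queue, threading

CHUNK_SIZE = 16 * 1024 * 1024          # 16 MiB chunks (arbitrary)

def load_pipelined(path, copy_to_gpu):
    chunks = queue.Queue(maxsize=4)    # bounded buffer of in-DRAM chunks

    def disk_reader():                 # stage 1: SSD -> DRAM
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                chunks.put(chunk)
        chunks.put(None)               # sentinel: end of file

    threading.Thread(target=disk_reader, daemon=True).start()

    while (chunk := chunks.get()) is not None:   # stage 2: DRAM -> GPU
        copy_to_gpu(chunk)             # overlaps with the next disk read
```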

3. design Live Migration Process [ch5]

motivation: when do we need live migration?

in this example, server 2 is currently serving model A, and now we want to serve model B. Serving model B on server 2 is better because model B's checkpoint is already in server 2's memory, which gives better locality (GPU > DRAM > SSD in loading speed).

[figure: locality-driven scheduling example for Models A and B across Servers 1 and 2]

The live-migration-supported locality-driven policy prioritizes locality without disrupting Model A. It first preloads Model A's checkpoint on Server 1 while inference continues on Server 2; once Model A is ready on Server 1, its intermediate state is transferred there and the inference resumes seamlessly. Model B then starts on Server 2, taking advantage of locality. This policy optimizes latency for both Models A and B.

Live Migration Process: token-based migration: only the model checkpoint (which is read-only and unchanged during inference) is loaded on the new server; the source sends all tokens produced so far, and the new server re-computes the KV cache from them instead of transferring the large KV cache directly.
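A sketch of token-based migration under these assumptions; `load_model` and `prefill` are hypothetical helpers, not the paper's API:

```python
# Sketch of token-based migration: only token ids travel between servers; the
# destination reloads the (read-only) checkpoint locally and rebuilds the KV
# cache with one prefill pass. `load_model` and `prefill` are hypothetical.

def migrate_request(dest_server, model_id, prompt_tokens, generated_tokens):
    # 1. destination loads the checkpoint from its fastest local tier
    model = dest_server.load_model(model_id)

    # 2. source ships only token ids (a few KB), never the multi-GB KV cache
    tokens_so_far = list(prompt_tokens) + list(generated_tokens)

    # 3. destination recomputes the KV cache in a single batched prefill,
    #    which is far cheaper than redoing token-by-token decoding
    kv_cache = model.prefill(tokens_so_far)

    # 4. decoding resumes on the destination from the last generated token
    return model, kv_cache, tokens_so_far
```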

4. Locality-Aware Server Allocation [ch6]

For each inference request, decide which server should serve which model (this may require model loading or migration). The scheduler incorporates estimators for model loading time and model migration time and chooses the best server depending on the status of each server.

Estimating Model Loading Time: based on which storage tier (GPU, DRAM, or SSD) the checkpoint currently resides in, the bandwidth of that tier, and the models already queued to load on the same server, calculate the total download time for them.
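A sketch of such an estimator; the bandwidth numbers below are placeholders, not the paper's measurements:

```python
# Sketch of a loading-time estimator: a checkpoint queued behind other loads
# on the same server/tier finishes after (queued bytes + its own bytes)
# divided by the tier's bandwidth. Bandwidth values are placeholders.

TIER_BANDWIDTH = {"dram": 20e9, "ssd": 3e9}      # bytes per second, assumed

def estimate_loading_time(model_bytes, queued_bytes, tier):
    return (queued_bytes + model_bytes) / TIER_BANDWIDTH[tier]   # seconds

# e.g. a 14 GB checkpoint behind 14 GB of already-queued loads on SSD:
print(estimate_loading_time(14e9, 14e9, "ssd"))   # ~9.3 s
```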

Estimating Model Migration Time

$a \times (t_{in} + t_{out}) + b$, i.e., a linear function of the number of input (prompt) tokens $t_{in}$ and already-generated output tokens $t_{out}$ whose KV cache must be recomputed on the destination server, with fitted coefficients $a$ and $b$.

scheduling policy: to select the optimal server for model migration, ServerlessLLM employs a dynamic programming approach that minimizes the estimated migration time. comments: detailed explanation??
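A greedy simplification (not the paper's dynamic program) of how the two estimators can feed server selection; the coefficients $a$, $b$ and the numbers in the example are illustrative:

```python
# Greedy simplification of locality-aware server selection: for each candidate
# server, total start-up cost = time to free it (migration, 0 if idle) plus
# time to load the target model there; pick the minimum. Not the paper's
# dynamic-programming formulation.

def estimate_migration_time(t_in, t_out, a, b):
    # KV-cache recomputation cost grows with the number of tokens to replay
    return a * (t_in + t_out) + b

def choose_server(candidates):
    # candidates: list of dicts with "server", "loading_time", "migration_time"
    return min(candidates, key=lambda c: c["migration_time"] + c["loading_time"])

# example: server-2 holds the checkpoint in DRAM but must migrate a request
candidates = [
    {"server": "server-1", "loading_time": 9.3, "migration_time": 0.0},
    {"server": "server-2", "loading_time": 0.7,
     "migration_time": estimate_migration_time(t_in=512, t_out=128, a=0.002, b=0.1)},
]
print(choose_server(candidates)["server"])   # -> "server-2" (0.7 + 1.38 < 9.3)
```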

Important results

describe the experimental setup
summarize the main results

baseline: random policy and #243

some flaws:

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

Closely related work

list of main competitors and how they differ

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work