pentium3 / sys_reading

system paper reading notes

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models #320

Closed. pentium3 closed this 3 months ago

pentium3 commented 4 months ago

https://arxiv.org/pdf/2401.14351.pdf

pentium3 commented 3 months ago

summary

key problem

workload

LLM inference

special characteristic: LLM inference latency is difficult to predict because response time depends on the output length, which can vary significantly [24, 39, 77] due to iterative token generation. An inference request carries a user-specified input prompt (a list of tokens); the LLM instance then generates one token per iteration, conditioned on the prompt and all previously generated tokens, until the end-of-sentence token appears, so the total inference time is nondeterministic. In each iteration, the LLM caches intermediate computations in a KV cache to speed up generation of the next token. [ch2.1]
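A minimal sketch (mine, not the paper's code) of this decoding loop, showing why the number of iterations, and hence latency, is unknown in advance; `forward` and the EOS token id are hypothetical stand-ins for a real LLM:

```python
# Minimal sketch of autoregressive decoding with a KV cache. `forward` and
# the EOS token id are hypothetical stand-ins for a real LLM.

EOS = 0

def generate(forward, prompt_tokens, max_new_tokens=1024):
    tokens = list(prompt_tokens)
    kv_cache = None                      # per-request state, grows each step
    for _ in range(max_new_tokens):      # loop length is unknown up front
        # only the newest token is re-processed; earlier work sits in the cache
        new = tokens if kv_cache is None else tokens[-1:]
        next_token, kv_cache = forward(new, kv_cache)
        tokens.append(next_token)
        if next_token == EOS:            # stop condition decides total latency
            break
    return tokens[len(prompt_tokens):]
```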

optimization goal

xxxxx

configurations to tune

xxxxx

scenario

serverless inference: in such a setup, developers upload their LLM checkpoints, including model execution files and model parameter files, to a storage system. Upon receiving a request, the system uses a model loading scheduler to allocate available GPUs and start up the checkpoint there; a request router then directs the inference request to the selected GPUs.

[figure: serverless LLM inference workflow: checkpoint storage, model loading scheduler, request router]
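A toy sketch of this control path; class and method names below are illustrative, not ServerlessLLM's actual API:

```python
# Toy sketch of the serverless inference control path described above
# (model loading scheduler + request router). Names are illustrative only.

class LoadingScheduler:
    def __init__(self, free_gpus):
        self.free_gpus = dict(free_gpus)        # {server_id: free GPU count}

    def allocate(self, model_id):
        # pick any server with a free GPU to cold-start the checkpoint on
        for server, free in self.free_gpus.items():
            if free > 0:
                self.free_gpus[server] -= 1
                return server
        raise RuntimeError(f"no free GPU to cold-start {model_id}")

class RequestRouter:
    def __init__(self):
        self.placements = {}                    # {model_id: server_id}

    def register(self, model_id, server):
        self.placements[model_id] = server

    def route(self, model_id, prompt):
        server = self.placements[model_id]
        return f"send prompt ({len(prompt)} tokens) for {model_id} to {server}"

# cold-start a model on one of two servers, then route a request to it
scheduler = LoadingScheduler({"server-1": 1, "server-2": 0})
router = RequestRouter()
router.register("llama-7b", scheduler.allocate("llama-7b"))
print(router.route("llama-7b", [101, 2023, 2003, 1037, 3231]))
```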

technique

xxxxx

dynamic workload?

xxxxx

multi-tenant?

xxxxx

implementation

xxxxx

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

challenges [ch2.2]: when cold-starting a new instance during peak load:

opportunities:

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?

Solution description

explain how the solution work

1. designed new ckpt format [ch4.1]
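The section is only named here; as a rough sketch of what a loading-optimized checkpoint format can look like (this layout is my assumption, not the paper's exact spec): a small index of tensor metadata plus one flat binary blob, so restoring weights becomes sequential bulk reads at known offsets instead of framework-specific deserialization.

```python
# Rough sketch of a loading-optimized checkpoint layout (assumed layout, not
# the paper's exact spec): a JSON index of tensor metadata plus one flat
# binary file of raw tensor bytes at known offsets.
import json
import numpy as np

def save_checkpoint(tensors, index_path, data_path):
    index, offset = {}, 0
    with open(data_path, "wb") as data:
        for name, tensor in tensors.items():
            raw = np.ascontiguousarray(tensor).tobytes()
            index[name] = {"dtype": str(tensor.dtype),
                           "shape": list(tensor.shape),
                           "offset": offset, "size": len(raw)}
            data.write(raw)
            offset += len(raw)
    with open(index_path, "w") as f:
        json.dump(index, f)

def load_tensor(index, data_path, name):
    # sequential read at a known offset; no parsing of a pickled object graph
    meta = index[name]
    with open(data_path, "rb") as data:
        data.seek(meta["offset"])
        raw = data.read(meta["size"])
    return np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])
```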

2. Efficient Multi-Tier Checkpoint Loading [ch4.2]

when loading a model ckpt from SSD to GPU, split it into chunks and pipeline the transfers across the storage tiers (SSD → DRAM → GPU) so that reads and copies on different tiers overlap, rather than loading the whole checkpoint one tier at a time.
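A sketch of the pipelining idea, assuming chunked reads into a bounded in-DRAM buffer; the "GPU copy" is just a callback here, and the chunk size is an arbitrary choice of mine:

```python
# Sketch of pipelined, chunk-based checkpoint loading across tiers
# (SSD -> DRAM -> GPU). The GPU copy is simulated via a callback; the chunk
# size and two-stage pipeline illustrate the idea, not the exact implementation.
import queue, threading

CHUNK_SIZE = 16 * 1024 * 1024          # 16 MiB chunks (arbitrary)

def load_pipelined(path, copy_to_gpu):
    chunks = queue.Queue(maxsize=4)    # bounded buffer of in-DRAM chunks

    def disk_reader():                 # stage 1: SSD -> DRAM
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                chunks.put(chunk)
        chunks.put(None)               # sentinel: end of file

    threading.Thread(target=disk_reader, daemon=True).start()

    while (chunk := chunks.get()) is not None:   # stage 2: DRAM -> GPU
        copy_to_gpu(chunk)             # overlaps with the next disk read
```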

3. design Live Migration Process [ch5]

motivation: when do we need live migration?

in this example, server 2 is currently serving model A, and now we want to serve model B. Serving model B on server 2 is better because model B's checkpoint is already in server 2's memory, which gives better locality (GPU > DRAM > SSD in loading speed).

[figure: locality-driven scheduling example for Models A and B across Servers 1 and 2]

The live-migration-supported locality-driven policy prioritizes locality without disrupting Model A. It first preloads Model A's checkpoint on Server 1 while inference continues on Server 2; once Model A is ready on Server 1, its intermediate state is transferred there and the inference resumes seamlessly. Model B then starts on Server 2, taking advantage of locality. This policy optimizes latency for both Models A and B.

Live Migration Process: token-based migration: only the model checkpoint (which is read-only and unchanged during inference) is loaded on the new server; the source sends all tokens produced so far, and the new server re-computes the KV cache from them instead of transferring the large KV cache directly.
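A sketch of token-based migration under these assumptions; `load_model` and `prefill` are hypothetical helpers, not the paper's API:

```python
# Sketch of token-based migration: only token ids travel between servers; the
# destination reloads the (read-only) checkpoint locally and rebuilds the KV
# cache with one prefill pass. `load_model` and `prefill` are hypothetical.

def migrate_request(dest_server, model_id, prompt_tokens, generated_tokens):
    # 1. destination loads the checkpoint from its fastest local tier
    model = dest_server.load_model(model_id)

    # 2. source ships only token ids (a few KB), never the multi-GB KV cache
    tokens_so_far = list(prompt_tokens) + list(generated_tokens)

    # 3. destination recomputes the KV cache in a single batched prefill,
    #    which is far cheaper than redoing token-by-token decoding
    kv_cache = model.prefill(tokens_so_far)

    # 4. decoding resumes on the destination from the last generated token
    return model, kv_cache, tokens_so_far
```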

4. Locality-Aware Server Allocation [ch6]

For each inference request, decide which server should serve which model (this may require model loading or migration). The scheduler incorporates estimators for model loading time and model migration time and chooses the best server depending on the status of each server.

Estimating Model Loading Time: based on which storage tier (GPU, DRAM, or SSD) the checkpoint currently resides in, the bandwidth of that tier, and the models already queued to load on the same server, calculate the total download time for them.
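A sketch of such an estimator; the bandwidth numbers below are placeholders, not the paper's measurements:

```python
# Sketch of a loading-time estimator: a checkpoint queued behind other loads
# on the same server/tier finishes after (queued bytes + its own bytes)
# divided by the tier's bandwidth. Bandwidth values are placeholders.

TIER_BANDWIDTH = {"dram": 20e9, "ssd": 3e9}      # bytes per second, assumed

def estimate_loading_time(model_bytes, queued_bytes, tier):
    return (queued_bytes + model_bytes) / TIER_BANDWIDTH[tier]   # seconds

# e.g. a 14 GB checkpoint behind 14 GB of already-queued loads on SSD:
print(estimate_loading_time(14e9, 14e9, "ssd"))   # ~9.3 s
```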

Estimating Model Migration Time

$a \times (t_{in} + t_{out}) + b$, i.e., a linear function of the number of input (prompt) tokens $t_{in}$ and already-generated output tokens $t_{out}$ whose KV cache must be recomputed on the destination server, with fitted coefficients $a$ and $b$.

scheduling policy: to select the optimal server for model migration, ServerlessLLM employs a dynamic programming approach that minimizes the estimated migration time. comments: detailed explanation??
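A greedy simplification (not the paper's dynamic program) of how the two estimators can feed server selection; the coefficients $a$, $b$ and the numbers in the example are illustrative:

```python
# Greedy simplification of locality-aware server selection: for each candidate
# server, total start-up cost = time to free it (migration, 0 if idle) plus
# time to load the target model there; pick the minimum. Not the paper's
# dynamic-programming formulation.

def estimate_migration_time(t_in, t_out, a, b):
    # KV-cache recomputation cost grows with the number of tokens to replay
    return a * (t_in + t_out) + b

def choose_server(candidates):
    # candidates: list of dicts with "server", "loading_time", "migration_time"
    return min(candidates, key=lambda c: c["migration_time"] + c["loading_time"])

# example: server-2 holds the checkpoint in DRAM but must migrate a request
candidates = [
    {"server": "server-1", "loading_time": 9.3, "migration_time": 0.0},
    {"server": "server-2", "loading_time": 0.7,
     "migration_time": estimate_migration_time(t_in=512, t_out=128, a=0.002, b=0.1)},
]
print(choose_server(candidates)["server"])   # -> "server-2" (0.7 + 1.38 < 9.3)
```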

Important results

describe the experimental setup
summarize the main results

baseline: random policy and #243

some flaws:

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

Closely related work

list of main competitors and how they differ

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work