LLM inference
special characteristic: LLM inference latency is difficult to predict because response time depends on the output length, which can vary significantly [24, 39, 77], due to iterative output token generation. An LLM inference request carries a user-specified input prompt, i.e., a list of tokens. Given the prompt, an instance of an LLM service generates tokens iteratively, one token per iteration, conditioned on the prompt and all previously generated tokens, until the end-of-sentence token is produced, which makes the total inference time nondeterministic. In each iteration, the LLM caches intermediate computations in a KV cache to speed up the generation of the next token. [ch2.1]
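A minimal sketch of this decode loop, assuming a hypothetical `model_forward` callable that runs one transformer step and returns the next token id plus the updated KV cache (`EOS_TOKEN = 2` is an arbitrary placeholder):

```python
# Autoregressive decoding with a KV cache (illustrative sketch).
EOS_TOKEN = 2  # assumed end-of-sentence token id

def generate(prompt_tokens, model_forward, max_new_tokens=256):
    tokens = list(prompt_tokens)
    kv_cache = None                      # grows with every decoded token
    for _ in range(max_new_tokens):
        if kv_cache is None:
            # Prefill: process the whole prompt once.
            next_token, kv_cache = model_forward(tokens, kv_cache)
        else:
            # Decode: feed only the newest token; cached K/V cover the rest.
            next_token, kv_cache = model_forward(tokens[-1:], kv_cache)
        tokens.append(next_token)
        if next_token == EOS_TOKEN:      # output length is data-dependent,
            break                        # hence nondeterministic latency
    return tokens
```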
serverless inference: In such a setup, developers upload their LLM checkpoints, including model execution files and model parameter files, to a storage system. Upon receiving a request, the system uses a model loading scheduler to allocate available GPUs and start instances from these checkpoints. A request router then directs the inference request to the selected GPUs.
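A toy sketch of this control path; the class and method names (`CheckpointStore`, `LoadingScheduler`, `Router`) are illustrative stand-ins, not the paper's actual APIs:

```python
# Control path of a serverless LLM inference system (illustrative sketch).
class CheckpointStore:
    def __init__(self):
        self.ckpts = {}                          # model_id -> checkpoint path

    def upload(self, model_id, ckpt_path):       # developers upload checkpoints
        self.ckpts[model_id] = ckpt_path


class LoadingScheduler:
    def __init__(self, free_gpus):
        self.free_gpus = list(free_gpus)

    def start_instance(self, model_id, store):
        gpu = self.free_gpus.pop()               # allocate an available GPU
        print(f"loading {store.ckpts[model_id]} onto GPU {gpu}")
        return gpu                               # instance handle = GPU id here


class Router:
    def __init__(self):
        self.instances = {}                      # model_id -> GPU serving it

    def route(self, model_id, request, scheduler, store):
        if model_id not in self.instances:       # cold start path
            self.instances[model_id] = scheduler.start_instance(model_id, store)
        gpu = self.instances[model_id]
        print(f"routing request {request!r} for {model_id} to GPU {gpu}")
```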
what is the problem this paper is solving?
why is it important?
why is it challenging?
challenges: [ch2.2] cold-starting a new instance during peak load:
opportunities:
describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?
explain how the solution work
when loading a model ckpt from SSD to GPU, use a loading-optimized checkpoint format and chunk-based, pipelined loading across the storage hierarchy (SSD → DRAM → GPU), so that the bandwidth of each tier stays saturated (see the sketch below)
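A hedged sketch of the chunk-based pipeline under those assumptions; `read_chunk_from_ssd` and `copy_chunk_to_gpu` are hypothetical placeholders, and the point is only that SSD reads of later chunks overlap with host-to-GPU copies of earlier ones:

```python
import queue
import threading

def load_checkpoint_pipelined(chunk_ids, read_chunk_from_ssd, copy_chunk_to_gpu):
    """Overlap SSD->DRAM reads with DRAM->GPU copies via a bounded staging queue."""
    staging = queue.Queue(maxsize=4)             # bounded DRAM staging buffer

    def reader():
        for cid in chunk_ids:
            staging.put((cid, read_chunk_from_ssd(cid)))   # SSD -> DRAM
        staging.put(None)                                   # end-of-stream marker

    t = threading.Thread(target=reader)
    t.start()
    while True:
        item = staging.get()
        if item is None:
            break
        cid, buf = item
        copy_chunk_to_gpu(cid, buf)                         # DRAM -> GPU
    t.join()
```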
motivation: when do we need live migration?
in this example, server 2 is already serving model A, and model B's checkpoint is already in server 2's memory. Now we want to serve model B. Serving model B on server 2 is better, since it exploits locality: model B is already in server 2's memory, and the memory tiers differ in speed (GPU > DRAM > SSD).
The live-migration-supported, locality-driven policy prioritizes locality without disrupting model A. It first preloads model A onto server 1 while inference continues uninterrupted on server 2. Once model A is ready on server 1, the request's intermediate state is transferred there and the inference resumes seamlessly. Model B then starts on server 2, taking advantage of locality. This policy optimizes latency for both model A and model B.
Live Migration Process: token-based migration. Only the model checkpoint (which is read-only and unchanged during inference) is downloaded to the new server; then all current tokens of the in-flight request are sent over and the KV cache is re-computed on the new server, instead of sending the large KV cache directly.
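A short sketch of token-based migration, assuming hypothetical `load_model` and `prefill` helpers on the destination server:

```python
def migrate_request(request_tokens, model_id, dst_server, load_model, prefill):
    # Download only the read-only checkpoint to the destination server.
    model = load_model(model_id, dst_server)
    # Rebuild the KV cache locally from the token ids (prompt + generated so far),
    # instead of shipping the multi-GB KV cache over the network.
    kv_cache = prefill(model, request_tokens)
    return model, kv_cache                       # decoding resumes from here
```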
For each inference request, the scheduler decides which server should serve which model (which may require model migration or loading). The scheduler incorporates estimators for model loading time and model migration time and chooses the best server based on the status of each server.
Estimating Model Loading Time: based on the size of the requested model plus the sizes of the models already queued for loading on that server, and the bandwidth of the storage tier they are read from; calculate the total download time for them.
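A minimal sketch of this estimate as I read it; the sizes and bandwidth in the usage comment are made up:

```python
def estimate_loading_time(queued_model_sizes_gb, new_model_size_gb, bandwidth_gbps):
    """Total download time = (queued model sizes + new model size) / tier bandwidth."""
    total_gb = sum(queued_model_sizes_gb) + new_model_size_gb
    return total_gb / bandwidth_gbps             # seconds

# e.g. two 13 GB models already queued, a 26 GB model to load, 5 GB/s SSD reads:
# estimate_loading_time([13, 13], 26, 5.0) -> 10.4 s
```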
Estimating Model Migration Time
$a \times (t_{in} + t_{out}) + b$, where $t_{in}$ is the number of input (prompt) tokens, $t_{out}$ is the number of output tokens generated so far, and $a$, $b$ are constants obtained by profiling.
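Transcribed into code; the values of `a` and `b` below are invented, in practice they would be profiled per model and hardware:

```python
def estimate_migration_time(t_in, t_out, a=0.002, b=0.5):
    """Migration cost grows linearly with the tokens whose KV cache must be recomputed."""
    return a * (t_in + t_out) + b                # seconds
```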
scheduling policy: to select the optimal server for model migration, ServerlessLLM employs a dynamic programming approach that minimizes migration time. comment: needs a more detailed explanation?
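A simplified greedy sketch of server selection that reuses the two estimator sketches above; this is not the paper's dynamic program, only the "pick the server with the smallest estimated startup time" intuition. The per-server fields (`queued_gb`, `model_size_gb`, `bandwidth_gbps`, `running_request`) are assumptions for illustration:

```python
def choose_server(servers):
    """servers: list of dicts describing candidate servers (fields are assumed)."""
    def startup_time(s):
        # Time to get the requested model loaded on this server.
        t = estimate_loading_time(s["queued_gb"], s["model_size_gb"], s["bandwidth_gbps"])
        # If an in-flight request must be migrated away first, add its cost.
        if s.get("running_request"):
            t_in, t_out = s["running_request"]
            t += estimate_migration_time(t_in, t_out)
        return t
    return min(servers, key=startup_time)
```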
describe the experimental setup
summarize the main results
baselines: random policy and #243
some flaws:
when doesn't it work?
what assumptions does the paper make and when are they valid?
list of main competitors and how they differ
If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:
Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work
https://arxiv.org/pdf/2401.14351.pdf