pentium3 / sys_reading

system paper reading notes

Efficiently Scaling Transformer Inference #348

Open · pentium3 opened this issue 7 months ago

pentium3 commented 7 months ago

https://proceedings.mlsys.org/paper_files/paper/2023/file/523f87e9d08e6071a3bbd150e6da40fb-Paper-mlsys2023.pdf

pentium3 commented 6 months ago

https://zhuanlan.zhihu.com/p/660715870

pentium3 commented 6 months ago

summary

key problem

workload

efficient generative inference for Transformer models (whereas #256 applies to DNN models in general)

large deep models, with tight latency targets and long sequence lengths

optimization goal

depends on the requirements of downstream applications: interactive workloads (e.g., chat) have tight latency targets, while offline workloads (e.g., batch scoring) prioritize throughput and low cost per token

configurations to tune

model parallelization: how to partition the Multi-Head Attention / FFN layers of each Transformer block across chips (see the sketch below)
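A minimal JAX sketch of one such configuration (my toy setup, not the paper's code; the single "model" mesh axis and all shapes are my assumptions): shard the FFN's $d_{ff}$ dimension across chips, so each chip holds a slice of both weight matrices and the compiler inserts the reduction.

```python
# Minimal sketch of 1D weight-sharded FFN partitioning (illustrative
# only): split d_ff across a "model" mesh axis; assumes d_ff divides
# evenly across the available chips.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

B, d_model, d_ff = 8, 1024, 4096
# w_in columns and w_out rows are sharded over the "model" axis;
# each chip holds a 1/n slice of both matrices.
w_in = jax.device_put(jnp.zeros((d_model, d_ff)),
                      NamedSharding(mesh, P(None, "model")))
w_out = jax.device_put(jnp.zeros((d_ff, d_model)),
                       NamedSharding(mesh, P("model", None)))

@jax.jit
def ffn(x, w_in, w_out):
    # Each chip computes its slice of the hidden layer, then a partial
    # sum of the output; GSPMD inserts the all-reduce over "model".
    return jnp.maximum(x @ w_in, 0.0) @ w_out

y = ffn(jnp.zeros((B, d_model)), w_in, w_out)  # activations replicated
```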

scenario

datacenter. with TPU

technique

partition the FFN and attention layers over a 3D mesh of chips, choosing among layouts (e.g., 1D/2D weight-stationary, weight-gathered) with an analytical cost model of compute vs. communication (all-gather / reduce-scatter collectives); plus multiquery attention sharded over the batch dimension to shrink the KV cache, and int8 weight quantization
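The cost model counts collectives like the two below; a small `shard_map` sketch (axis name and shapes are mine) makes the per-chip view explicit:

```python
# Hedged sketch of the two collectives the layouts are built from:
# all-gather and reduce-scatter (an all-reduce is the two composed).
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(np.array(jax.devices()), ("x",))

def step(local_chunk):
    # all-gather: every chip receives the full concatenated vector
    full = jax.lax.all_gather(local_chunk, axis_name="x", tiled=True)
    # reduce-scatter: sum over chips; each chip keeps one shard of it
    return jax.lax.psum_scatter(full, axis_name="x", tiled=True)

f = shard_map(step, mesh=mesh, in_specs=P("x"), out_specs=P("x"))
out = f(jnp.arange(8.0 * jax.device_count()))
```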

dynamic workload?

no: layouts are selected offline with the analytical model for a given batch size / sequence length / chip count; nothing adapts online

multi-tenant?

not addressed; the paper assumes one model served on a dedicated TPU slice

implementation

implemented in JAX, running on TPU v4; evaluated with PaLM models (8B / 62B / 540B)

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

challenges [ch1]

metrics of inference job [ch2]

tradeoff space between latency/throughput/cost

ch2.1, Fig1
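A hedged restatement of the ch2 metrics as formulas (notation mine; the paper measures cost as chip-seconds per token and efficiency as MFU):

```latex
% B sequences per batch, each generating L_gen tokens:
\text{throughput} = \frac{B \cdot L_{gen}}{\text{total latency}},
\qquad
\text{cost} \propto \frac{\#\text{chips} \times \text{total latency}}{B \cdot L_{gen}}
\quad \text{(chip-seconds per token)}

% MFU: how close the observed FLOP rate is to the hardware peak
\mathrm{MFU} = \frac{\text{observed FLOPs/s}}{\text{peak FLOPs/s}}
```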

problem formulation [ch2.2, 3.1]

model

device layout [ch3.1]: TPU chips connected as a 3D torus with mesh axes x, y, z

tensor partition layouts [ch3.1]: notation like $BLE_{xyz}$ means a (batch, length, embed) tensor whose last dimension is sharded over the x, y, z mesh axes; a "partialsum" suffix marks per-chip partial sums not yet reduced

communication collectives [ch3.1, Figure A.1]: all-gather, reduce-scatter, all-reduce (reduce-scatter followed by all-gather), all-to-all

inference stages

An inference request is executed over a batch of $B$ sequences. Each sequence has $L_{input}$ tokens of input text and generates $L_{gen}$ tokens of output text; all input tokens are present at the start of inference. This splits execution into two stages: a parallel prefill pass over the $L_{input}$ input tokens, and an autoregressive decode loop that emits the $L_{gen}$ output tokens one at a time (see the sketch below).
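A toy sketch of the two stages with greedy sampling (`model_forward` is a hypothetical stand-in for one Transformer pass with a KV cache, stubbed out here so the snippet runs):

```python
# Toy sketch of prefill vs. decode; `model_forward` is hypothetical.
import jax.numpy as jnp

VOCAB = 32000

def model_forward(params, tokens, cache):
    # Stand-in: returns uniform logits and passes the cache through.
    B, L = tokens.shape
    return jnp.zeros((B, L, VOCAB)), cache

def generate(params, input_tokens, L_gen):
    # Prefill: all L_input tokens are present up front, so one parallel
    # forward pass processes them and populates the KV cache.
    logits, cache = model_forward(params, input_tokens, cache=None)
    next_tok = jnp.argmax(logits[:, -1, :], axis=-1)          # [B]

    # Decode: the L_gen output tokens are generated one step at a time;
    # each step feeds the previous token back in and reuses the cache.
    out = [next_tok]
    for _ in range(L_gen - 1):
        logits, cache = model_forward(params, next_tok[:, None], cache)
        next_tok = jnp.argmax(logits[:, -1, :], axis=-1)
        out.append(next_tok)
    return jnp.stack(out, axis=1)                             # [B, L_gen]

tokens = generate(None, jnp.zeros((2, 16), dtype=jnp.int32), L_gen=4)
```

The latency asymmetry between the two stages is what drives the paper's choice of different partitioning layouts for prefill (compute-bound, whole sequence at once) and decode (one token per step).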

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?

provide a set of engineering principles for how best to partition a model in order to scale Transformer inference

Solution description

explain how the solution work

Important results

describe the experimental setup
summarize the main results

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

Closely related work

list of main competitors and how they differ

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work