efficient generative inference for Transformer models (whereas #256 applies to DNN models in general)
large deep models, with tight latency targets and long sequence lengths
depends on the requirements of downstream applications:
model parallelization: how to partition the Multi-Head Attention / FFN layers of a Transformer block (see the sharding sketch below)
datacenter, with TPUs
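As a concrete illustration of the partitioning question above, a minimal sketch (assumed setup, not the paper's code) of 1D "weight-stationary" tensor parallelism for an FFN block in JAX: the hidden dimension `d_ff` is split across a `model` mesh axis, so each chip keeps its weight shard resident and only activations move. All sizes and names here are illustrative.

```python
# Minimal sketch (assumed setup, not the paper's code): 1D "weight-stationary"
# tensor parallelism for a Transformer FFN block in JAX.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

d_model, d_ff, batch, seq = 512, 2048, 4, 16   # illustrative sizes

# One mesh axis over whatever devices are available (d_ff must divide evenly).
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("model",))

# Shard w_in by columns and w_out by rows along the hidden dimension.
w_in = jax.device_put(jnp.ones((d_model, d_ff)),
                      NamedSharding(mesh, P(None, "model")))
w_out = jax.device_put(jnp.ones((d_ff, d_model)),
                       NamedSharding(mesh, P("model", None)))

# Activations replicated here for simplicity (the batch axis could be sharded too).
x = jax.device_put(jnp.ones((batch, seq, d_model)), NamedSharding(mesh, P()))

@jax.jit
def ffn(x, w_in, w_out):
    # Each chip computes its slice of the hidden layer; the second matmul
    # implies an all-reduce over the "model" axis to reassemble the output.
    h = jax.nn.gelu(x @ w_in)
    return h @ w_out

y = ffn(x, w_in, w_out)
print(y.shape, y.sharding)
```

The same column/row split applies to the attention projections; the paper's contribution is in choosing among such layouts (1D vs. 2D weight-stationary vs. weight-gathered) depending on chip count, batch size, and inference phase.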
xxxxx
xxxxx
xxxxx
xxxxx
what is the problem this paper is solving?
why is it important?
why is it challenging?
ch2.1, Fig1
An inference request is executed over a batch of $B$ sequences. Each sequence has $L_{input}$ tokens of input text and generates $L_{gen}$ tokens of output text. (All input tokens are present at the start of inference; the output tokens are generated autoregressively, one at a time.)
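One reason this setting is hard: the attention KV cache grows with both $B$ and $L_{input} + L_{gen}$, which is what makes large-batch, long-sequence decoding memory-bound. A hedged back-of-the-envelope sketch (all model dimensions below are made up for illustration, not taken from the paper); the paper's use of multiquery attention is aimed at shrinking exactly this term.

```python
# Hedged back-of-the-envelope: per-batch KV-cache size for the setup above.
# All model dimensions here are made up for illustration.
def kv_cache_bytes(B, L_input, L_gen, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    tokens = L_input + L_gen           # every token's K and V are cached
    return 2 * B * tokens * n_layers * n_kv_heads * d_head * bytes_per_elem

# Multi-head vs. multiquery attention on a hypothetical 64-layer model:
mha = kv_cache_bytes(B=32, L_input=2048, L_gen=64, n_layers=64, n_kv_heads=32, d_head=128)
mqa = kv_cache_bytes(B=32, L_input=2048, L_gen=64, n_layers=64, n_kv_heads=1,  d_head=128)
print(f"MHA: {mha / 2**30:.0f} GiB   MQA: {mqa / 2**30:.1f} GiB")
```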
describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?
provide a set of engineering principles for how best to partition a model in order to scale Transformer inference
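Those principles come out of analytical estimates of per-layer compute vs. communication under different partitioning layouts. A simplified sketch of that style of reasoning for 1D weight-stationary FFN sharding during decode (the cost formulas and hardware numbers here are rough assumptions, not the paper's exact model):

```python
# Simplified sketch of the compute-vs-communication reasoning behind the
# partitioning principles (rough formulas, hypothetical hardware numbers):
# 1D weight-stationary FFN sharding over n chips.
def ffn_layer_times(B, L, d_model, d_ff, n_chips,
                    flops_per_chip=275e12,   # assumed peak FLOP/s per chip
                    link_bytes_per_s=3e11,   # assumed per-chip interconnect bandwidth
                    bytes_per_elem=2):
    flops_per_chip_layer = 4 * B * L * d_model * d_ff / n_chips   # two matmuls, sharded
    comm_bytes = 2 * B * L * d_model * bytes_per_elem             # all-gather + reduce-scatter of activations
    return flops_per_chip_layer / flops_per_chip, comm_bytes / link_bytes_per_s

for n in (8, 64, 256):
    t_compute, t_comm = ffn_layer_times(B=512, L=1, d_model=8192, d_ff=32768, n_chips=n)
    print(f"n={n:3d}  compute={t_compute*1e6:6.1f} us   comm={t_comm*1e6:6.1f} us")
```

Per-chip compute shrinks with chip count while the activation communication does not, which is why the paper moves to 2D weight-stationary and weight-gathered layouts as chip count and batch size grow.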
explain how the solution works
describe the experimental setup
summarize the main results
when doesn't it work?
what assumptions does the paper make and when are they valid?
list of main competitors and how they differ
If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:
Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work
https://proceedings.mlsys.org/paper_files/paper/2023/file/523f87e9d08e6071a3bbd150e6da40fb-Paper-mlsys2023.pdf