pentium3 / sys_reading

system paper reading notes

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving #256

Closed pentium3 closed 8 months ago

pentium3 commented 1 year ago

https://arxiv.org/pdf/2302.11665.pdf

pentium3 commented 9 months ago

https://www.usenix.org/conference/osdi23/presentation/li-zhouhan

https://zhuanlan.zhihu.com/p/643392335

pentium3 commented 8 months ago

summary

key problem

workload

ML serving, with multiple models served on a shared GPU cluster under bursty, latency-sensitive request traffic

optimization goal

latency SLO attainment
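Concretely (assuming the standard definition used in the paper's evaluation), SLO attainment over a workload $W$ is the fraction of requests that finish within their latency SLO:

$$\text{SLO attainment} = \frac{|\{r \in W : \text{latency}(r) \le \text{SLO}\}|}{|W|}$$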

configurations to tune

model placement across device groups, and the parallelism configuration (inter-op / intra-op) of each group

scenario

request-response paradigm. The serving environment runs in a datacenter with homogeneous devices.

technique

Alpa (DP + ILP) for per-model parallelization + greedy search for placement

dynamic workload?

yes

multi-tenant?

yes. Inference requests for multiple models are served on the same cluster.

implementation

The real system is implemented on top of an existing model-parallel training system, Alpa.

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

model parallelism has been well studied in the throughput-oriented training setting. However, its effect on model serving under latency-sensitive settings remains largely unexplored.

motivation study: [ch3] shows that model parallelism benefits serving multiple models (reducing serving latency and improving resource utilization in the presence of bursty workloads) through statistical multiplexing, under these assumptions:

[ch3.3] and Fig 9 further analyze the effect of inter-op and intra-op parallelism on throughput and latency:

problem/challenge: the decision space (how to partition devices into groups, which parallel configuration to use for each group, and where to place each model) is large.

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?

Model parallelism is not only a way to fit large models: partitioning models across GPUs enables statistical multiplexing of bursty, multi-model serving workloads. AlpaServe automatically searches over model placement and parallel configurations to maximize latency SLO attainment.

Solution description

explain how the solution work

Planning phase:

split models into buckets, assign devices to each bucket, partition each bucket's devices into groups, then place each bucket's models on its groups (a hedged Python sketch of both algorithms follows the list below)

  1. Algorithm 2

    • input: all models to serve, a set of devices, and a workload $W$ (historical request traces)
      1. run get_potential_model_buckets to cluster models into $k$ model buckets so that each bucket contains models of similar size (and thus similar serving latency); enumerate all possible model bucket partitions.
      2. run get_potential_device_buckets to assign a set of devices to each bucket; enumerate all potential assignments.
      3. for each model bucket $B_i$ (a set of models) and its device bucket $H_i$ (a set of devices):
      4. run get_potential_group_partitions to enumerate all possible partition plans $G$ that split the devices of $H_i$ into groups.
      5. for each plan $G$, run get_potential_parallel_configs to enumerate all possible plans $P$ that assign a parallel configuration to each group.
      6. for each plan $P$, call Algorithm 1 greedy_selection with input $(B_i, G, P, W)$; it returns a placement of the models in $B_i$ on the device groups of $G$, i.e. a placement plan for bucket $B_i$.
      7. keep the best plan ($plm^*_i$) for bucket $B_i$ while enumerating all these possibilities.
      8. keep the best overall plan across all model/device bucket partition plans while enumerating all these possibilities.
    • output: the placement plan of every bucket (each bucket's placement is obtained individually).
  2. Algorithm 1

    • input: a list of models M, a list of device groups G, group parallel configurations P, a workload W, and a beam size k (default = 1).
      1. use a greedy algorithm with beam search; in each iteration,
      2. enumerate all possible (model, group) pairs,
      3. call the modified Alpa pass to get a parallelization plan for the model,
      4. check whether the model can be placed on the group under the memory constraint,
      5. call the simulator to compute the SLO attainment of this plan (simulate the whole request trace and compute the SLO attainment rate over all requests).
    • output: a static placement plan of the models in M on the given device groups G with the highest SLO attainment in simulation.
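Below is a minimal Python sketch of Algorithm 1's greedy selection, not the authors' code: `parallelize_with_alpa`, `fits_in_memory`, and `simulate_slo_attainment` are hypothetical helpers standing in for the modified Alpa pass, the memory check, and the trace simulator; each model is placed at most once for brevity and the beam size is fixed to 1.

```python
# Hedged sketch of Algorithm 1 (greedy placement); helper functions are assumed, not real APIs.
def greedy_selection(models, groups, parallel_configs, workload):
    """Greedily place models on device groups to maximize simulated SLO attainment."""
    placement = {g: [] for g in range(len(groups))}  # group index -> [(model, plan), ...]
    remaining = list(models)

    while remaining:
        best = None  # (attainment, model, group index, parallelization plan)

        # Enumerate all (model, group) pairs for the next placement decision.
        for model in remaining:
            for g, (group, config) in enumerate(zip(groups, parallel_configs)):
                # Modified Alpa pass: derive a parallelization plan for this model
                # under the group's parallel configuration (assumed helper).
                plan = parallelize_with_alpa(model, group, config)

                # Skip placements that violate the group's memory constraint (assumed helper).
                if not fits_in_memory(plan, group, placement[g]):
                    continue

                # Simulate the full request trace with the tentative placement and
                # compute the SLO attainment rate over all requests (assumed helper).
                candidate = {**placement, g: placement[g] + [(model, plan)]}
                attainment = simulate_slo_attainment(candidate, workload)

                if best is None or attainment > best[0]:
                    best = (attainment, model, g, plan)

        if best is None:          # nothing fits anymore
            break
        _, model, g, plan = best  # commit the best (model, group) pair
        placement[g].append((model, plan))
        remaining.remove(model)

    return placement
```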

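And a condensed sketch of Algorithm 2's enumeration, reusing `greedy_selection` and `simulate_slo_attainment` from above. The `get_potential_*` names follow the steps listed earlier, but their exact signatures here are assumptions, and scoring buckets by summed attainment is a simplification.

```python
# Hedged sketch of Algorithm 2 (enumeration over buckets, groups, and parallel configs).
# Assumes each get_potential_* enumerator yields at least one candidate.
def place_all_models(models, devices, workload):
    best_plan, best_score = None, -1.0

    # Steps 1-2: enumerate model-bucket partitions and device-to-bucket assignments.
    for buckets in get_potential_model_buckets(models):
        for device_buckets in get_potential_device_buckets(devices, buckets):
            bucket_plans, total = [], 0.0

            # Step 3: handle each (model bucket B_i, device bucket H_i) pair independently.
            for B_i, H_i in zip(buckets, device_buckets):
                best_i = None
                # Steps 4-5: enumerate group partitions G of H_i and parallel configs P.
                for G in get_potential_group_partitions(H_i):
                    for P in get_potential_parallel_configs(G):
                        # Step 6: greedy placement of B_i's models (Algorithm 1 above).
                        plm = greedy_selection(B_i, G, P, workload)
                        score = simulate_slo_attainment(plm, workload)
                        # Step 7: keep the best plan plm*_i for this bucket.
                        if best_i is None or score > best_i[0]:
                            best_i = (score, plm)
                bucket_plans.append(best_i[1])
                total += best_i[0]

            # Step 8: keep the best plan across all bucket/device partition plans.
            if total > best_score:
                best_plan, best_score = bucket_plans, total

    return best_plan
```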

Runtime Scheduling

All requests are sent to a centralized controller. The controller dispatches each request to the group with the shortest queue length. Each group manages a first-come-first-served queue. When a group receives a request, it checks whether it can serve the request under its SLO and rejects the request if it cannot.

see Fig 11
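A hedged Python sketch of this runtime policy; the `Request` fields and the queueing-delay estimate below are simplifications I am assuming, not the paper's actual implementation.

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    model: str
    latency: float   # profiled execution time of this model on a group
    slo: float       # latency SLO in seconds
    arrival: float   # arrival timestamp

class Group:
    """One device group: FCFS queue plus SLO-based admission control."""
    def __init__(self, models):
        self.models = set(models)   # models placed on this group
        self.queue = deque()        # first-come-first-served

    def try_enqueue(self, req, now):
        # Crude stand-in for the paper's check: reject if queued work plus
        # this request's execution time would already exceed its SLO.
        queued_work = sum(r.latency for r in self.queue)
        if (now - req.arrival) + queued_work + req.latency > req.slo:
            return False            # cannot meet SLO -> reject
        self.queue.append(req)
        return True

class Controller:
    """Centralized controller: dispatch each request to the group
    holding its model that currently has the shortest queue."""
    def __init__(self, groups):
        self.groups = groups

    def dispatch(self, req):
        candidates = [g for g in self.groups if req.model in g.models]
        if not candidates:
            return False            # model is not placed on any group
        target = min(candidates, key=lambda g: len(g.queue))
        return target.try_enqueue(req, time.time())
```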

Important results

describe the experimental setup
summarize the main results

hardware: a cluster with 8 nodes and 64 GPUs in total; each node has 8 V100 GPUs

workloads:

Baselines to compare with:

  1. Selective Replication (SR): simplified AlpaServe without model parallelism, to mimic the policy of a wide range of existing serving systems
  2. Clockwork++ (based on Clockwork; see related work)

results

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

when doesn't it work?

assumptions?

Closely related work

list of main competitors and how they differ

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work