ML serving, with latency SLO attainment as the target metric.
Request-response paradigm. The serving environment runs in a datacenter with homogeneous devices.
Alpa (DP + ILP) + greedy search
yes
yes. It does inference of multiple models on the same cluster.
The real system is implemented on top of an existing model-parallel training system, Alpa.
what is the problem this paper is solving?
why is it important?
why is it challenging?
Model parallelism has been well studied in the throughput-oriented training setting. However, its effect on model serving in latency-sensitive settings remains largely unexplored.
motivation study: [ch3] shows that model parallelism benefits serving multiple models (reducing serving latency and improving resource utilization in the presence of bursty workloads) through statistical multiplexing, under these assumptions:
[ch3.3] and Fig 9 further analyze the effect of inter-op and intra-op parallelism in terms of throughput / latency:
problem/challenge: the decision search space is large.
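To get a feel for the size, here is a back-of-the-envelope calculation with made-up numbers (not figures from the paper): even for a single bucket, the number of candidate plans explodes quickly.

```python
# Illustrative size of the search space for ONE bucket
# (hypothetical numbers, not from the paper).
num_models = 10    # models assigned to this bucket
num_devices = 8    # GPUs in this bucket

# Splitting 8 devices into an ordered sequence of groups:
# 2^(8-1) compositions.
group_partitions = 2 ** (num_devices - 1)

# For a fixed partition into, say, 4 groups, each model may be replicated on
# any subset of groups: 2^(models * groups) candidate placements.
placements = 2 ** (num_models * 4)

print(group_partitions)   # 128
print(placements)         # 1099511627776 (~1.1e12)
```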
describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?
explain how the solution works
Planning phase:
split models into buckets, then split devices into groups, then do placement for each bucket
Algorithm 2
get_potential_model_buckets: cluster models into $k$ model buckets so that each bucket contains models of similar size (and thus similar serving latency); enumerate all possible model bucket partitions.
get_potential_device_buckets: assign a set of devices to each bucket; enumerate all potential assignments.
get_potential_group_partitions: get and enumerate all possible partition plans $G$ that partition the devices in bucket $H_i$ into several groups.
get_potential_parallel_configs: get and enumerate all possible plans $P$ that decide the parallel configuration for each group.
greedy_selection: with input $(B_i, G, P, W)$; it returns a placement solution for $B_i$ on the groups of devices in plan $G$, i.e., the placement plan of bucket $B_i$.
Algorithm 1
simulator: simulates the whole request trace and computes the SLO attainment rate of this plan over all requests.
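As a rough illustration of this planning loop, here is a minimal Python sketch. All names (`potential_group_partitions`, `plan_bucket`, the dummy `simulate_slo_attainment`, etc.) are hypothetical stand-ins, not AlpaServe's actual API; the real Algorithm 1/2 also enumerates model buckets and device-to-bucket assignments and scores candidates with a full trace-driven simulator.

```python
# Sketch of the per-bucket search: enumerate group partitions and parallel
# configs, then greedily place models, scoring each candidate with a simulator.
from itertools import product

def potential_group_partitions(num_devices):
    """Enumerate ways to split identical GPUs into groups whose sizes are
    powers of two (a simplifying assumption for this sketch)."""
    def helper(remaining):
        if remaining == 0:
            yield []
            return
        size = 1
        while size <= remaining:
            for rest in helper(remaining - size):
                yield [size] + rest
            size *= 2
    # Deduplicate orderings by sorting group sizes.
    return {tuple(sorted(p, reverse=True)) for p in helper(num_devices)}

def potential_parallel_configs(group_sizes):
    """For each group, enumerate (pipeline_stages, tensor_parallel) pairs
    whose product equals the group size."""
    per_group = []
    for n in group_sizes:
        per_group.append([(pp, n // pp) for pp in range(1, n + 1) if n % pp == 0])
    return product(*per_group)

def simulate_slo_attainment(placement, parallel_config, workload):
    """Dummy stand-in for the trace-driven simulator: should return the
    fraction of requests served within their SLO."""
    return sum(len(models) for models in placement) / (len(workload) + 1)

def greedy_selection(models, group_sizes, parallel_config, workload):
    """Repeatedly add the (model, group) pair that most improves the simulated
    SLO attainment, until no addition helps."""
    placement = [set() for _ in group_sizes]
    best = simulate_slo_attainment(placement, parallel_config, workload)
    while True:
        best_gain, best_move = 0.0, None
        for m, g in product(models, range(len(group_sizes))):
            if m in placement[g]:
                continue
            placement[g].add(m)
            gain = simulate_slo_attainment(placement, parallel_config, workload) - best
            placement[g].remove(m)
            if gain > best_gain:
                best_gain, best_move = gain, (m, g)
        if best_move is None:
            return placement, best
        m, g = best_move
        placement[g].add(m)
        best += best_gain

def plan_bucket(models, num_devices, workload):
    """Search over group partitions and parallel configs for one bucket."""
    best_plan, best_score = None, -1.0
    for group_sizes in potential_group_partitions(num_devices):
        for cfg in potential_parallel_configs(group_sizes):
            placement, score = greedy_selection(models, group_sizes, cfg, workload)
            if score > best_score:
                best_plan, best_score = (group_sizes, cfg, placement), score
    return best_plan, best_score

# Example with a hypothetical workload (list of request arrival times):
plan, score = plan_bucket(["model-a", "model-b"], num_devices=4, workload=[0.0, 0.1, 0.2])
print(plan, score)
```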
Runtime Scheduling
All requests are sent to a centralized controller. The controller dispatches each request to the group with the shortest queue length. Each group manages a first-come-first-serve queue. When a group receives a request, it checks whether it can serve the request under the SLO and rejects the request if it cannot.
see Fig 11
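A minimal sketch of this dispatch logic, under simplifying assumptions (hypothetical class and field names; constant per-request service time; it also ignores which groups actually host the requested model, which the real controller must account for):

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    """One device group with a FIFO queue."""
    latency_per_request: float              # assumed constant service time
    queue: list = field(default_factory=list)

    def can_meet_slo(self, now, slo):
        # Estimated completion time if this request is appended to the queue.
        finish = now + (len(self.queue) + 1) * self.latency_per_request
        return finish - now <= slo

@dataclass
class Controller:
    groups: list

    def dispatch(self, request_id, now, slo):
        # Pick the group with the shortest queue.
        group = min(self.groups, key=lambda g: len(g.queue))
        if not group.can_meet_slo(now, slo):
            return None                      # reject: SLO cannot be met
        group.queue.append((request_id, now + slo))
        return group

# Usage with made-up latencies and SLO:
ctrl = Controller(groups=[Group(0.05), Group(0.10)])
print(ctrl.dispatch("req-1", now=0.0, slo=0.2))
```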
describe the experimental setup
summarize the main results
hardware: a cluster with 8 nodes and 64 GPUs. each node has 8 V100 GPUs
workloads:
Baselines to compare with:
results
when doesn't it work?
what assumptions does the paper make and when are they valid?
when doesn't it work?
assumptions?
In sec 4.2: "in AlpaServe, we assume we know the arrival process in advance." That is, we assume we have the workload trace (arrival rates over time), and we can use a simulator to replay this trace when optimizing the placement plan.
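A toy version of such a trace replay, assuming a single FIFO server with deterministic per-request latency (the paper's simulator models batching and parallel execution; this only shows how SLO attainment can be computed offline from an arrival trace):

```python
def slo_attainment(arrivals, service_time, slo):
    """Replay an arrival trace through one FIFO server and return the fraction
    of requests finishing within `slo` of their arrival time."""
    free_at = 0.0          # time when the server becomes idle
    met = 0
    for t in sorted(arrivals):
        start = max(t, free_at)
        finish = start + service_time
        free_at = finish
        if finish - t <= slo:
            met += 1
    return met / len(arrivals)

# Example with a small bursty trace (hypothetical numbers):
trace = [0.00, 0.01, 0.02, 0.03, 0.50, 1.00]
print(slo_attainment(trace, service_time=0.10, slo=0.25))  # 0.666... (4 of 6)
```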
[ch4.3] assumes all models are always placed on the GPUs. The placement of models in AlpaServe can be updated by periodic re-placement (e.g., every 24 hours).
list of main competitors and how they differ
If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:
Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work
https://arxiv.org/pdf/2302.11665.pdf