pentium3 / sys_reading

system paper reading notes

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving #256

Closed pentium3 closed 8 months ago

pentium3 commented 1 year ago

https://arxiv.org/pdf/2302.11665.pdf

pentium3 commented 9 months ago

https://www.usenix.org/conference/osdi23/presentation/li-zhouhan

https://zhuanlan.zhihu.com/p/643392335

pentium3 commented 8 months ago

summary

key problem

workload

ML serving, with multiple models served on a shared GPU cluster under bursty, latency-sensitive request traffic

optimization goal

latency SLO attainment
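Concretely (assuming the standard definition used in the paper's evaluation), SLO attainment over a workload $W$ is the fraction of requests that finish within their latency SLO:

$$\text{SLO attainment} = \frac{|\{r \in W : \text{latency}(r) \le \text{SLO}\}|}{|W|}$$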

configurations to tune

model placement across device groups, and the parallelism configuration (inter-op / intra-op) of each group

scenario

request-response paradigm. The serving environment runs in a datacenter with homogeneous devices.

technique

Alpa (DP + ILP) for per-model parallelization + greedy search for placement

dynamic workload?

yes

multi-tenant?

yes. Inference requests for multiple models are served on the same cluster.

implementation

The real system is implemented on top of an existing model-parallel training system, Alpa.

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

model parallelism has been well studied in the throughput-oriented training setting. However, its effect on model serving under latency-sensitive settings remains largely unexplored.

motivation study: [ch3] shows that model parallelism benefits serving multiple models (reducing serving latency and improving resource utilization in the presence of bursty workloads) through statistical multiplexing, under these assumptions:

[ch3.3] and Fig 9 further analyze the effect of inter-op and intra-op parallelism on throughput and latency:

problem/challenge: the decision space (how to partition devices into groups, which parallel configuration to use for each group, and where to place each model) is large.

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?

Model parallelism is not only a way to fit large models: partitioning models across GPUs enables statistical multiplexing of bursty, multi-model serving workloads. AlpaServe automatically searches over model placement and parallel configurations to maximize latency SLO attainment.

Solution description

explain how the solution work

Planning phase:

split models into buckets, assign devices to each bucket, partition each bucket's devices into groups, then place each bucket's models on its groups (a hedged Python sketch of both algorithms follows the list below)

  1. Algorithm 2

    • input: all models to serve, a set of devices, and a workload $W$ (historical request traces)
      1. run get_potential_model_buckets to cluster models into $k$ model buckets so that each bucket contains models of similar size (and thus similar serving latency); enumerate all possible model bucket partitions.
      2. run get_potential_device_buckets to assign a set of devices to each bucket; enumerate all potential assignments.
      3. for each model bucket $B_i$ (a set of models) and its device bucket $H_i$ (a set of devices):
      4. run get_potential_group_partitions to enumerate all possible partition plans $G$ that split the devices of $H_i$ into groups.
      5. for each plan $G$, run get_potential_parallel_configs to enumerate all possible plans $P$ that assign a parallel configuration to each group.
      6. for each plan $P$, call Algorithm 1 greedy_selection with input $(B_i, G, P, W)$; it returns a placement of the models in $B_i$ on the device groups of $G$, i.e. a placement plan for bucket $B_i$.
      7. keep the best plan ($plm^*_i$) for bucket $B_i$ while enumerating all these possibilities.
      8. keep the best overall plan across all model/device bucket partition plans while enumerating all these possibilities.
    • output: the placement plan of every bucket (each bucket's placement is obtained individually).
  2. Algorithm 1

    • input: a list of models M, a list of device groups G, group parallel configurations P, a workload W, and a beam size k (default = 1).
      1. use a greedy algorithm with beam search; in each iteration,
      2. enumerate all possible (model, group) pairs,
      3. call the modified Alpa pass to get a parallelization plan for the model,
      4. check whether the model can be placed on the group under the memory constraint,
      5. call the simulator to compute the SLO attainment of this plan (simulate the whole request trace and compute the SLO attainment rate over all requests).
    • output: a static placement plan of the models in M on the given device groups G with the highest SLO attainment in simulation.
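Below is a minimal Python sketch of Algorithm 1's greedy selection, not the authors' code: `parallelize_with_alpa`, `fits_in_memory`, and `simulate_slo_attainment` are hypothetical helpers standing in for the modified Alpa pass, the memory check, and the trace simulator; each model is placed at most once for brevity and the beam size is fixed to 1.

```python
# Hedged sketch of Algorithm 1 (greedy placement); helper functions are assumed, not real APIs.
def greedy_selection(models, groups, parallel_configs, workload):
    """Greedily place models on device groups to maximize simulated SLO attainment."""
    placement = {g: [] for g in range(len(groups))}  # group index -> [(model, plan), ...]
    remaining = list(models)

    while remaining:
        best = None  # (attainment, model, group index, parallelization plan)

        # Enumerate all (model, group) pairs for the next placement decision.
        for model in remaining:
            for g, (group, config) in enumerate(zip(groups, parallel_configs)):
                # Modified Alpa pass: derive a parallelization plan for this model
                # under the group's parallel configuration (assumed helper).
                plan = parallelize_with_alpa(model, group, config)

                # Skip placements that violate the group's memory constraint (assumed helper).
                if not fits_in_memory(plan, group, placement[g]):
                    continue

                # Simulate the full request trace with the tentative placement and
                # compute the SLO attainment rate over all requests (assumed helper).
                candidate = {**placement, g: placement[g] + [(model, plan)]}
                attainment = simulate_slo_attainment(candidate, workload)

                if best is None or attainment > best[0]:
                    best = (attainment, model, g, plan)

        if best is None:          # nothing fits anymore
            break
        _, model, g, plan = best  # commit the best (model, group) pair
        placement[g].append((model, plan))
        remaining.remove(model)

    return placement
```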

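And a condensed sketch of Algorithm 2's enumeration, reusing `greedy_selection` and `simulate_slo_attainment` from above. The `get_potential_*` names follow the steps listed earlier, but their exact signatures here are assumptions, and scoring buckets by summed attainment is a simplification.

```python
# Hedged sketch of Algorithm 2 (enumeration over buckets, groups, and parallel configs).
# Assumes each get_potential_* enumerator yields at least one candidate.
def place_all_models(models, devices, workload):
    best_plan, best_score = None, -1.0

    # Steps 1-2: enumerate model-bucket partitions and device-to-bucket assignments.
    for buckets in get_potential_model_buckets(models):
        for device_buckets in get_potential_device_buckets(devices, buckets):
            bucket_plans, total = [], 0.0

            # Step 3: handle each (model bucket B_i, device bucket H_i) pair independently.
            for B_i, H_i in zip(buckets, device_buckets):
                best_i = None
                # Steps 4-5: enumerate group partitions G of H_i and parallel configs P.
                for G in get_potential_group_partitions(H_i):
                    for P in get_potential_parallel_configs(G):
                        # Step 6: greedy placement of B_i's models (Algorithm 1 above).
                        plm = greedy_selection(B_i, G, P, workload)
                        score = simulate_slo_attainment(plm, workload)
                        # Step 7: keep the best plan plm*_i for this bucket.
                        if best_i is None or score > best_i[0]:
                            best_i = (score, plm)
                bucket_plans.append(best_i[1])
                total += best_i[0]

            # Step 8: keep the best plan across all bucket/device partition plans.
            if total > best_score:
                best_plan, best_score = bucket_plans, total

    return best_plan
```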

Runtime Scheduling

All requests are sent to a centralized controller. The controller dispatches each request to the group with the shortest queue length. Each group manages a first-come-first-served queue. When a group receives a request, it checks whether it can serve the request under its SLO and rejects the request if it cannot.

see Fig 11
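A hedged Python sketch of this runtime policy; the `Request` fields and the queueing-delay estimate below are simplifications I am assuming, not the paper's actual implementation.

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    model: str
    latency: float   # profiled execution time of this model on a group
    slo: float       # latency SLO in seconds
    arrival: float   # arrival timestamp

class Group:
    """One device group: FCFS queue plus SLO-based admission control."""
    def __init__(self, models):
        self.models = set(models)   # models placed on this group
        self.queue = deque()        # first-come-first-served

    def try_enqueue(self, req, now):
        # Crude stand-in for the paper's check: reject if queued work plus
        # this request's execution time would already exceed its SLO.
        queued_work = sum(r.latency for r in self.queue)
        if (now - req.arrival) + queued_work + req.latency > req.slo:
            return False            # cannot meet SLO -> reject
        self.queue.append(req)
        return True

class Controller:
    """Centralized controller: dispatch each request to the group
    holding its model that currently has the shortest queue."""
    def __init__(self, groups):
        self.groups = groups

    def dispatch(self, req):
        candidates = [g for g in self.groups if req.model in g.models]
        if not candidates:
            return False            # model is not placed on any group
        target = min(candidates, key=lambda g: len(g.queue))
        return target.try_enqueue(req, time.time())
```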

Important results

describe the experimental setup
summarize the main results

hardware: a cluster with 8 nodes and 64 GPUs in total; each node has 8 V100 GPUs

workloads:

Baselines to compare with:

  1. Selective Replication (SR): simplified AlpaServe without model parallelism, to mimic the policy of a wide range of existing serving systems
  2. Clockwork++ (based on Clockwork; see related work)

results

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

when doesn't it work?

assumptions?

Closely related work

list of main competitors and how they differ

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work