summary

key problem

workload

Prediction pipelines with multiple machine learning models and data transformations to support complex prediction tasks (see Fig2 for examples). A pipeline can include conditional logics: a subset of models are invoked based on the output of earlier models in the pipeline.

When using InferLine, users provide a driver program, sample query trace used for planning, and a (tail) latency service level objective.

sample trace: assumed input rate over time. could be characterized as a stochastic processes mention in [ch2.1]
one query (driver program) means one pipeline

optimization goal

satisfy user provided end-to-end tail latency SLO while minimizing cost

configurations to tune

For each stage in the pipeline, decide:

the appropriate hardware accelerator (e.g., CPU, GPU, TPU)
batching parameters (batch size)
replication factor (parallelism)

scenario

one pipeline running on a fixed set of resources in datacenter

technique

low-freq planner: enumerate + greedy to explore all actions high-freq planner: a linear model to calc parallelism (since they assume models scale horizontally)

dynamic workload?

yes. dynamic rate

multi-tenant?

one pipeline with multiple ML models and data transformations

implementation

All experiments used Clipper [9] as the prediction-serving framework except for those in Fig. 14 which compare InferLine running on Clipper and TensorFlow Serving [37]. Both prediction-serving frameworks were modified to add a centralized batched queueing system.

see https://github.com/ucbrise/clipper

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

problem: decide the following configurations for each stage of an inference pipeline:

the appropriate hardware accelerator (e.g., CPU, GPU, TPU)
batching parameters (batch size)
replication factor (parallelism)

challenges:

large combinatorial configuration space
Unpredictable input rate

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?

Solution description

explain how the solution work

2 phases

1. Low-Frequency Planning

- Profiler: [ch4.1]

for each of the models (stages) in the pipeline, profile model throughput (and how long it takes to process the given batch size) under different configurations of (hardware type, batch size), with single replica.

assume models scale horizontally.
evaluating the model in isolation [ch3]

- Estimator: [ch4.2]

rapidly estimating the end-to-end latency of a given (whole) pipeline configuration for the sample query trace.

input: a pipeline configuration, the individual model profiles, and a sample trace of the query workload
output: accurate estimates of the latency for each query in the trace.
simulating the entire pipeline, including queueing delays and conditional control flow (using the scale factor s).

- Planner: [ch4.3]

the planning algorithm is an iterative constrained optimization procedure that greedily minimizes cost while ensuring that the latency constraint is satisfied.

Initialization (Algorithm 1): Find a feasible configuration that can meet tail latency SLO, and ignore cost.

lines 2-5: Use an initial latency-minimizing configuration constructed by setting the batch size to 1 and using the lowest latency hardware available for each model.
lines 6-7: If the service time (aggregated e2e latency of all stages in the longest path of the pipeline) under this configuration still cannot meet SLO, it means the latency constraint is infeasible. Quit
lines 9-11: Otherwise, Feasible(pipeline, slo) use the Estimator to see whether the current configuration can meet tail latency SLO. If cannot meet, find the bottleneck model (the one with lowest throughput), increase its parallelism by 1. Repeat until can meet tail latency SLO

Cost-Minimization (Algorithm 2): try to reduce cost while still meeting SLO, based on initial configuration

Line 2: Based on initial configuration

Line 6-16: Iterate the following process:

For each model (stage) in the pipeline, 
For each action, 
    Apply this action on this model and make a new configuration 
    If the new configuration is feasible and has lower cost, mark it as better configuration.

Repeat this process until no better configuration can be found

here actions including:

Increase batch size:
Remove a replica (parallelism):
Downgrade Hardware to a lower cost one:
effect of these 3 actions:
- Increasing the batch size does increase latency, but maybe help when the other two create a infeasible configurations.
- Removing replicas is feasible when a previous iteration of the algorithm has increased the batch size for a model, increasing the per-replica throughput.
- the reduction in hardware price sometimes compensates for the increased replication factor

Result: finds an efficient, low-cost configuration that is guaranteed to meet the provided latency objective under the provided sample trace.

High-Frequency Planning [ch5]

For real-time dynamic workload, when the serving workload deviates from the sample, use a Tuner to detect the changes, and takes the appropriate scaling action to maintain both the latency constraint and cost-efficiency objective.

comments: while in low-freq planning we have 3 actions to change: batch size, hardware, parallelism, in high-freq planning we only change parallelism.

- detect workload change

use traffic envelope to characterize the workloads by measuring the maximum arrival rate for several different window sizes. For a query arrival process, construct the following histogram:

X-axis: $w_i$ indicates a time window size.
- start from $w_1=\triangle T$, then $w_2=2 \triangle T$, $w_4=4 \triangle T$, ... (window size grows exponentially)
Y-axis: maximum arrival rate for all possible sliding windows with size $w_i$
- for each $x=w_i$, measure [ $y=q_i$ =max(total num of queries in any time window with size $w_i$ over the duration of the trace) ]

Then in this histogram, smaller windows measure burstiness, while larger windows measure overall request rate.

During low-freq planning, the Planner constructs the traffic envelope for the sample arrival trace. It gives us an upper bound of the workload rate under initial configuration produce by planner.

During high-freq planning, The Tuner continuously computes the live traffic envelope for the current arrival workload.

- Scaling Up (Algorithm 3)

Line 2-6: for each window size, check if any of the current rates exceed their corresponding sample rates.
- in Line 4, $r_{obs}$ is the maximum arrival rate of windows with size $w_i$
- keep $r{max}=max(r{obs})$, which indicates the current workload rate that triggered scaling up (since it is the max sample over the process of current arrival rate)
Line 7-8: if triggered scaling up, then make scaling decisions for each model.
Line 9-13: calc new parallelism for model $m$.
- $s_m$ is the scale factor for model $m$, which prevents over-provisioning for a model that only receives a portion of the queries due to conditional logic.
- $\mu_m$ is the profiled throughput for model $m$ in its current configuration
- $ρ_m$ is the max-provisioning ratio, which ensures enough slack remains in the model to handle bursts

an example:

- Scaling Down (Algorithm 4)

InferLine takes a conservative approach to scaling down the pipeline to prevent unnecessary configuration oscillation.

Line 2: delay 15s after any configuration changes
Line 3: continuously computes the max request rate $λ_{new}$ that has been observed over the last 30 seconds, using 5 second windows.
Line 4-8, for each model, re-calculate parallelism (using the similar formula as scaling up).

Important results

describe the experimental setup
summarize the main results

environment:

client: one aws instance with 64 vCPUs, 256GB memory, 25Gbps network

servers: 16 aws instance. each has 8 NVIDIA K80 GPUs, 32 vCPUs, 488GB memory, 10Gbps network

cost of CPU/GPU are based on aws instance cost.

workload:

pipeline: as shown in Fig2

rate: 2 sets

1. generated synthetic traces by sampling inter-arrival times from a gamma distribution.
1. traces derived from real workloads studied in the AutoScale system

baseline:

Coarse-Grained Baseline:

In current systems, the individual pipeline components are each deployed as a separate micro-service to a prediction serving system such as [9, 15, 36, 37] and a pipeline is manually constructed by individual calls to each service. Any performance tuning for end-to-end latency or cost treats the entire pipeline as a single black-box service and tunes it as a whole. We therefore use this same approach as our baseline for comparison.
In this baseline, we profile the entire pipeline as a single black box to identify the single maximum batch size capable of meeting the SLO, in contrast to InferLine’s per-model profiling.
Here we use 2 variants of Coarse-Grained Baseline: CG-Mean and CG-Peak.

result:

Fig 6: run InferLine with only low-freq planner. InferLine meets latency SLOs at the lowest cost.

Fig 7: run InferLine with both high-freq and low-freq planner, under real workload ( 7a and 7b are under different real workload pattern).

Planner finds a 5x cheaper initial configuration than coarse-grained provisioning (Fig. 7(a)).
when the big spike occurs we observe that InferLine’s Tuner quickly reacts by scaling up the pipeline

Fig 8: same as Fig7, but under a set of synthetic workloads with increasing arrival rates

the traffic envelope monitoring described in §5 enables InferLine to detect the increase in arrival rate earlier and therefore scale up the pipeline sooner to maintain a low SLO miss rate.

Fig 9: how closely the latency distribution produced by the Estimator reflects the latency distribution of the running system

Fig 10: Planner Sensitivity: Planner’s performance (cost of its given cfg) under varying load, burstiness, and end-to-end latency SLOs.

Fig 11: Tuner Sensitivity. a pipeline with only the Planner enabled will miss latency SLO during burstiness, if the real arrival rate is different from sample trace.

Fig 12: Tuner Sensitivity. Planner+Tuner can detect deviation from expected arrival burstiness and react to meet the latency SLOs.

Fig 13: Attribution of benefit of InferLine low-frequency Planner and high-frequency Tuner

Fig 14: evaluate the InferLine Planner running on top of both Clipper and TensorFlow Serving

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

assumptions:

assume models scale horizontally.
the available hardware has a total ordering of latency across all batch sizes. (hardware A always slower than hardware B for all batch size)
- there may be settings where one accelerator is slower than another at smaller batch sizes but faster at larger batch sizes.
the inference latency of ML models is independent of their input

Closely related work

list of main competitors and how they differ

there are many existing works about dynamic scaling of data streaming systems, [ch8]
- these systems focus their effort on supporting more traditional data processing workloads, which include stateful aggregation operators and support for a variety of windowing operations. tring to maximize throughput and minimize backpressure. While in serving pipeline, there is no backpressure metrics. Overloaded records will be queued in each stage. Queueing delay must be explicitly considered during pipeline configuration. [ch2.1]
- Both systems provision for the average ingest rate, while InferLine maintains a traffic envelope of the request workload
- InferLine’s Tuner automatically provisions for worst-case (tail) latencies.

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work

low-freq planner makes action based on simulating the trace does not quite convince me (same as #256 paper). Not sure what will happen if the trace is far away from real workload. Will that mislead the algorithm and make a wrong action? There should be some way to do action purely based on real-time workload rate.
- they seem evaluated this case in Fig11-12 (Tuner Sensitivity), but not sure how far the real workload is away from the sample trace in their evaluation.
- since we only tune 3 cfgs for each stage, and we have profiles under all cfg, we can extend the high-freq planner to take all 3 cfgs into account
traffic envelope might be a good tool to characterise the workload rate process (see Fig 8), especially for modelling burstiness.

pentium3 / sys_reading

InferLine: Latency-Aware Provisioning and Scaling for Prediction Serving Pipelines #188