pentium3 / sys_reading

system paper reading notes

InferLine: Latency-Aware Provisioning and Scaling for Prediction Serving Pipelines #188

Closed. pentium3 closed this issue 8 months ago.

pentium3 commented 1 year ago

https://dl.acm.org/doi/pdf/10.1145/3419111.3421285

pentium3 commented 1 year ago

https://luowle.tech/zh-cn/posts/inferline/

pentium3 commented 8 months ago

summary

key problem

workload

Prediction pipelines with multiple machine learning models and data transformations to support complex prediction tasks (see Fig 2 for examples). A pipeline can include conditional logic: only a subset of the models is invoked, depending on the outputs of earlier models in the pipeline.

When using InferLine, users provide a driver program, a sample query trace used for planning, and an end-to-end (tail) latency service level objective (SLO).
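A rough sketch of what these three inputs might look like (the names and the `inferline.plan` call are hypothetical illustrations, not the paper's actual API):

```python
# Hypothetical sketch of the three user-provided inputs (illustrative only):
# a driver program, a sample query trace, and an end-to-end tail-latency SLO.

def driver(query, models):
    """Driver program: wires models/transformations into a pipeline,
    possibly with conditional logic based on earlier model outputs."""
    label = models["classifier"](query)
    if label == "positive":                # conditional branch: only a subset
        return models["sentiment"](query)  # of models is invoked per query
    return label

sample_trace = [0.000, 0.012, 0.015, 0.040, 0.041]  # query arrival times (seconds)
latency_slo = 0.150                                  # end-to-end P99 SLO (seconds)

# plan = inferline.plan(driver, sample_trace, latency_slo)   # hypothetical call
```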

optimization goal

satisfy user provided end-to-end tail latency SLO while minimizing cost

configurations to tune

For each stage in the pipeline, decide:

- hardware type (e.g. CPU or GPU)
- maximum batch size
- number of replicas (parallelism)
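As a concrete illustration (my own sketch, not code from the paper), the per-stage knobs could be represented as:

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    hardware: str      # accelerator type for this stage, e.g. "cpu" or "gpu"
    batch_size: int    # maximum batch size for queries sent to this stage
    replicas: int      # number of replicas (degree of parallelism)

# A pipeline configuration maps each stage to its per-stage knobs.
pipeline_config = {
    "preprocess": StageConfig(hardware="cpu", batch_size=8,  replicas=1),
    "resnet":     StageConfig(hardware="gpu", batch_size=32, replicas=2),
}
```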

scenario

one pipeline running on a fixed set of resources in datacenter

technique

low-freq planner: enumerate + greedy to explore all actions

high-freq planner: a linear model to calculate parallelism (since they assume models scale horizontally)

dynamic workload?

yes: dynamic arrival rate

multi-tenant?

no; InferLine plans for a single pipeline with multiple ML models and data transformations

implementation

All experiments used Clipper [9] as the prediction-serving framework except for those in Fig. 14 which compare InferLine running on Clipper and TensorFlow Serving [37]. Both prediction-serving frameworks were modified to add a centralized batched queueing system.

see https://github.com/ucbrise/clipper
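As a very rough illustration of a centralized batched queue (my own sketch, not the actual Clipper/TensorFlow Serving modification):

```python
import queue

class BatchedQueue:
    """Centralized per-model queue: workers pull up to `batch_size` queries at once."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.q = queue.Queue()

    def enqueue(self, query):
        self.q.put(query)

    def dequeue_batch(self, timeout=0.01):
        """Block up to `timeout` for the first query (raises queue.Empty if none
        arrives), then drain whatever else is queued, up to batch_size."""
        batch = [self.q.get(timeout=timeout)]
        while len(batch) < self.batch_size:
            try:
                batch.append(self.q.get_nowait())
            except queue.Empty:
                break
        return batch
```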

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

problem: decide the following configurations for each stage of an inference pipeline: hardware type, batch size, and number of replicas (parallelism)

challenges:

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?

Solution description

explain how the solution work

2 phases

1. Low-Frequency Planning

- Profiler: [ch4.1]

for each model (stage) in the pipeline, profile its throughput (and the time to process a batch of the given size) under different (hardware type, batch size) configurations, using a single replica.
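A rough sketch of what the profiling loop might look like (my own; `run_batch` is a placeholder for executing one batch of queries on the chosen hardware):

```python
import time

def profile_model(run_batch, hardware_types, batch_sizes, trials=10):
    """Measure single-replica batch latency and throughput for each
    (hardware, batch size) combination. `run_batch(hw, b)` is a placeholder
    that executes one batch of size b on hardware hw."""
    profiles = {}
    for hw in hardware_types:
        for b in batch_sizes:
            start = time.perf_counter()
            for _ in range(trials):
                run_batch(hw, b)
            batch_latency = (time.perf_counter() - start) / trials
            profiles[(hw, b)] = {
                "batch_latency": batch_latency,   # seconds per batch
                "throughput": b / batch_latency,  # queries per second
            }
    return profiles
```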

- Estimator: [ch4.2]

rapidly estimates the end-to-end latency of a given (whole) pipeline configuration on the sample query trace.
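The paper implements the Estimator as a discrete-event simulator that replays the sample trace using the profiled batch latencies. A much-simplified, single-stage sketch of the idea (my own approximation; it ignores pipeline structure, dynamic batching policy, and network delays):

```python
def estimate_latencies(arrival_times, batch_size, replicas, batch_latency):
    """Replay the sample trace through one stage: queries are grouped into
    batches of up to `batch_size` and served by `replicas` identical workers,
    each taking `batch_latency` seconds per batch. Returns per-query latencies.
    (Real batching is dynamic; fixed-size chunks are a simplification.)"""
    replica_free_at = [0.0] * replicas            # time each replica becomes idle
    latencies = []
    for i in range(0, len(arrival_times), batch_size):
        batch = arrival_times[i:i + batch_size]
        r = min(range(replicas), key=lambda k: replica_free_at[k])
        start = max(replica_free_at[r], batch[-1])  # wait for full batch + free replica
        finish = start + batch_latency
        replica_free_at[r] = finish
        latencies.extend(finish - t for t in batch)
    return latencies

# p99 of a candidate configuration on the sample trace:
# lat = estimate_latencies(sample_trace, batch_size=8, replicas=2, batch_latency=0.020)
# p99 = sorted(lat)[int(0.99 * (len(lat) - 1))]
```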

- Planner: [ch4.3]

the planning algorithm is an iterative constrained optimization procedure that greedily minimizes cost while ensuring that the latency constraint is satisfied.

Initialization (Algorithm 1): find a feasible configuration that meets the tail latency SLO, ignoring cost.


Cost-Minimization (Algorithm 2): starting from the initial configuration, try to reduce cost while still meeting the SLO.


the actions here include: increasing a stage's batch size, removing a replica, and downgrading a stage's hardware.

Result: finds an efficient, low-cost configuration that is guaranteed to meet the provided latency objective under the provided sample trace.
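A schematic sketch of the greedy cost-minimization loop (my reading of Algorithm 2's structure; `estimate_p99`, `cost`, and `neighbors` are placeholder callables, not the paper's code):

```python
def cost_minimize(config, slo, estimate_p99, cost, neighbors):
    """Greedy cost minimization: starting from a feasible (SLO-meeting) config,
    repeatedly apply the action that most reduces cost while keeping the
    estimated tail latency under the SLO. `neighbors(config)` yields candidate
    configs produced by the three actions: increase a stage's batch size,
    remove a replica, or downgrade a stage's hardware."""
    while True:
        feasible = [c for c in neighbors(config) if estimate_p99(c) <= slo]
        cheaper = [c for c in feasible if cost(c) < cost(config)]
        if not cheaper:
            return config                   # no action reduces cost: done
        config = min(cheaper, key=cost)     # take the largest cost reduction
```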

2. High-Frequency Planning [ch5]

For a real-time dynamic workload, when the serving workload deviates from the sample trace, a Tuner detects the change and takes the appropriate scaling action to maintain both the latency constraint and the cost-efficiency objective.

comment: while low-freq planning can change three knobs (batch size, hardware, parallelism), high-freq planning only changes parallelism.

- detect workload change

use a traffic envelope to characterize the workload by measuring the maximum arrival rate over several different window sizes. For a query arrival process, construct a histogram of the maximum arrival rate observed at each window size.

In this histogram, smaller windows capture burstiness, while larger windows capture the overall request rate.

During low-freq planning, the Planner constructs the traffic envelope for the sample arrival trace. This gives an upper bound on the arrival rates that the initial configuration produced by the Planner is provisioned to handle.

During high-freq planning, the Tuner continuously computes the live traffic envelope for the current arrival workload.
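A small sketch of how the traffic envelope could be computed from a trace of arrival timestamps (my own rendering of the description above):

```python
def traffic_envelope(arrival_times, window_sizes):
    """For each window size w, find the maximum number of arrivals in any
    window of length w (arrival_times must be sorted), reported as a rate in
    queries/second. Small windows capture burstiness; large windows capture
    the overall request rate."""
    envelope = {}
    for w in window_sizes:
        max_count, start = 0, 0
        for end, t in enumerate(arrival_times):
            while t - arrival_times[start] > w:   # slide the window's left edge
                start += 1
            max_count = max(max_count, end - start + 1)
        envelope[w] = max_count / w
    return envelope

# e.g. traffic_envelope(sample_trace, window_sizes=[0.1, 1.0, 10.0, 60.0])
```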

- Scaling Up (Algorithm 3)

The Tuner uses a linear scaling model to compute the parallelism each stage needs to sustain the rates in the current traffic envelope (see "technique" above).
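A hedged sketch of that linear scaling rule (my own rendering; `single_replica_throughput` would come from the Profiler):

```python
import math

def replicas_needed(envelope, single_replica_throughput, current_replicas):
    """Scale-up sketch: for each window size in the live traffic envelope,
    the linear scaling model says we need ceil(rate / per-replica throughput)
    replicas; scale up to the maximum such requirement if it exceeds the
    current replica count."""
    needed = max(math.ceil(rate / single_replica_throughput)
                 for rate in envelope.values())
    return max(needed, current_replicas)   # never scale down here (see Algorithm 4)
```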

- Scaling Down (Algorithm 4)

InferLine takes a conservative approach to scaling down the pipeline to prevent unnecessary configuration oscillation.

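A loose sketch of what a conservative scale-down check could look like (my own interpretation; the headroom factor is an assumption, not from the paper):

```python
def maybe_scale_down(envelope, single_replica_throughput, current_replicas,
                     headroom=1.2):
    """Remove one replica only if the remaining replicas could still serve the
    peak rate observed in the recent traffic envelope, with a safety margin,
    to avoid oscillating between configurations."""
    if current_replicas <= 1:
        return current_replicas
    capacity = (current_replicas - 1) * single_replica_throughput
    peak = max(envelope.values())
    if peak * headroom <= capacity:
        return current_replicas - 1
    return current_replicas
```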

Important results

describe the experimental setup
summarize the main results

environment:

client: one AWS instance with 64 vCPUs, 256 GB memory, 25 Gbps network

servers: 16 AWS instances, each with 8 NVIDIA K80 GPUs, 32 vCPUs, 488 GB memory, 10 Gbps network

CPU/GPU costs are based on AWS instance pricing.

workload:

pipeline: as shown in Fig2


rate: 2 sets

baseline:

Coarse-Grained Baseline: provisions and scales the pipeline as a single unit (one configuration for the whole pipeline) rather than per stage.

result:

Fig 6: run InferLine with only low-freq planner. InferLine meets latency SLOs at the lowest cost.

Fig 7: run InferLine with both the high-freq and low-freq planners under real workloads (7a and 7b use different real workload patterns).

Fig 8: same as Fig 7, but under a set of synthetic workloads with increasing arrival rates

Fig 9: how closely the latency distribution produced by the Estimator reflects the latency distribution of the running system

Fig 10: Planner Sensitivity: the Planner's performance (cost of the configuration it produces) under varying load, burstiness, and end-to-end latency SLOs.

Fig 11: Tuner Sensitivity. A pipeline with only the Planner enabled will miss the latency SLO during bursts if the real arrival process differs from the sample trace.

Fig 12: Tuner Sensitivity. Planner+Tuner can detect deviation from expected arrival burstiness and react to meet the latency SLOs.

Fig 13: Attribution of benefit between InferLine's low-frequency Planner and high-frequency Tuner

Fig 14: evaluate the InferLine Planner running on top of both Clipper and TensorFlow Serving

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

assumptions:

Closely related work

list of main competitors and how they differ


Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work