Prediction pipelines compose multiple machine learning models and data transformations to support complex prediction tasks (see Fig. 2 for examples). A pipeline can include conditional logic: a subset of models is invoked based on the output of earlier models in the pipeline.
When using InferLine, users provide a driver program, a sample query trace used for planning, and a (tail) latency service level objective (SLO).
satisfy the user-provided end-to-end tail latency SLO while minimizing cost
For each stage in the pipeline, decide: hardware type, batch size, and number of replicas (parallelism).
one pipeline running on a fixed set of resources in a datacenter
low-freq planner: enumeration + greedy search over the actions
high-freq planner: a linear model to calculate the needed parallelism (since they assume models scale horizontally)
yes. dynamic rate
one pipeline with multiple ML models and data transformations
All experiments used Clipper [9] as the prediction-serving framework except for those in Fig. 14 which compare InferLine running on Clipper and TensorFlow Serving [37]. Both prediction-serving frameworks were modified to add a centralized batched queueing system.
see https://github.com/ucbrise/clipper
what is the problem this paper is solving?
why is it important?
why is it challenging?
problem: decide the following configuration for each stage of an inference pipeline: hardware type, batch size, and number of replicas (parallelism).
challenges:
describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?
explain how the solution work
2 phases: a low-frequency planning phase and a high-frequency tuning phase
For each of the models (stages) in the pipeline, profile the model's throughput (and how long it takes to process a given batch size) under different configurations of (hardware type, batch size), using a single replica.
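A rough sketch of what such a profiling loop could look like (the `runners` / `make_batch` interface below is an assumption for illustration, not the paper's API):

```python
import time

def profile_stage(runners, make_batch, batch_sizes, warmup=5, trials=20):
    """Profile a single model replica over (hardware type, batch size) pairs.

    runners:    dict mapping hardware type (e.g. 'cpu', 'k80') to a callable
                that runs one batch of queries through the model on that hardware
    make_batch: callable taking b and returning a batch of b sample queries
    Returns the measured batch latency and derived single-replica throughput
    for each (hardware, batch size) configuration.
    """
    profile = {}
    for hw, run in runners.items():
        for b in batch_sizes:
            batch = make_batch(b)
            for _ in range(warmup):                # warm up caches / GPU clocks
                run(batch)
            start = time.perf_counter()
            for _ in range(trials):
                run(batch)
            batch_latency = (time.perf_counter() - start) / trials
            profile[(hw, b)] = {
                'batch_latency': batch_latency,    # seconds per batch
                'throughput': b / batch_latency,   # queries/second for one replica
            }
    return profile
```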
rapidly estimating the end-to-end latency of a given (whole) pipeline configuration for the sample query trace.
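A coarse sketch of how such an estimator could work, assuming fixed-size batches and the profiled per-batch latencies (a simplification for illustration, not the paper's actual discrete-event Estimator):

```python
def estimate_latencies(arrivals, stages):
    """Estimate the end-to-end latency of every query in the sample trace.

    arrivals: sorted list of query arrival timestamps (seconds)
    stages:   list of per-stage configs, each a dict with 'batch_size',
              'replicas', and 'batch_latency' (profiled seconds per batch
              on the chosen hardware)
    """
    times = list(arrivals)                 # when each query is ready for the next stage
    for stage in stages:
        b, r, lat = stage['batch_size'], stage['replicas'], stage['batch_latency']
        replica_free = [0.0] * r           # when each replica becomes idle
        order = sorted(range(len(times)), key=lambda i: times[i])
        done = [0.0] * len(times)
        for start in range(0, len(order), b):       # form batches of up to b queries
            batch = order[start:start + b]
            ready = max(times[i] for i in batch)    # batch waits for its last query
            k = min(range(r), key=lambda j: replica_free[j])
            finish = max(ready, replica_free[k]) + lat
            replica_free[k] = finish
            for i in batch:
                done[i] = finish
        times = done
    return [t - a for t, a in zip(times, arrivals)]
```

The estimated tail latency is then a percentile of the returned list, which the planner compares against the SLO.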
the planning algorithm is an iterative constrained optimization procedure that greedily minimizes cost while ensuring that the latency constraint is satisfied.
Initialization (Algorithm 1): find a feasible configuration that meets the tail latency SLO, ignoring cost.
Cost-Minimization (Algorithm 2): starting from the initial configuration, try to reduce cost while still meeting the SLO (see the sketch after this list).
For each model (stage) in the pipeline,
For each action,
Apply this action to this model to produce a new configuration.
If the new configuration is still feasible (meets the SLO) and has lower cost, keep it as the new best configuration.
Here the actions adjust the three per-stage knobs: batch size, hardware type, and number of replicas (parallelism).
Result: finds an efficient, low-cost configuration that is guaranteed to meet the provided latency objective under the provided sample trace.
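A minimal sketch of this greedy cost-minimization loop (the function names and exact iteration order below are illustrative assumptions):

```python
def minimize_cost(config, actions, cost, estimate_p99, slo, sample_trace):
    """Greedily reduce cost while keeping the estimated tail latency under the SLO.

    config:       dict mapping stage -> (hardware, batch_size, replicas),
                  assumed to already be feasible (from the initialization phase)
    actions:      callables (config, stage) -> cheaper candidate config or None
                  (e.g. increase batch size, downgrade hardware, remove a replica)
    cost:         callable config -> provisioning cost
    estimate_p99: callable (config, trace) -> estimated tail latency
    """
    improved = True
    while improved:
        improved = False
        best = config
        for stage in config:
            for action in actions:
                candidate = action(config, stage)
                if candidate is None:
                    continue
                # keep a candidate only if it is cheaper and still meets the SLO
                if cost(candidate) < cost(best) and \
                        estimate_p99(candidate, sample_trace) <= slo:
                    best = candidate
                    improved = True
        config = best          # apply the best action found this round and repeat
    return config
```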
For real-time dynamic workloads, when the serving workload deviates from the sample trace, a Tuner detects the change and takes the appropriate scaling action to maintain both the latency constraint and the cost-efficiency objective.
comment: while low-freq planning has 3 actions to change (batch size, hardware, parallelism), high-freq planning only changes parallelism.
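Given horizontal scaling, the high-frequency parallelism decision can be sketched as a one-line linear rule (the function below is illustrative, not the paper's code):

```python
import math

def required_replicas(arrival_rate, single_replica_throughput):
    """Replicas needed for a stage, assuming throughput scales linearly with replicas.

    arrival_rate:               queries/second the stage must sustain
                                (taken from the live traffic envelope)
    single_replica_throughput:  queries/second of one replica at the planned
                                (hardware, batch size), from the profile
    """
    return max(1, math.ceil(arrival_rate / single_replica_throughput))
```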
Use a traffic envelope to characterize the workload by measuring the maximum arrival rate over several different window sizes: for a query arrival process, record, for each window size, the maximum number of queries that arrive within any window of that length.
In this envelope, smaller windows capture burstiness, while larger windows capture the overall request rate.
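A small sketch of computing such an envelope from a list of arrival timestamps (the window sizes are a parameter; the specific values used in the paper are not reproduced here):

```python
def traffic_envelope(arrival_times, window_sizes):
    """Maximum observed arrival rate for each window size.

    arrival_times: sorted query arrival timestamps (seconds)
    window_sizes:  list of window lengths (seconds); small windows capture
                   burstiness, large windows capture the overall request rate
    """
    envelope = {}
    for w in window_sizes:
        max_count, lo = 0, 0
        for hi, t in enumerate(arrival_times):
            while t - arrival_times[lo] > w:    # slide window start forward
                lo += 1
            max_count = max(max_count, hi - lo + 1)
        envelope[w] = max_count / w             # max arrival rate at this window size
    return envelope
```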
During low-freq planning, the Planner constructs the traffic envelope for the sample arrival trace. This gives an upper bound on the workload that the initial configuration produced by the Planner must be able to sustain.
During high-freq planning, the Tuner continuously computes the live traffic envelope for the current arrival workload to detect deviations from the sample trace.
InferLine takes a conservative approach to scaling down the pipeline to prevent unnecessary configuration oscillation.
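One hedged sketch of such a conservative rule (the hold time and the exact envelope comparison below are assumptions for illustration, not details from the paper): only scale down after the live envelope has stayed below the planned envelope for a while, so short lulls do not trigger oscillation.

```python
def maybe_scale_down(live_env, planned_env, below_since, now, hold_seconds=60.0):
    """Return (scale_down?, updated below_since).

    live_env / planned_env: traffic envelopes, dict window_size -> max rate
    below_since:            time at which live traffic last dropped below the plan
    hold_seconds:           how long traffic must stay low before scaling down
                            (an assumed parameter)
    """
    below = all(live_env[w] < planned_env[w] for w in planned_env)
    if not below:
        return False, now                  # traffic is high again: reset the timer
    if now - below_since >= hold_seconds:
        return True, below_since           # sustained lull: safe to remove replicas
    return False, below_since              # wait longer before scaling down
```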
describe the experimental setup
summarize the main results
environment:
client: one AWS instance with 64 vCPUs, 256 GB memory, and a 25 Gbps network
servers: 16 AWS instances, each with 8 NVIDIA K80 GPUs, 32 vCPUs, 488 GB memory, and a 10 Gbps network
CPU/GPU costs are based on AWS instance prices.
workload:
pipeline: as shown in Fig. 2
rate: 2 sets of arrival traces (real-world traces and synthetic traces)
baseline:
Coarse-Grained Baseline:
result:
Fig 6: run InferLine with only the low-freq Planner. InferLine meets latency SLOs at the lowest cost.
Fig 7: run InferLine with both the high-freq and low-freq planners, under real workloads (7a and 7b use different real workload patterns).
Fig 8: same as Fig 7, but under a set of synthetic workloads with increasing arrival rates.
Fig 9: how closely the latency distribution produced by the Estimator reflects the latency distribution of the running system
Fig 10: Planner Sensitivity: Planner’s performance (cost of its given cfg) under varying load, burstiness, and end-to-end latency SLOs.
Fig 11: Tuner sensitivity. A pipeline with only the Planner enabled will miss the latency SLO during bursts if the real arrival pattern differs from the sample trace.
Fig 12: Tuner sensitivity. Planner + Tuner can detect deviations from the expected arrival burstiness and react to meet the latency SLOs.
Fig 13: attribution of benefit between InferLine's low-frequency Planner and high-frequency Tuner
Fig 14: evaluate the InferLine Planner running on top of both Clipper and TensorFlow Serving
when doesn't it work?
what assumptions does the paper make and when are they valid?
assumptions: models scale horizontally (throughput grows linearly with the number of replicas); the sample query trace provided for planning is reasonably representative of the production arrival process.
list of main competitors and how they differ
If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:
Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work
https://dl.acm.org/doi/pdf/10.1145/3419111.3421285