pentium3 / sys_reading

system paper reading notes

MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving #319



https://www.usenix.org/conference/atc19/presentation/zhang-chengliang


summary

key problem

workload

serving a single ML model

optimization goal

reduce cost while meeting the latency SLO

the paper sets two SLO requirements for MArk [ch4.1]:

configurations to tune

scenario

cloud platform, using a combination of VMs, serverless functions, and spot instances

MLaaS platforms (e.g., Amazon SageMaker) can do reactive scaling based on the current load, but provisioning time (minutes) is much larger than execution time (< 1 s), so cloud providers tend to hide provisioning time by over-provisioning.

technique

heuristic + greedy

dynamic workload?

yes

multi-tenant?

no

implementation

a controller independent of the serving framework (e.g., TensorFlow Serving)

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

goal: reduce cost while meeting the latency SLO for ML inference on cloud platforms.

motivation study: Characterizing Model Serving in the Cloud. Measures the performance/cost of ML inference on cloud platforms under different configurations.

experiment setup: evaluate the peak inference performance of a single model with TensorFlow Serving

cloud services

Infrastructure-as-a-Service (IaaS): rent VMs directly (e.g., AWS EC2); full control over the instance type, billed per instance-hour.

Container-as-a-Service (CaaS): run containers on managed infrastructure (e.g., AWS ECS/Fargate).

Function-as-a-Service (FaaS): serverless functions (e.g., AWS Lambda); scales almost instantly and is billed per invocation duration, but with memory/runtime limits and a higher unit price.

What service to use: IaaS, CaaS, or FaaS?

IaaS achieves the best cost and latency performance for ML model serving, and combining it with FaaS can potentially reduce over-provisioning while remaining scalable to spiky workloads

IaaS is used as the primary serving option, while FaaS can provide transient service while new IaaS instances are launching.

IaaS: Can we use burstable instances?

burstable instances are not suitable for long-running compute-intensive services, but can be used as transient backup resources; see ch5.2

IaaS: Big instances or small instances?

In the on-demand CPU market, smaller instances have a higher performance-cost ratio than bigger ones, even though the latter provide shorter latency.

Smaller instances with advanced CPU models are preferable as they achieve a higher performance-cost ratio. As instance size increases, performance improves, but only sub-linearly.

IaaS: How does GPU compare with CPU?

Batching can significantly improve the cost-effectiveness of larger CPU instances and GPU instances, but increasing the batch size leads to both longer queuing latency and longer batch inference latency.
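A rough way to see this tradeoff (my own notation, not from the paper): with arrival rate $\lambda$, the first request of a batch of size $b$ waits about $(b-1)/\lambda$ for the batch to fill and then $T_b(b)$ for the batch to execute, so its end-to-end latency is roughly

$$T_{\text{total}}(b) \approx \frac{b-1}{\lambda} + T_b(b)$$

Both terms grow with $b$, while the cost per request shrinks as the per-batch overhead is amortized.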

It is safe to use spot instances for ML serving, since inference serving is stateless.

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?

Based on the findings of the motivation study, the paper proposes MArk, a scalable system that provides cost-effective, SLO-aware ML inference serving on AWS.

main idea: combine IaaS's cost advantage with FaaS's scalability. Instead of over-provisioning IaaS, use FaaS to absorb demand surges and spikes, which hides IaaS provisioning time without over-provisioning.

Solution description

explain how the solution works


Input requests go to a request queue and are grouped into batches by the Batch Manager.

Proactive Controller:

Workload Prediction [ch4.2]

estimate the maximum request rate in the near future.

provides an API for users to plug in their own prediction algorithm; by default it ships a vanilla LSTM for multi-step workload prediction (predicting the workload for multiple one-step horizons into the future; see the sketch below).
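A minimal sketch of the multi-step idea (my own illustration, not MArk's code): apply a one-step predictor recursively, feeding each prediction back as input for the next step. `one_step_model` is a hypothetical stand-in for the paper's LSTM.

```python
from typing import Callable, List

def predict_multi_step(history: List[float],
                       one_step_model: Callable[[List[float]], float],
                       steps: int) -> List[float]:
    """Recursively apply a one-step predictor to forecast `steps` future points.

    `one_step_model` maps a request-rate history to the predicted rate of the
    next interval (e.g., an LSTM); each prediction is appended to the history
    and fed back in, yielding a multi-step forecast.
    """
    window = list(history)
    forecast = []
    for _ in range(steps):
        nxt = one_step_model(window)
        forecast.append(nxt)
        window.append(nxt)
    return forecast

# Example with a trivial stand-in predictor (moving average of the last 3 points):
if __name__ == "__main__":
    naive = lambda w: sum(w[-3:]) / 3
    print(predict_multi_step([100, 120, 130, 150], naive, steps=5))
```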

Batching [ch4.3]

The Batch Manager fetches requests from the queue and submits a batch when either of two limits is reached:

suppose $T_b$ is the time needed to process a batch (obtained by profiling); then batching should satisfy the following two constraints:

tune the following hyperparameters by profiling:

then use a heuristic to tune the batch size: gradually increase the batch size from 1 until at least one of the constraints no longer holds (see the sketch below)
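A hedged sketch of this greedy search (the concrete constraint below is an illustrative placeholder stated in terms of the profiled $T_b$, not the paper's exact formulas):

```python
def choose_batch_size(constraints, max_batch=64):
    """Greedy batch-size tuning: grow b from 1 and stop once any constraint breaks.

    `constraints` is a list of predicates over a candidate batch size b; the paper
    states them in terms of the profiled batch processing time T_b and the SLO.
    Returns the largest b for which every constraint still holds (at least 1).
    """
    best = 1
    for b in range(1, max_batch + 1):
        if all(check(b) for check in constraints):
            best = b
        else:
            break
    return best

# Illustrative constraint (my assumption, not verbatim from the paper): the time to
# accumulate b requests at the predicted arrival rate, plus the profiled batch
# latency T_b(b), must fit within the latency budget left by the SLO.
def example_constraints(profile_Tb, arrival_rate, latency_budget):
    return [lambda b: (b - 1) / arrival_rate + profile_Tb(b) <= latency_budget]
```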

Instance Provisioning [ch4.3]

dynamically decide how many instances to keep

at time $t_0$, given

main idea: use an online heuristic algorithm to determine which instances to launch and which to destroy at $t_0$ (and run the new set of instances from $t_0$ to $t_m$), so as to minimize cost while meeting the target SLO.

fill the required capacity with the cheapest instances one by one, considering the cost over the running duration plus the launching overhead (no launching overhead for currently running instances); see the sketch below.
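A hedged sketch of this greedy fill (my reconstruction; the instance names, capacities, prices, and the way launch overhead is amortized are placeholder assumptions):

```python
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    capacity: float         # requests/second this instance can sustain (from profiling)
    price_per_hour: float   # on-demand or spot price
    launch_overhead: float  # extra cost of the provisioning gap; 0 for running instances
    single_use: bool        # True for an already-running instance, False for a launchable type

def plan_instances(options, predicted_peak_rate, duration_hours):
    """Greedy provisioning: cover the predicted peak rate with the cheapest capacity.

    Already-running instances carry launch_overhead=0, so they sort ahead of
    launching the same type anew; options not picked are candidates for shutdown.
    Assumes every option has capacity > 0.
    """
    def unit_cost(o):
        return (o.price_per_hour * duration_hours + o.launch_overhead) / o.capacity

    pool = sorted(options, key=unit_cost)
    plan, covered, i = [], 0.0, 0
    while covered < predicted_peak_rate and i < len(pool):
        o = pool[i]
        plan.append(o.name)
        covered += o.capacity
        if o.single_use:
            i += 1  # a running instance counts once; a launchable type can be picked again
    return plan

# Hypothetical numbers: keep the running c5.large, then launch more as needed.
running = Option("c5.large (running)", 50, 0.085, 0.0, True)
new_c5  = Option("c5.large (new)",     50, 0.085, 0.01, False)
print(plan_instances([running, new_c5], predicted_peak_rate=120, duration_hours=1.0))
```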


comment:

SLO tracking [ch4.4]

constantly checks whether the last M requests satisfy the SLO requirements; if not, L instances of type T are launched (c5.large by default).

comment: what are L and T? Not explained in the paper. I assume that when the SLO requirement is not satisfied, MArk spawns some of the smallest instances (c5.large, since it was the most cost-effective in the motivation study) to handle the extra requests; a sketch follows below.
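A small sketch of what such a reactive check could look like (the window size, violation threshold, and the defaults L=1 and T="c5.large" are my placeholders; the paper does not specify them):

```python
from collections import deque

class SLOTracker:
    """Track the last M request latencies and trigger scale-out on SLO violations.

    The scale-out action (launch L instances of type T) is delegated to a callback;
    the defaults below are assumptions, mirroring the comment above.
    """
    def __init__(self, slo_seconds, window_m=1000, violation_ratio=0.05,
                 launch_callback=None, instances_l=1, instance_type_t="c5.large"):
        self.slo = slo_seconds
        self.window = deque(maxlen=window_m)
        self.violation_ratio = violation_ratio
        self.launch = launch_callback or (lambda n, t: print(f"launch {n} x {t}"))
        self.L = instances_l
        self.T = instance_type_t

    def record(self, latency_seconds):
        """Record one observed request latency and scale out if the SLO slips."""
        self.window.append(latency_seconds)
        violations = sum(1 for x in self.window if x > self.slo)
        if len(self.window) == self.window.maxlen and \
           violations / len(self.window) > self.violation_ratio:
            self.launch(self.L, self.T)
            self.window.clear()  # avoid repeatedly triggering on the same window
```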

Spot Instance and Lambda Cold Start [ch4.5]

Spot Instance

Lambda cold start

If an incoming request cannot be served within a specified time $RT_{max}$, it is handled by Lambda instances immediately [ch4.1] (see the sketch at the end of this subsection).

cold start overhead: every time a new Lambda instance is launched, it needs to load the ML model, framework libraries, and code into memory.

however, based on their evaluation, the latency/cost impact of cold starts is limited, since
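A tiny sketch of the $RT_{max}$ fallback path referenced above (the estimators and senders are placeholders for MArk's internal components; the real system would invoke AWS Lambda with the model packaged inside the function):

```python
def route_request(request, estimated_wait, estimated_service, rt_max,
                  send_to_queue, send_to_lambda):
    """Route one request: use the IaaS-backed batch path when it can meet RT_max,
    otherwise fall back to Lambda, which scales instantly but costs more per request."""
    if estimated_wait + estimated_service <= rt_max:
        send_to_queue(request)    # normal path: batched and served on provisioned instances
    else:
        send_to_lambda(request)   # fallback path: absorb the overflow on FaaS
```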

Important results

describe the experimental setup
summarize the main results

Two variants: MArk-ondemand only uses on-demand instances; MArk-spot also uses spot instances with the interruption-tolerant mechanism from ch4.5.

setup: on AWS, with 42 c5 (CPU) instances, 10 m5 (CPU) instances, and 12 p2.xlarge (GPU) instances.

ML models: see Table 4 in the paper; model sizes up to 343 MB.

workload (ML request rate): 1. Twitter: a predictable real workload based on a Twitter trace. 2. MMPP: a synthetic Markov-modulated Poisson process trace with unpredictable spikes. See Fig. 5.


baseline: AWS SageMaker
comment: did SageMaker support GPU instances at that time?

results:

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

Closely related work

list of main competitors and how they differ

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work