pentium3 / sys_reading

system paper reading notes

MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving #319



https://www.usenix.org/conference/atc19/presentation/zhang-chengliang


summary

key problem

workload

serving a single ML model

optimization goal

reduce cost while meeting the latency SLO

the paper sets two SLO requirements for MArk [ch4.1]:

configurations to tune

scenario

cloud platform, using a combination of VMs, serverless functions, and spot instances

MLaaS platforms (e.g., Amazon SageMaker) can do reactive scaling based on the current load, but provisioning time (minutes) is much larger than execution time (< 1 s), so cloud providers tend to hide provisioning time by over-provisioning.

technique

heuristic + greedy

dynamic workload?

yes

multi-tenant?

no

implementation

a controller independent of the serving framework (e.g., TensorFlow Serving)

Problem and motivation

what is the problem this paper is solving?
why is it important?
why is it challenging?

goal: reduce cost while meeting the latency SLO for ML inference on cloud platforms.

motivation study: Characterizing Model Serving in the Cloud. Measures the performance/cost of ML inference on cloud platforms under different configurations.

experiment setup: evaluate the peak inference performance of a single model with TensorFlow Serving

cloud services

Infrastructure-as-a-Service (IaaS): rent VMs directly (e.g., AWS EC2); full control over the instance type, billed per instance-hour.

Container-as-a-Service (CaaS): run containers on managed infrastructure (e.g., AWS ECS/Fargate).

Function-as-a-Service (FaaS): serverless functions (e.g., AWS Lambda); scales almost instantly and is billed per invocation duration, but with memory/runtime limits and a higher unit price.

What service to use: IaaS, CaaS, or FaaS?

IaaS achieves the best cost and latency performance for ML model serving, and combining it with FaaS can potentially reduce over-provisioning while remaining scalable to spiky workloads

IaaS is used as the primary serving option, while FaaS can provide transient service while new IaaS instances are launching.

IaaS: Can we use burstable instances?

burstable instances are not suitable for long-running compute-intensive services, but can be used as transient backup resources; see ch5.2

IaaS: Big instances or small instances?

In the on-demand CPU market, smaller instances have a higher performance-cost ratio than bigger ones, even though the latter provide shorter latency.

Smaller instances with advanced CPU models are preferable as they achieve a higher performance-cost ratio. As instance size increases, performance improves, but only sub-linearly.

IaaS: How does GPU compare with CPU?

Batching can significantly improve the cost-effectiveness of larger CPU instances and GPU instances, but increasing the batch size leads to both longer queuing latency and longer batch inference latency.
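A rough way to see this tradeoff (my own notation, not from the paper): with arrival rate $\lambda$, the first request of a batch of size $b$ waits about $(b-1)/\lambda$ for the batch to fill and then $T_b(b)$ for the batch to execute, so its end-to-end latency is roughly

$$T_{\text{total}}(b) \approx \frac{b-1}{\lambda} + T_b(b)$$

Both terms grow with $b$, while the cost per request shrinks as the per-batch overhead is amortized.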

It is safe to use spot instances for ML serving, since inference serving is stateless.

Main ideas and insights

describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?

Based on the findings of the motivation study, the paper proposes MArk, a scalable system that provides cost-effective, SLO-aware ML inference serving on AWS.

main idea: combine IaaS's cost advantage with FaaS's scalability. Instead of over-provisioning IaaS, use FaaS to absorb demand surges and spikes, which hides IaaS provisioning time without over-provisioning.

Solution description

explain how the solution works


Input requests go to a request queue and are grouped into batches by the Batch Manager.

Proactive Controller:

Workload Prediction [ch4.2]

estimate the maximum request rate in the near future.

provides an API for users to plug in their own prediction algorithm; by default it ships a vanilla LSTM for multi-step workload prediction (predicting the workload for multiple one-step horizons into the future; see the sketch below).
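A minimal sketch of the multi-step idea (my own illustration, not MArk's code): apply a one-step predictor recursively, feeding each prediction back as input for the next step. `one_step_model` is a hypothetical stand-in for the paper's LSTM.

```python
from typing import Callable, List

def predict_multi_step(history: List[float],
                       one_step_model: Callable[[List[float]], float],
                       steps: int) -> List[float]:
    """Recursively apply a one-step predictor to forecast `steps` future points.

    `one_step_model` maps a request-rate history to the predicted rate of the
    next interval (e.g., an LSTM); each prediction is appended to the history
    and fed back in, yielding a multi-step forecast.
    """
    window = list(history)
    forecast = []
    for _ in range(steps):
        nxt = one_step_model(window)
        forecast.append(nxt)
        window.append(nxt)
    return forecast

# Example with a trivial stand-in predictor (moving average of the last 3 points):
if __name__ == "__main__":
    naive = lambda w: sum(w[-3:]) / 3
    print(predict_multi_step([100, 120, 130, 150], naive, steps=5))
```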

Batching [ch4.3]

The Batch Manager fetches requests from the queue and submits a batch when either of two limits is reached:

suppose $T_b$ is the time needed to process a batch (obtained by profiling); then batching should satisfy the following two constraints:

tune the following hyperparameters by profiling:

then use a heuristic to tune the batch size: gradually increase the batch size from 1 until at least one of the constraints no longer holds (see the sketch below)
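A hedged sketch of this greedy search (the concrete constraint below is an illustrative placeholder stated in terms of the profiled $T_b$, not the paper's exact formulas):

```python
def choose_batch_size(constraints, max_batch=64):
    """Greedy batch-size tuning: grow b from 1 and stop once any constraint breaks.

    `constraints` is a list of predicates over a candidate batch size b; the paper
    states them in terms of the profiled batch processing time T_b and the SLO.
    Returns the largest b for which every constraint still holds (at least 1).
    """
    best = 1
    for b in range(1, max_batch + 1):
        if all(check(b) for check in constraints):
            best = b
        else:
            break
    return best

# Illustrative constraint (my assumption, not verbatim from the paper): the time to
# accumulate b requests at the predicted arrival rate, plus the profiled batch
# latency T_b(b), must fit within the latency budget left by the SLO.
def example_constraints(profile_Tb, arrival_rate, latency_budget):
    return [lambda b: (b - 1) / arrival_rate + profile_Tb(b) <= latency_budget]
```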

Instance Provisioning [ch4.3]

dynamically decide how many instances to keep

at time $t_0$, given

main idea: use an online heuristic algorithm to determine which instances to launch and which to destroy at $t_0$ (and run the new set of instances from $t_0$ to $t_m$), so as to minimize cost while meeting the target SLO.

fill the required capacity with the cheapest instances one by one, considering the cost over the running duration plus the launching overhead (no launching overhead for currently running instances); see the sketch below.
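A hedged sketch of this greedy fill (my reconstruction; the instance names, capacities, prices, and the way launch overhead is amortized are placeholder assumptions):

```python
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    capacity: float         # requests/second this instance can sustain (from profiling)
    price_per_hour: float   # on-demand or spot price
    launch_overhead: float  # extra cost of the provisioning gap; 0 for running instances
    single_use: bool        # True for an already-running instance, False for a launchable type

def plan_instances(options, predicted_peak_rate, duration_hours):
    """Greedy provisioning: cover the predicted peak rate with the cheapest capacity.

    Already-running instances carry launch_overhead=0, so they sort ahead of
    launching the same type anew; options not picked are candidates for shutdown.
    Assumes every option has capacity > 0.
    """
    def unit_cost(o):
        return (o.price_per_hour * duration_hours + o.launch_overhead) / o.capacity

    pool = sorted(options, key=unit_cost)
    plan, covered, i = [], 0.0, 0
    while covered < predicted_peak_rate and i < len(pool):
        o = pool[i]
        plan.append(o.name)
        covered += o.capacity
        if o.single_use:
            i += 1  # a running instance counts once; a launchable type can be picked again
    return plan

# Hypothetical numbers: keep the running c5.large, then launch more as needed.
running = Option("c5.large (running)", 50, 0.085, 0.0, True)
new_c5  = Option("c5.large (new)",     50, 0.085, 0.01, False)
print(plan_instances([running, new_c5], predicted_peak_rate=120, duration_hours=1.0))
```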


comment:

SLO tracking [ch4.4]

constantly checks whether the last M requests satisfy the SLO requirements; if not, L instances of type T are launched (c5.large by default).

comment: what are L and T? Not explained in the paper. I assume that when the SLO requirement is not satisfied, MArk spawns some of the smallest instances (c5.large, since it was the most cost-effective in the motivation study) to handle the extra requests; a sketch follows below.
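A small sketch of what such a reactive check could look like (the window size, violation threshold, and the defaults L=1 and T="c5.large" are my placeholders; the paper does not specify them):

```python
from collections import deque

class SLOTracker:
    """Track the last M request latencies and trigger scale-out on SLO violations.

    The scale-out action (launch L instances of type T) is delegated to a callback;
    the defaults below are assumptions, mirroring the comment above.
    """
    def __init__(self, slo_seconds, window_m=1000, violation_ratio=0.05,
                 launch_callback=None, instances_l=1, instance_type_t="c5.large"):
        self.slo = slo_seconds
        self.window = deque(maxlen=window_m)
        self.violation_ratio = violation_ratio
        self.launch = launch_callback or (lambda n, t: print(f"launch {n} x {t}"))
        self.L = instances_l
        self.T = instance_type_t

    def record(self, latency_seconds):
        """Record one observed request latency and scale out if the SLO slips."""
        self.window.append(latency_seconds)
        violations = sum(1 for x in self.window if x > self.slo)
        if len(self.window) == self.window.maxlen and \
           violations / len(self.window) > self.violation_ratio:
            self.launch(self.L, self.T)
            self.window.clear()  # avoid repeatedly triggering on the same window
```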

Spot Instance and Lambda Cold Start [ch4.5]

Spot Instance

Lambda cold start

If an incoming request cannot be served within a specified time $RT_{max}$, it is handled by Lambda instances immediately [ch4.1] (see the sketch at the end of this subsection).

cold start overhead: every time a new Lambda instance is launched, it needs to load the ML model, framework libraries, and code into memory.

however, based on their evaluation, the latency/cost impact of cold starts is limited, since
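A tiny sketch of the $RT_{max}$ fallback path referenced above (the estimators and senders are placeholders for MArk's internal components; the real system would invoke AWS Lambda with the model packaged inside the function):

```python
def route_request(request, estimated_wait, estimated_service, rt_max,
                  send_to_queue, send_to_lambda):
    """Route one request: use the IaaS-backed batch path when it can meet RT_max,
    otherwise fall back to Lambda, which scales instantly but costs more per request."""
    if estimated_wait + estimated_service <= rt_max:
        send_to_queue(request)    # normal path: batched and served on provisioned instances
    else:
        send_to_lambda(request)   # fallback path: absorb the overflow on FaaS
```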

Important results

describe the experimental setup
summarize the main results

Two variants: MArk-ondemand only uses on-demand instances; MArk-spot also uses spot instances with the interruption-tolerant mechanism from ch4.5.

setup: on AWS, with 42 c5 (CPU) instances, 10 m5 (CPU) instances, and 12 p2.xlarge (GPU) instances.

ML models: see Table 4 in the paper; model sizes up to 343 MB.

workload (ML request rate): 1. Twitter: a predictable real workload based on a Twitter trace. 2. MMPP: a synthetic Markov-modulated Poisson process trace with unpredictable spikes. See Fig. 5.


baseline: AWS SageMaker
comment: did SageMaker support GPU instances at that time?

results:

Limitations and opportunities for improvement

when doesn't it work?
what assumptions does the paper make and when are they valid?

Closely related work

list of main competitors and how they differ

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:

Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work