serve single ML model
reduce cost while meeting the latency SLO
we set two SLO requirements for MArk: [ch4.1]
cloud platform, using a combination of VMs + serverless + spot instances
MLaaS platforms (e.g., Amazon SageMaker) can do reactive scaling based on the current load, but provisioning time (minutes) is much larger than execution time (< 1s), so cloud providers tend to hide provisioning time by over-provisioning
heuristic + greedy
yes
no
a controller independent of the serving framework (e.g., TensorFlow Serving)
what is the problem this paper is solving?
why is it important?
why is it challenging?
goal: reduce cost while meeting the latency SLO for ML inference on a cloud platform.
motivation study: Characterizing Model Serving in the Cloud. Measure the performance/cost of ML inference on a cloud platform under different configurations.
experiment setup: evaluate the peak inference performance of a single model with TensorFlow Serving
Infrastructure-as-a-Service (IaaS):
Container-as-a-Service (CaaS):
Function-as-a-Service (FaaS):
IaaS achieves the best cost and latency performance for ML model serving, and combining it with FaaS can potentially reduce over-provisioning while remaining scalable to spiky workloads
IaaS is used as the primary serving option, while FaaS can provide transient service while new IaaS instances are launching.
burstable instances are not suitable for long-running compute-intensive services, but can be used as transient backup resources. See ch5.2
In the on-demand CPU market, smaller instances have a higher performance-cost ratio than bigger ones, even though the latter provide shorter latency.
smaller instances with advanced CPU models are preferable, as they achieve a higher performance-cost ratio. As instance size increases, performance improves, but sub-linearly.
batching can significantly improve the cost-effectiveness of larger CPU instances and GPU instances, but increasing the batch size leads to both longer queuing latency and longer batch inference latency
It is safe to use spot instances for ML serving, since serving is stateless.
describe the paper gist in 1-2 sentences
what is important to remember? What did we learn?
Based on the findings in motivation study, the paper proposed MArk, a scalable system that provides cost-effective, SLO-aware ML inference serving in AWS.
main idea: combine IaaS's cost advantage with FaaS's scalability. Instead of over-provisioning IaaS, use FaaS to handle demand surges and spikes, which hides IaaS provisioning time without over-provisioning.
explain how the solution works
input requests go to a request queue and are grouped into batches by the Batch Manager.
Proactive Controller:
estimate the maximum request rate in the near future.
provides an API for users to plug in their own prediction algorithm, but also ships a vanilla LSTM for multi-step workload prediction (predicting the workload multiple one-step intervals into the future), which is used by default.
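A minimal sketch of the multi-step idea, assuming a generic one-step predictor (the `one_step_model` callable here is illustrative, not MArk's actual prediction API):

```python
import numpy as np

def predict_multi_step(one_step_model, history, horizon):
    """Recursively predict `horizon` future request rates by feeding each
    one-step prediction back into the input window."""
    window = list(history)          # recent per-interval request rates
    predictions = []
    for _ in range(horizon):
        next_rate = float(one_step_model(np.asarray(window)))
        predictions.append(next_rate)
        window = window[1:] + [next_rate]   # slide the window forward
    return predictions

# The proactive controller then plans for the peak of the predicted window:
# max_rate = max(predict_multi_step(lstm, recent_rates, horizon=10))
```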
Batch Manager fetches requests from the queue and submits the batched requests if either of the two limits is reached:
suppose $T_b$ is the time needed to process a batch (obtained by profiling); then batching should satisfy the following 2 constraints:
tune the following hyperparameters by profiling:
then use a heuristic to tune the batch size: gradually increase the batch size from 1 until at least one of the constraints no longer holds
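A minimal sketch of this tuning heuristic. The two constraints coded below, (i) the time to accumulate a batch plus its processing time fits within the SLO and (ii) batch throughput keeps up with the arrival rate, are my reading of ch4.3 rather than a verbatim copy; `profile_batch_time` stands in for the profiling results.

```python
def tune_batch_size(arrival_rate, slo, profile_batch_time, max_batch=64):
    """Grow the batch size from 1 and stop just before a constraint breaks.

    profile_batch_time(b) -> T_b, the profiled processing time of a batch of b.
    """
    best = 1
    for b in range(1, max_batch + 1):
        t_b = profile_batch_time(b)
        accumulate_time = b / arrival_rate        # time to collect b requests at the current rate
        fits_slo = accumulate_time + t_b <= slo   # constraint (i), assumed
        keeps_up = b / t_b >= arrival_rate        # constraint (ii), assumed
        if fits_slo and keeps_up:
            best = b
        else:
            break
    return best
```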
dynamically decide how many instances to keep
at time $t_0$, given
main idea: use an online heuristic algorithm to determine which instances to launch and which to destroy at $t_0$ (and run the new set of instances from $t_0$ to $t_m$), so as to minimize cost while meeting the target SLO.
fill with the cheapest instances one by one, considering the cost of the running duration + launching overhead (no launching overhead for currently running instances). (sketch below)
comment:
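A sketch of the greedy fill described above, under assumed data structures (a `catalog` of profiled throughput/price/launch-overhead per instance type and a list of currently `running` instances); the exact ordering rules of the paper's algorithm may differ.

```python
def plan_instances(predicted_peak_rate, duration, catalog, running):
    """Greedily cover the predicted peak request rate with the cheapest
    capacity, charging launch overhead only to newly launched instances."""
    def cost_per_capacity(itype, is_running):
        spec = catalog[itype]
        billed = duration if is_running else duration + spec["launch"]
        return spec["price"] * billed / spec["throughput"]

    plan, remaining = [], predicted_peak_rate

    # Reuse currently running instances first (no launch overhead), cheapest first.
    for _, itype in sorted((cost_per_capacity(t, True), t) for t in running):
        if remaining <= 0:
            break                               # surplus running instances get destroyed
        plan.append(itype)
        remaining -= catalog[itype]["throughput"]

    # Top up any remaining demand with the cheapest new instance type.
    if remaining > 0:
        best_new = min(catalog, key=lambda t: cost_per_capacity(t, False))
        while remaining > 0:
            plan.append(best_new)
            remaining -= catalog[best_new]["throughput"]

    return plan     # instance set to run from t0 to tm
```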
constantly checks whether the last M requests satisfy the SLO requirements;
if not, L instances of type T will be launched (c5.large by default). (sketch below)
comment: what are L and T? Not explained in the paper. I assume that when the SLO requirement is not satisfied, they spawn some of the smallest instances (c5.large, since it is the most cost-effective in the motivation study) to handle the requests.
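A sketch of that reactive check; the window size, the violation test (any request over the SLO), and the launch call are all assumptions layered on the note above:

```python
def reactive_check(recent_latencies, slo, launch_fn,
                   window=100, burst_size=2, instance_type="c5.large"):
    """If any of the last `window` request latencies violates the SLO,
    launch a small burst of cheap instances via `launch_fn`."""
    recent = recent_latencies[-window:]
    if recent and max(recent) > slo:
        # the paper's L and T; the defaults above are guesses
        launch_fn(instance_type, count=burst_size)
```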
Spot Instance
Lambda cold start: if an incoming request cannot be served within a specified time $RT_{max}$, it is handled by Lambda immediately. [ch4.1] (sketch below)
cold start overhead: every time a new Lambda instance is launched, it needs to load the ML model, framework libraries, and code into memory.
however, based on their evaluation, the latency/cost impact of cold starts is limited, since
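A sketch of the $RT_{max}$ cutoff, approximating "cannot be served within $RT_{max}$" as "has already waited that long"; `invoke_lambda` stands in for the actual AWS Lambda invocation.

```python
import time

def route_request(request, queue, rt_max, invoke_lambda):
    """Queue a request for the batched IaaS path, unless it has already
    waited close to RT_max, in which case hand it to Lambda immediately."""
    waited = time.time() - request["enqueue_time"]
    if waited >= rt_max:
        return invoke_lambda(request)   # serverless backstop (may pay a cold start)
    queue.append(request)               # otherwise keep it on the IaaS path
    return None
```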
describe the experimental setup
summarize the main results
MArk-ondemand: only uses on-demand instances. MArk-spot: also uses spot instances with the interruption-tolerant mechanism in ch4.5
setup: on AWS. 42 c5 (CPU) instances, 10 m5 (CPU) instances, and 12 p2.xlarge (GPU) instances.
ML models: see Table 4. model size up to 343MB
workload (ML request rate): 1. Twitter: predictable real workload based on a Twitter trace. 2. MMPP: generated unpredictable spikes. See Fig. 5
baseline: AWS SageMaker
comment: did SageMaker support GPU instances at that time?
results:
when doesn't it work?
what assumptions does the paper make and when are they valid?
list of main competitors and how they differ
If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:
Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
Solve a generalization of the problem
Address one of the work's limitations
Solve the same problem in a different context
Solve the problem in a much larger scale
Apply the paper's methods to a different (but similar) problem
Solve a new problem created by this work
https://www.usenix.org/conference/atc19/presentation/zhang-chengliang