calls between microservices triggered by a user's request form a graph
online services are end-to-end (e2e) latency sensitive
microservices are overprovisioned to ensure performance
challenges:
it is difficult to do fine-grained scaling while satisfying SLA requirements: the microservice call graph is complicated, and e2e latency depends on all microservices within the graph.
microservice multiplexing: 40% of microservices are shared by >=100 online services, and these services have diverse workload patterns and different SLA requirements.
resource interference is non-uniform: performance is imbalanced across containers of the same microservice, and a microservice usually has hundreds to thousands of containers in the cluster.
a microservice's latency can be formulated as a piecewise linear function; the slope and cut-off point depend on workload and resource interference.
For each microservice, Erms performs offline profiling:
collect historical data including tail latency, workload (number of calls processed), CPU utilization, and memory utilization
apply an ML model (a decision tree) to learn the parameters of the piecewise linear function, determining the cut-off point (σ_i) between the two linear pieces
This model captures how the microservice's latency changes with workload, and how this relationship is affected by resource utilization (CPU and memory) [sec5.2]
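A minimal sketch of fitting the two-segment model from profiled data. The brute-force cut-off search and the name fit_piecewise_linear are illustrative only; Erms actually learns these parameters with a decision tree that also conditions on CPU/memory utilization [sec5.2].

```python
import numpy as np

def fit_piecewise_linear(workload, latency):
    # sort samples by workload
    order = np.argsort(workload)
    x = np.asarray(workload, dtype=float)[order]
    y = np.asarray(latency, dtype=float)[order]
    best_err, best_fit = np.inf, None
    # brute-force scan of candidate cut-off points, keeping a few samples
    # on each side so both least-squares line fits are well-posed
    for i in range(3, len(x) - 3):
        left = np.polyfit(x[:i], y[:i], 1)    # (slope, intercept) below sigma
        right = np.polyfit(x[i:], y[i:], 1)   # (slope, intercept) above sigma
        err = (np.sum((np.polyval(left, x[:i]) - y[:i]) ** 2)
               + np.sum((np.polyval(right, x[i:]) - y[i:]) ** 2))
        if err < best_err:
            best_err, best_fit = err, (x[i], left, right)  # sigma_i = x[i]
    return best_fit  # (sigma_i, left segment, right segment)
```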
Latency targets in Erms represent the maximum time allocated to each microservice to process a request while still meeting the overall end-to-end latency SLA [sec2.2]
It first simplifies the complex dependency graph into a graph with only sequential dependencies between virtual microservices.
use DFS to simplify the microservice graph: merge microservices with sequential or parallel calls into virtual microservices [sec4.2, Algorithm 1]
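A toy sketch of the DFS flattening idea, assuming each node carries a single scalar latency estimate. It keeps only the slowest parallel branch, which is cruder than Algorithm 1 (Erms merges the piecewise linear coefficients of all branches); Node and simplify are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    latency: float                 # scalar stand-in for the node's latency model
    children: list = field(default_factory=list)
    parallel: bool = True          # children invoked in parallel vs. sequentially

def simplify(node):
    # DFS: flatten the subtree rooted at `node` into one sequential chain
    # of virtual microservices
    chain = [node]
    if not node.children:
        return chain
    child_chains = [simplify(c) for c in node.children]
    if node.parallel:
        # toy merge rule: keep only the slowest parallel branch
        chain += max(child_chains, key=lambda ch: sum(n.latency for n in ch))
    else:
        # sequential calls: concatenate the child chains
        for ch in child_chains:
            chain += ch
    return chain
```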
It then computes latency targets for all microservices by solving a convex optimization problem that minimizes total resource usage while meeting SLA requirements.
Objective Function: Minimize the total resource usage: min Σ(n_i * R_i) [equation 2]
Constraint: Ensure the end-to-end latency meets the SLA: Σ(a_i * γ_i / n_i + b_i) ≤ SLA [equation 2,4]
calculate initial latency targets based on equation 5: Eq. (5) states that the optimal latency target of each microservice is proportional to the square root of the product of a_i, workload γ_i, and resource demand R_i.
a_i is the slope with respect to workload in the latency function of microservice i. It quantifies how sensitive the microservice's latency is to workload: a larger a_i means latency grows more rapidly as workload increases.
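The closed form behind Eq. (5) can be recovered with a Lagrange multiplier: substituting n_i = a_i·γ_i / (T_i − b_i) into the objective and fixing Σ T_i = SLA gives T_i = b_i + sqrt(a_i·γ_i·R_i) · (SLA − Σ_j b_j) / Σ_j sqrt(a_j·γ_j·R_j). A sketch under that derivation (the normalization constant is our reading, not quoted from the paper):

```python
import math

def latency_targets(a, b, gamma, R, sla):
    # weight_i = sqrt(a_i * gamma_i * R_i): the slack T_i - b_i is
    # allocated in proportion to this weight, per Eq. (5)
    weights = [math.sqrt(ai * gi * ri) for ai, gi, ri in zip(a, gamma, R)]
    slack = sla - sum(b)            # e2e budget left after fixed latencies b_i
    assert slack > 0, "SLA must exceed the sum of the b_i intercepts"
    total = sum(weights)
    return [bi + wi * slack / total for bi, wi in zip(b, weights)]
```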
3.1. Priority Level Assignment: based on the initial latency targets computed in step2, Erms assigns priority levels to the services sharing each microservice; services with lower latency targets at a shared microservice are given higher priority.
3.2. Priority-based Scheduling: whenever a thread is available in a deployed container and there are requests waiting to be processed, a request from a higher-priority service is assigned to this thread with higher probability. [sec5.3.2]
3.3. Latency Target Recomputation: for each service, Erms recomputes the latency targets of all its microservices, taking into account the priority assigned to each service at shared microservices (e.g., for a service with priority k, its new workload at a shared microservice is the sum of the workloads of all services with priorities 1 to k; see the sketch below).
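A sketch of steps 3.1-3.3 as stated above; the 1/priority weighting in pick_request is an assumption (the paper does not give the exact probabilities here), and all function names are hypothetical.

```python
import random

def assign_priorities(targets):
    # 3.1: targets maps service -> initial latency target at this shared
    # microservice; lower target => higher priority (priority 1 = highest)
    ranked = sorted(targets, key=targets.get)
    return {svc: rank + 1 for rank, svc in enumerate(ranked)}

def pick_request(queues, priorities):
    # 3.2: weighted random pick so higher-priority services win more often;
    # the 1/priority weighting is an assumption, not from the paper
    waiting = [s for s, q in queues.items() if q]
    weights = [1.0 / priorities[s] for s in waiting]
    winner = random.choices(waiting, weights=weights, k=1)[0]
    return queues[winner].pop(0)

def effective_workloads(workloads, priorities):
    # 3.3: a priority-k service sees the summed workload of all services
    # with priority 1..k; this inflated workload feeds the latency-target
    # recomputation
    return {svc: sum(w for s, w in workloads.items() if priorities[s] <= k)
            for svc, k in priorities.items()}

# e.g. assign_priorities({"A": 20.0, "B": 50.0}) -> {"A": 1, "B": 2}
#      effective_workloads({"A": 100, "B": 300}, {"A": 1, "B": 2})
#      -> {"A": 100, "B": 400}
```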
Resource Allocation: decide how many containers to allocate to each microservice, based on the initial latency targets and priorities. Each microservice container is configured with 0.1 CPU core and 200 MB memory. [sec4.3]
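A sketch of the container-count computation, assuming the per-microservice model T = a·γ/n + b from Eq. (4) is inverted for n; containers_needed is an illustrative name.

```python
import math

def containers_needed(a_i, gamma_i, b_i, target_i):
    # invert T = a * gamma / n + b  =>  n = a * gamma / (T - b),
    # rounded up to whole fixed-size (0.1 core, 200 MB) containers
    assert target_i > b_i, "latency target must exceed the intercept b_i"
    return math.ceil(a_i * gamma_i / (target_i - b_i))
```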
step4: Dynamic Resource Provisioning [sec5.4]
decides where to place the allocated containers across physical hosts to minimize resource interference and balance utilization.
For each host, resource unbalance is defined as the difference between that host's utilization and the cluster-wide average utilization.
use POP (Partitioned Optimization Problems) to solve this placement problem. The main idea is to statically divide the hosts of the cluster into multiple groups of the same size and solve a much smaller-scale optimization problem per group (see the sketch below).
https://dl.acm.org/doi/pdf/10.1145/3567955.3567964
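A sketch of the POP-style decomposition; the per-group "solver" here is just a greedy least-utilized-host heuristic standing in for the smaller optimization problem that POP/Erms actually solve per group. All names are illustrative.

```python
import random

def pop_place(hosts, containers, k):
    # hosts: host -> current utilization (fraction of capacity)
    # containers: list of container demands (same utilization units)
    host_ids = list(hosts)
    random.shuffle(host_ids)
    random.shuffle(containers)
    # statically split hosts (and the pending containers) into k equal groups
    groups = [host_ids[i::k] for i in range(k)]
    placement = {}
    for i, group in enumerate(groups):
        util = {h: hosts[h] for h in group}
        for c in containers[i::k]:
            # greedy stand-in for the per-group optimization: always place
            # on the least-utilized host, pushing hosts toward the mean
            h = min(util, key=util.get)
            util[h] += c
            placement.setdefault(h, []).append(c)
    return placement
```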