kyma-project / lifecycle-manager

Controller that manages the lifecycle of Kyma Modules in your cluster.
http://kyma-project.io
Apache License 2.0

Introduce a random jitter in the Manifest CR loop requeue interval #1688

Closed Tomasz-Smelcerz-SAP closed 4 days ago

Tomasz-Smelcerz-SAP commented 1 month ago

Description

Currently we re-queue items in the Manifest reconciliation loop using a fixed interval. This has a side effect: most of the objects are processed at fixed intervals, resulting in a fixed "processing frequency". We observe regular peaks in CPU utilization, queue depth, active worker count, etc. This is inefficient: we require over 3 CPU cores at the peaks and less than 1 core between the peaks. On average, we consume up to 2 cores. If we "smooth out" our queuing schedule, we should be able to reduce resource consumption (CPU count) by about 1/3, from the 3 cores needed at peaks to roughly the 2-core average.

Related issue: https://github.com/kyma-project/lifecycle-manager/issues/1684

Reasons

Acceptance Criteria

Feature Testing

No response

Testing approach

No response

Attachments

This is how it currently looks - inefficient CPU utilization:

image
Tomasz-Smelcerz-SAP commented 1 month ago

This is what we want to achieve:

image
Tomasz-Smelcerz-SAP commented 1 month ago

I have simulated the effect of random jitter on the scheduling behavior in the idealized case. The goal was to find out whether the jitter solution works and how long it takes to make the object scheduling times "uniform enough".

Assumptions:

The random jitter was introduced as follows. When scheduling an object, first decide whether to apply jitter at all; the probability of applying it is the p-decide parameter. If no jitter is applied, use the standard interval (5 minutes). If jitter is applied, modify the scheduling interval (5 minutes) by a random amount of up to a maximum fraction of that interval; this fraction is the p-jitter parameter. For example, with a p-jitter of 2%, an initial value of 1000 changes into a new value in the range 980 to 1020.

The following graphs were collected using p-decide = p-jitter = 0.02 (two percent). Every graph is a histogram of object scheduling times during a 4h window. The first graph starts with a peak of 1000 objects scheduled at exactly the same time: it's the initial schedule for all objects (the starting point).

[9 images: histograms of object scheduling times, one per 4h window]

Tomasz-Smelcerz-SAP commented 1 month ago

Conclusion:

Introducing a really minimal jitter (2% chance of adding jitter, max 2% jitter) for the first 24 hours is enough to spread object processing times so that no visible spikes are present. Some low-frequency oscillations are still present, but they are small enough (and wide enough) that I consider them acceptable.

This has been simulated under idealized conditions. In reality, object processing times have some inherent variability that reduces the processing "spikes": natural spreading occurs. But this mechanism, as observed, was not able to eliminate all processing spikes even after a week of running in the actual environment. Introducing some random jitter shortens this time significantly, especially for bigger jitter values, e.g. 4% instead of 2%.

The jitter algorithm itself may be controlled by uptime. The simulation shows there's no need to apply it after the first "n" hours (24h in the presented case of 2% jitter). This means we can skip adding the jitter (or reduce it to zero) after some initial time. For simplicity of implementation, though, we can also leave it running all the time: the performance cost of applying the jitter is negligible.