Introduce a random jitter in the Manifest CR loop requeue interval

Tomasz-Smelcerz-SAP commented 1 month ago

Description

Currently we requeue items in the Manifest reconciliation loop using a fixed interval. This has a side effect: most of the objects are processed at the same fixed intervals, resulting in a fixed "processing frequency". We observe regular peaks in CPU utilization, queue depth, active worker count, etc. This is inefficient: we require over 3 CPU cores at the peaks, and less than 1 core between the peaks. On average, we consume up to 2 cores. If we "smooth out" our queuing schedule, we should be able to reduce the required resources (CPU count) by about 1/3.

Reasons

Improve system behaviour

Acceptance Criteria

[x] The scheduling algorithm is configurable via runtime flags
[x] The requeue of items is spread more evenly over time (no distinct peaks of processing). See the picture in the comment.

Feature Testing

No response

Testing approach

No response

Attachments

This is how it currently looks - inefficient CPU utilization:

Tomasz-Smelcerz-SAP commented 1 month ago

This is what we want to achieve:

Tomasz-Smelcerz-SAP commented 1 month ago

I have performed a simulation of the effect of random jitter on the scheduling behavior, in the idealized case. The goal was to find out whether the jitter solution works and how long it takes to make the object scheduling times "uniform enough".

Assumptions:

1000 objects in the virtual queue
the "scheduling time" is by default set to five minutes
no processing time - object is immediately re-scheduled
scheduling of one object doesn't affect scheduling of any other objects. Every object is processed as if there are no other objects in the system.

The random jitter was introduced in the following way. When scheduling an object, first decide whether to introduce the jitter at all; the probability of doing so is the p-decide parameter. If not, just use the standard time (5 minutes). If so, modify the scheduling time (5 minutes) by a random jitter with some maximum value, given by the p-jitter parameter. For example, if p-jitter is 2%, an initial value of 1000 changes into a new value from the range <980...1020>.
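The two-step rule above can be sketched as follows. This is only an illustration, not the code used in the simulation or in lifecycle-manager: the function name `next_requeue_interval` is hypothetical, and drawing the jitter from a uniform distribution is an assumption (the description only states that the jitter has a maximum value).

```python
import random


def next_requeue_interval(base_seconds, p_decide, p_jitter, rng=random):
    """Return the next requeue interval in seconds.

    base_seconds: the standard (fixed) interval, e.g. 300 for 5 minutes.
    p_decide:     probability of applying any jitter at all (0.02 == 2%).
    p_jitter:     maximum jitter as a fraction of base_seconds (0.02 == 2%).
    rng:          source of randomness; pass random.Random(seed) for repeatable runs.
    """
    # Step 1: with probability (1 - p_decide) keep the standard interval.
    if rng.random() >= p_decide:
        return base_seconds
    # Step 2: shift the interval by a random offset of at most
    # +/- p_jitter * base_seconds (uniform distribution is an assumption).
    offset = rng.uniform(-p_jitter, p_jitter) * base_seconds
    return base_seconds + offset
```

With `base_seconds=1000` and `p_jitter=0.02`, a jittered result falls in the range 980 to 1020, matching the example above.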

The following graphs were collected using p-decide = p-jitter = 0.02 (two percent). Every graph is a histogram of object scheduling times during a 4h window. The first graph starts with a peak of 1000 objects scheduled at exactly the same time: this is the initial schedule for all objects (the starting point).
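A simulation under the stated assumptions (independent objects, no processing time, all objects starting at the same instant) could look roughly like the sketch below. It is not the script that produced the graphs, and its output is not expected to match them exactly; the function name `simulate_histogram`, the bucket width, and the uniform jitter distribution are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def simulate_histogram(num_objects, base, p_decide, p_jitter,
                       window_start, window_end, bucket, seed=0):
    """Count scheduling events per time bucket inside [window_start, window_end).

    All times are in seconds. Every object is simulated on its own,
    as if no other objects existed, and is requeued immediately.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_objects):
        t = 0.0  # initial schedule: every object starts at the same time
        while t < window_end:
            if t >= window_start:
                counts[int((t - window_start) // bucket)] += 1
            interval = base
            # Apply jitter only with probability p_decide (uniform is an assumption).
            if rng.random() < p_decide:
                interval += rng.uniform(-p_jitter, p_jitter) * base
            t += interval
    n_buckets = math.ceil((window_end - window_start) / bucket)
    return [counts[i] for i in range(n_buckets)]


# Example: a 4h window starting after 24h, 5-minute interval, 10-second buckets.
hist = simulate_histogram(1000, 300.0, 0.02, 0.02,
                          24 * 3600, 28 * 3600, 10)
```

Plotting `hist` for successive 4h windows gives one histogram per window; comparing the tallest bucket between windows is one simple way to judge how far the initial peak has spread.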

Tomasz-Smelcerz-SAP commented 1 month ago

Conclusion:

Introducing a really minimal jitter (2% chance of adding a jitter, max 2% jitter) for the first 24 hours is enough to spread object processing times so that no visible spikes are present. There are some low-frequency oscillations still present, but these are small enough (and wide enough) that I consider them acceptable.

This has been simulated under idealized conditions. In reality, object processing times have some inherent variability that reduces the processing "spikes": natural spreading occurs. But this mechanism, as observed, was not able to eliminate all processing spikes even after a week of running in the actual environment. Introducing some random jitter shortens this time significantly, especially for bigger jitter values, e.g. 4% instead of 2%.

The jitter algorithm itself may be uptime-controlled. The simulation shows there's no need to apply it after the first "n" hours (24h in the presented case of 2% jitter). This means we can skip adding the jitter (or reduce it to zero) after some initial time. For simplicity of implementation, though, we can also leave it running all the time. The performance cost of applying the jitter is negligible.
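One possible shape of the uptime-controlled variant is sketched below. The function name `uptime_controlled_interval` and the `jitter_window_seconds` parameter are hypothetical, and the uniform jitter distribution is again an assumption; the 24h default only mirrors the simulated 2% case.

```python
import random


def uptime_controlled_interval(base_seconds, uptime_seconds,
                               p_decide=0.02, p_jitter=0.02,
                               jitter_window_seconds=24 * 3600, rng=random):
    """Return the requeue interval, applying jitter only during the initial window."""
    # After the initial window the schedule is assumed to be spread out
    # already, so fall back to the fixed interval.
    if uptime_seconds >= jitter_window_seconds:
        return base_seconds
    # Inside the window: jitter with probability p_decide, else fixed interval.
    if rng.random() >= p_decide:
        return base_seconds
    return base_seconds * (1.0 + rng.uniform(-p_jitter, p_jitter))
```

Setting `jitter_window_seconds` to a very large value corresponds to the simpler option of leaving the jitter running all the time.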

kyma-project / lifecycle-manager