kyma-project / infrastructure-manager


[QG] Perform load and stress test to verify operator's behaviour under load #14

Open akgalwas opened 1 year ago

akgalwas commented 1 year ago

Description

We should verify how the operator behaves under load. To increase the stability and reliability of Infrastructure Manager, a performance test has to be implemented that covers common use cases. The goals are to measure our internally defined performance KPIs regularly (benchmarking/load test), verify the limits of the application (stress test), and detect performance-critical behaviour (e.g. memory leaks) before Infrastructure Manager gets deployed on a productive landscape.
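
As a starting point, a minimal load-driver sketch that creates many GardenerCluster CRs could look like the following. The API group/version, namespace, and naming scheme are assumptions, and the spec is omitted and would need to be filled in per the real CRD before use:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	cfg, err := config.GetConfig()
	if err != nil {
		log.Fatal(err)
	}
	c, err := client.New(cfg, client.Options{})
	if err != nil {
		log.Fatal(err)
	}

	// Assumed GVK of the GardenerCluster CRD; adjust to the real definition.
	gvk := schema.GroupVersionKind{
		Group:   "infrastructuremanager.kyma-project.io",
		Version: "v1",
		Kind:    "GardenerCluster",
	}

	const n = 1000 // target CR count for the load test (e.g. 1000 or 5000)
	for i := 0; i < n; i++ {
		cr := &unstructured.Unstructured{}
		cr.SetGroupVersionKind(gvk)
		cr.SetName(fmt.Sprintf("load-test-%04d", i))
		cr.SetNamespace("kcp-system") // assumed namespace
		// Note: the spec (e.g. the kubeconfig secret reference) is omitted here
		// and must be populated according to the actual CRD schema.
		if err := c.Create(context.Background(), cr); err != nil {
			log.Printf("create %d failed: %v", i, err)
		}
	}
}
```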

Acceptance criteria:

Reasons

Before deploying the operator to production we must know its performance characteristics.

Depends on

kyma-bot commented 10 months ago

This issue or PR has been automatically marked as stale due to the lack of recent activity. Thank you for your contributions.

This bot triages issues and PRs according to the following rules:

You can:

If you think that I work incorrectly, kindly raise an issue with the problem.

/lifecycle stale

Disper commented 9 months ago

The IM was running with ~420 CRs; we still have to perform tests with ~1000 and ~5000 CRs.

Disper commented 8 months ago

On 24.01.2024 we faced a situation on PROD where it took Infrastructure Manager 141 seconds to rotate the certificate after the GardenerCluster CR was created (internal issue reference - no. 5012).

Disper commented 8 months ago

Capturing here a valuable comment from @piotrmiskiewicz, from internal Slack:

In my opinion IM should have something like priorities. If the GardenerCluster is new, it should be processed as soon as possible. If the kubeconfig must be rotated, it can wait a few minutes. Please also analyse why there are so many GardenerCluster CRs to process, because changing the timeout from 2 to 3 minutes might not be enough.

I think we should first understand better how increased load impacts reconciliation time; the prioritisation above could then be a way to solve the issue.
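
To illustrate the prioritisation idea, a minimal sketch follows. It assumes a hypothetical lastKubeconfigSync status field and an illustrative rotation period; this is not the actual IM implementation:

```go
package controller

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

const rotationPeriod = 60 * time.Minute // illustrative rotation interval, not the real IM setting

// nextAction decides whether a GardenerCluster has to be processed right away
// (a brand-new CR without a kubeconfig yet) or can simply be requeued until
// its next rotation is due. lastSync is a hypothetical status field.
func nextAction(lastSync metav1.Time) (processNow bool, result ctrl.Result) {
	if lastSync.IsZero() {
		// New CR: create the kubeconfig as soon as possible.
		return true, ctrl.Result{}
	}
	if wait := time.Until(lastSync.Add(rotationPeriod)); wait > 0 {
		// Rotation not due yet: requeue instead of competing with new CRs.
		return false, ctrl.Result{RequeueAfter: wait}
	}
	// Rotation is overdue: process it now.
	return true, ctrl.Result{}
}
```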

piotrmiskiewicz commented 8 months ago

About the load impact: you could also consider adding some randomness to avoid such peaks. I don't know why the load was so high, but I can imagine that a randomized "rotation time" would reduce the peak.

tobiscr commented 8 months ago

Yep, we had a similar problem with load peaks in the reconciler. Adding jitter helped to distribute the load over time (e.g. https://github.com/octo/retry/blob/master/jitter.go, or a sample snippet from the reconciler).
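
In that spirit, a tiny sketch of applying jitter to the requeue interval could look like this (names are illustrative; spread must be greater than zero):

```go
package controller

import (
	"math/rand"
	"time"
)

// withJitter shifts a requeue interval by a random offset in [-spread, +spread)
// so that CRs created at the same moment do not all come up for rotation at once.
func withJitter(base, spread time.Duration) time.Duration {
	offset := time.Duration(rand.Int63n(int64(2*spread))) - spread
	return base + offset
}
```

A reconciler could then return something like ctrl.Result{RequeueAfter: withJitter(rotationPeriod, 10*time.Minute)} instead of a fixed interval, which spreads the rotation work over a window rather than producing a single peak.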