kyma-project / infrastructure-manager


[QG] Perform load and stress test to verify operator's behaviour under load #14

Open akgalwas opened 1 year ago

akgalwas commented 1 year ago

Description

We should verify how the operator behaves under load. To increase the stability and reliability of Infrastructure Manager, a performance test has to be implemented that covers common use cases. The goals are to measure our internally defined performance KPIs regularly (benchmarking/load test), verify the limits of the application (stress test), and detect performance-critical behaviour (e.g. memory leaks) before Infrastructure Manager gets deployed on a productive landscape.
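
As a starting point, a minimal load-driver sketch that creates many GardenerCluster CRs could look like the following. The API group/version, namespace, and naming scheme are assumptions, and the spec is omitted and would need to be filled in per the real CRD before use:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	cfg, err := config.GetConfig()
	if err != nil {
		log.Fatal(err)
	}
	c, err := client.New(cfg, client.Options{})
	if err != nil {
		log.Fatal(err)
	}

	// Assumed GVK of the GardenerCluster CRD; adjust to the real definition.
	gvk := schema.GroupVersionKind{
		Group:   "infrastructuremanager.kyma-project.io",
		Version: "v1",
		Kind:    "GardenerCluster",
	}

	const n = 1000 // target CR count for the load test (e.g. 1000 or 5000)
	for i := 0; i < n; i++ {
		cr := &unstructured.Unstructured{}
		cr.SetGroupVersionKind(gvk)
		cr.SetName(fmt.Sprintf("load-test-%04d", i))
		cr.SetNamespace("kcp-system") // assumed namespace
		// Note: the spec (e.g. the kubeconfig secret reference) is omitted here
		// and must be populated according to the actual CRD schema.
		if err := c.Create(context.Background(), cr); err != nil {
			log.Printf("create %d failed: %v", i, err)
		}
	}
}
```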

Acceptance criteria:

Reasons

Before deploying the operator to production we must know its performance characteristics.

Depends on

kyma-bot commented 10 months ago

This issue or PR has been automatically marked as stale due to the lack of recent activity. Thank you for your contributions.

This bot triages issues and PRs according to the following rules:

You can:

If you think that I work incorrectly, kindly raise an issue with the problem.

/lifecycle stale

Disper commented 9 months ago

The IM was running with ~420 CRs; we still have to perform tests with ~1000 and ~5000 CRs.

Disper commented 8 months ago

On 24.01.2024 we faced a situation on PROD where it took Infrastructure Manager 141 seconds to rotate the certificate after the GardenerCluster CR was created (internal issue reference - no. 5012).

Disper commented 8 months ago

Capturing here a valuable comment from @piotrmiskiewicz, from internal Slack:

In my opinion IM should have something like priorities. If the GardenerCluster is new, it should be processed as soon as possible. If the kubeconfig must be rotated, it can wait a few minutes. Please also analyse why there are so many GardenerCluster CRs to process, because changing the timeout from 2 to 3 minutes might not be enough.

I think we should first understand better how increased load impacts reconciliation time; the prioritisation above could then be a way to solve the issue.
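
To illustrate the prioritisation idea, a minimal sketch follows. It assumes a hypothetical lastKubeconfigSync status field and an illustrative rotation period; this is not the actual IM implementation:

```go
package controller

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

const rotationPeriod = 60 * time.Minute // illustrative rotation interval, not the real IM setting

// nextAction decides whether a GardenerCluster has to be processed right away
// (a brand-new CR without a kubeconfig yet) or can simply be requeued until
// its next rotation is due. lastSync is a hypothetical status field.
func nextAction(lastSync metav1.Time) (processNow bool, result ctrl.Result) {
	if lastSync.IsZero() {
		// New CR: create the kubeconfig as soon as possible.
		return true, ctrl.Result{}
	}
	if wait := time.Until(lastSync.Add(rotationPeriod)); wait > 0 {
		// Rotation not due yet: requeue instead of competing with new CRs.
		return false, ctrl.Result{RequeueAfter: wait}
	}
	// Rotation is overdue: process it now.
	return true, ctrl.Result{}
}
```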

piotrmiskiewicz commented 8 months ago

About the load impact: you could also consider adding some randomness to avoid such peaks. I don't know why the load was so high, but I can imagine that a randomized "rotation time" would reduce the peak.

tobiscr commented 8 months ago

Yep, we had a similar problem with load peaks in the reconciler. Adding jitter helped to distribute the load over time (e.g. https://github.com/octo/retry/blob/master/jitter.go, or a sample snippet from the reconciler).
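
In that spirit, a tiny sketch of applying jitter to the requeue interval could look like this (names are illustrative; spread must be greater than zero):

```go
package controller

import (
	"math/rand"
	"time"
)

// withJitter shifts a requeue interval by a random offset in [-spread, +spread)
// so that CRs created at the same moment do not all come up for rotation at once.
func withJitter(base, spread time.Duration) time.Duration {
	offset := time.Duration(rand.Int63n(int64(2*spread))) - spread
	return base + offset
}
```

A reconciler could then return something like ctrl.Result{RequeueAfter: withJitter(rotationPeriod, 10*time.Minute)} instead of a fixed interval, which spreads the rotation work over a window rather than producing a single peak.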