Open akgalwas opened 1 year ago
The Infrastructure Manager (IM) was working on ~420 CRs, and we still have to perform tests on ~1000 and ~5000 CRs.
On 24.01.2024 we faced a situation on PROD where it took the Infrastructure Manager 141 seconds from Gardener Cluster CR creation to rotate the certificate (internal issue reference no. 5012).
Capturing here a valuable comment from @piotrmiskiewicz, from internal Slack:
In my opinion, IM should have something like priorities. If the GardenerCluster is new, it should be processed as soon as possible. If the Kubeconfig must be rotated, it can wait a few minutes. Please analyze why there are so many GardenerCluster CRs to process, because changing the timeout from 2 to 3 minutes might also not be enough.
I think we should first get a better understanding of how increased load impacts the reconciliation time; the above could be a way to solve the issue.
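To make the priority idea above concrete, here is a minimal standalone sketch using Go's `container/heap`. It is not how the Infrastructure Manager or controller-runtime schedules work; the `workItem` type, the two priority levels, and the `drain` helper are all hypothetical names made up for illustration. It only shows that newly created clusters would be handled before pending rotations, with FIFO order inside each priority.

```go
package main

import "container/heap"

// priority is a hypothetical ordering: a lower value is processed first.
type priority int

const (
	priorityNewCluster priority = iota // newly created GardenerCluster
	priorityRotation                   // kubeconfig rotation, can wait
)

// workItem is an illustrative stand-in for a reconcile request.
type workItem struct {
	name     string
	priority priority
	seq      int // insertion order, keeps FIFO within one priority
}

// workQueue implements heap.Interface over workItem values.
type workQueue []workItem

func (q workQueue) Len() int { return len(q) }

func (q workQueue) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority < q[j].priority
	}
	return q[i].seq < q[j].seq
}

func (q workQueue) Swap(i, j int) { q[i], q[j] = q[j], q[i] }

func (q *workQueue) Push(x any) { *q = append(*q, x.(workItem)) }

func (q *workQueue) Pop() any {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

// drain enqueues all items and returns their names in processing order.
func drain(items []workItem) []string {
	q := &workQueue{}
	for i, it := range items {
		it.seq = i
		heap.Push(q, it)
	}
	order := make([]string, 0, len(items))
	for q.Len() > 0 {
		order = append(order, heap.Pop(q).(workItem).name)
	}
	return order
}
```

Calling `drain` with a mix of rotation and new-cluster items returns all new-cluster names first. Whether something like this fits the operator depends on how its work queue is wired, which still needs to be analyzed.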
About load impact: you could also consider randomized values to avoid such peaks. I don't know why the load was so high, but I can imagine that a randomized "rotation time" could flatten the peak.
Yip, we had a similar problem with load peaks in the reconciler. Adding jitter helped to distribute the load over time (e.g. https://github.com/octo/retry/blob/master/jitter.go or a sample snippet from reconciler)
Description
We should verify how the operator behaves under load. To increase the stability and reliability of the Infrastructure Manager, a performance test has to be implemented that verifies common use cases. The goal is to regularly measure our internally defined performance KPIs (benchmarking/load test), verify the limits of the application (stress test), and detect performance-critical behaviours (memory leaks etc.) before the Infrastructure Manager gets deployed on a productive landscape.
Acceptance criteria:
Reasons
Before deploying the operator on production, we must know its performance characteristics.
Depends on