Closed: thunderboltsid closed this pull request 6 months ago
Attention: Patch coverage is 35.08772%, with 37 lines in your changes missing coverage. Please review.
Project coverage is 28.37%. Comparing base (8228f42) to head (00a4b6c).
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: adiantum, dkoshkin, thunderboltsid
During a recent incident, we observed that creating a new Nutanix client for each request results in basic authentication being performed for every request. This places unnecessary stress on IAM services, and was particularly problematic when the IAM services were already in a degraded state, prolonging recovery efforts. Each basic-auth request is processed through the entire IAM stack, significantly increasing load and impacting performance.
It's recommended that the client use session-auth cookies instead of basic auth for requests to Prism Central where possible. Given how the CAPX controller currently works, a new client is created per reconcile cycle. In https://github.com/nutanix-cloud-native/cluster-api-provider-nutanix/pull/398 we switched to using Session-Auth instead of Basic-Auth. However, switching from Basic-Auth to Session-Auth alone doesn't solve the problem of repeated Basic-Auth calls: each time a client is created, i.e. on every reconcile cycle, it still results in one Basic-Auth call to `/users/me` to fetch the session cookie.
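To illustrate the idea, here is a minimal, self-contained sketch (not the actual CAPX or prism-go-client code) of reusing a session cookie obtained from a single basic-auth call to `/users/me`; the `newSessionClient` function and the `prismURL`, `username`, and `password` parameters are placeholders for illustration.

```go
package prismclient

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
)

// newSessionClient performs a single basic-auth request against /users/me and
// keeps the returned session cookie in a cookie jar, so later requests made
// with the same client are authenticated by the cookie instead of repeating
// basic auth for every call.
func newSessionClient(prismURL, username, password string) (*http.Client, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	client := &http.Client{Jar: jar}

	// One basic-auth call to fetch the session cookie.
	req, err := http.NewRequest(http.MethodGet, prismURL+"/users/me", nil)
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(username, password)

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("session setup failed: %s", resp.Status)
	}

	// The session cookie from the response is now stored in the jar and is
	// sent automatically on subsequent requests made with this client.
	return client, nil
}
```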
to fetch the session cookie. To alleviate this, we are adding a cache of clients and reusing the client from the cache across reconciliation cycles of the same cluster for both the NutanixCluster and NutanixMachine reconciliation.In a large-scale setup of 40+ clusters w/ 4 nodes each, we were able to see a noticeable drop in QPS to the IAM stack for the
oidc/token
calls. Before the client caching, a controller restart led to 10+ QPS onoidc/token
endpoint with a steady state at around 0.5 QPS. After deploying the client cache changes, we saw a peak of ~3 QPS as caches warmed up and dropped to 0 QPS afterwards with sporadic requests only when session token refresh was needed. As we can see, with the changes proposed in this document, we were able to reduce the number of high-impact calls to IAM significantly.
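For reference, a per-cluster client cache could look like the following minimal sketch; the `Cache` type, its `GetOrCreate`/`Delete` methods, and the `build` constructor callback are illustrative assumptions rather than the actual CAPX implementation.

```go
package client

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// Client is a placeholder for the session-authenticated Prism Central client
// used by the controllers; in CAPX this would be the prism-go-client client.
type Client interface{}

// Cache keeps one client per CAPI cluster so that NutanixCluster and
// NutanixMachine reconciles of the same cluster reuse the same session
// instead of re-authenticating on every reconcile.
type Cache struct {
	mu      sync.Mutex
	clients map[types.NamespacedName]Client
}

func NewCache() *Cache {
	return &Cache{clients: make(map[types.NamespacedName]Client)}
}

// GetOrCreate returns the cached client for the cluster, building and caching
// a new one (which performs the single basic-auth call) only on a cache miss.
func (c *Cache) GetOrCreate(cluster types.NamespacedName, build func() (Client, error)) (Client, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cl, ok := c.clients[cluster]; ok {
		return cl, nil
	}
	cl, err := build()
	if err != nil {
		return nil, err
	}
	c.clients[cluster] = cl
	return cl, nil
}

// Delete removes the cached client, e.g. when the cluster is deleted or its
// credentials change, so the next reconcile re-authenticates.
func (c *Cache) Delete(cluster types.NamespacedName) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.clients, cluster)
}
```

In this sketch, both reconcilers would call `GetOrCreate` with the same cluster key and invalidate the entry on credential rotation or cluster deletion, so only a cache miss triggers a new Basic-Auth call.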