neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

ca/auth: update Azure token when invalid #1134

Closed chaporgin closed 1 week ago

chaporgin commented 2 weeks ago

This changes the version of cluster autoscaler from tag cluster-autoscaler-1.27.8 to branch cluster-autoscaler-release-1.28, commit 10a229ac17ea8049248d1c3ce2923b94a4f9085c. Motivation:

We get an occasional error in Azure:

E1106 12:08:11.509971       1 azure_manager.go:177] Failed to regenerate Azure cache: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/REDUCTED/resourceGroups/MC_dev-eastus2-aks2_dev-azure-eastus2-aks2_eastus2/providers/Microsoft.Compute/virtualMachineScaleSets?api-version=2022-03-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {"error":"invalid_client","error_description":"AADSTS700024: Client assertion is not within its valid time range. Current time: 2024-11-06T12:08:11.4851735Z, assertion valid from 2024-11-04T18:55:21.0000000Z, expiry time of assertion 2024-11-04T19:55:21.0000000Z. Review the documentation at https://learn.microsoft.com/entra/identity-platform/certificate-credentials . Trace ID: 1c8e947d-f154-4052-9e8a-8529877f7c00 Correlation ID: b04f2ef9-09f7-4f4e-80f8-15d313c5568f Timestamp: 2024-11-06 12:08:11Z","error_codes":[700024],"timestamp":"2024-11-06 12:08:11Z","trace_id":"1c8e947d-f154-4052-9e8a-8529877f7c00","correlation_id":"b04f2ef9-09f7-4f4e-80f8-15d313c5568f","error_uri":"https://login.microsoftonline.com/error?code=700024"} Endpoint https://login.microsoftonline.com/c8350122-1697-4543-929a-d4a75d1bb552/oauth2/token?api-version=1.0

CA seems to have fixed this in recent versions by switching to the cloud-provider-azure package, which has a callback that rereads the JWT when needed. The fix is already present in the cluster-autoscaler-release-1.28 branch, but not in the cluster-autoscaler-1.28.6 tag that I used previously in https://github.com/neondatabase/autoscaling/commit/26d39a6beea7c0921069e53f512bdf962bb26545. The code at that tag reads the JWT from the filesystem only once and does not account for AKS occasionally replacing it.

Are we OK with versioning this as neondatabase/cluster-autoscaler-neonvm:k8s-1.28-2024-10-07?

https://github.com/neondatabase/cloud/issues/18284

edude03 commented 2 weeks ago

> Are we OK with versioning this as neondatabase/cluster-autoscaler-neonvm:k8s-1.28-2024-10-07?

I think that's fine. Although if I were going to be super nitpicky, I'd love it if we had our own version as part of the tag, like CA-0.1-k8s-128-api, but that's basically bikeshedding.

chaporgin commented 1 week ago

> I'd love if we had our own version as part of the tag like CA-0.1-k8s-128-api

Noted on the version format. I will apply it in the next iteration, if there is one.