tshaiman closed this issue 1 year ago
Hello! Nice catch! KEDA queries the queues even if there aren't any pods, but the client should definitely be reused in the scaler.
Are you willing to contribute the fix?
I think that a small refactor of this function would be enough. Something like this:
func (s *azureServiceBusScaler) getServiceBusAdminClient() (*admin.Client, error) {
	// Reuse the cached client if it was already created
	if s.client != nil {
		return s.client, nil
	}

	var err error
	var client *admin.Client
	switch s.podIdentity.Provider {
	case "", kedav1alpha1.PodIdentityProviderNone:
		client, err = admin.NewClientFromConnectionString(s.metadata.connection, nil)
	case kedav1alpha1.PodIdentityProviderAzure, kedav1alpha1.PodIdentityProviderAzureWorkload:
		creds, chainedErr := azure.NewChainedCredential(s.podIdentity.IdentityID, s.podIdentity.Provider)
		if chainedErr != nil {
			return nil, chainedErr
		}
		client, err = admin.NewClient(s.metadata.fullyQualifiedNamespace, creds, nil)
	default:
		err = fmt.Errorf("incorrect podIdentity type")
	}

	// Cache the client on the scaler so later calls reuse it
	s.client = client
	return client, err
}
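For reference, the snippet above assumes the scaler struct carries a field for the cached client, roughly along these lines (a sketch with assumed names, not the exact upstream definition):

import (
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus/admin"
	kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1"
)

// Sketch only; the real struct in azure_servicebus_scaler.go has more fields
// (parsed trigger metadata, logger, etc.).
type azureServiceBusScaler struct {
	client      *admin.Client                // cached admin client, nil until the first call
	podIdentity kedav1alpha1.AuthPodIdentity // pod identity settings from the TriggerAuthentication
	// ... other fields omitted ...
}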
Great find!
Hey,
I want to tackle this issue
oh man you are quick i was just about to submit a PR 😂
you snooze you lose
@JorTurFer / @tomkerkhove / @xoanmm : I'm afraid that the fix in https://github.com/kedacore/keda/pull/4273 is now causing another issue.
I took the fix for a test drive (built a custom Docker image) and after 1 hour I started getting a Pod Identity exception. I think this is because Azure tokens are valid for 1 hour and the code does not take this into account but keeps using the cached admin client.
Peeking at the identity code in KEDA, and with knowledge of how the Azure Identity library works, I have to say this entire identity issue needs to be discussed.
Having said that, the fix is not working for a pod that is long lived.
What error do you see? In theory, the Service Bus SDK should call GetToken when the token has expired. I mean, maybe there is another error, but I think this implementation is correct, as we should cache the client instead of creating a new one each time. KEDA also recreates the scaler when there is any error getting the metrics, so even if the client is cached in the scaler, the whole scaler is regenerated on any error. I think the error is somewhere else.
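If it helps to confirm that, here is a debugging sketch (not anything that exists in KEDA): wrap the credential that is passed to admin.NewClient so every GetToken call gets logged. azcore.TokenCredential and policy.TokenRequestOptions are the real SDK types; the wrapper itself is hypothetical.

import (
	"context"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
)

// loggingCredential is a hypothetical debugging wrapper: it satisfies
// azcore.TokenCredential and logs every GetToken call, so you can see
// whether the Service Bus SDK asks for a fresh token when the cached
// one approaches expiry.
type loggingCredential struct {
	inner azcore.TokenCredential
}

func (c *loggingCredential) GetToken(ctx context.Context, opts policy.TokenRequestOptions) (azcore.AccessToken, error) {
	tok, err := c.inner.GetToken(ctx, opts)
	if err != nil {
		log.Printf("GetToken failed for scopes %v: %v", opts.Scopes, err)
		return azcore.AccessToken{}, err
	}
	log.Printf("GetToken succeeded for scopes %v, expires at %s", opts.Scopes, tok.ExpiresOn)
	return tok, nil
}

Passing &loggingCredential{inner: creds} to admin.NewClient instead of creds should make it visible in the operator logs whether a token refresh ever happens after the first hour.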
I see the same old, most common error: KEDA does not have a valid token to use. I do agree that we should cache the client, but I was looking at how tokens are requested and used in KEDA, which made me rethink the entire issue we see here (and the infamous Workload Identity bugs we saw a few months ago). I started a discussion here: https://github.com/kedacore/keda/discussions/4279
I agree that the SDK needs to call GetToken behind the scenes, but having looked at the custom identity implementation/wrapper in KEDA, as described in the discussion, I'm not sure that will actually happen, since the well-defined contract of the Azure Identity library is not being followed in this case.
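For context, my reading of that contract (a sketch of the conventional azidentity composition, not code from KEDA; the helper name is made up): you build standard azidentity credentials, chain them, and hand the chain to the client, which then caches the token and calls GetToken on the chain again whenever it needs a fresh one.

import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus/admin"
)

// newAdminClientWithChain is a hypothetical helper showing the standard
// composition: workload identity first, then managed identity, wrapped in
// a chained credential. The admin client caches the token internally and
// calls GetToken on the chain again when it needs a new one.
func newAdminClientWithChain(namespace string) (*admin.Client, error) {
	wi, err := azidentity.NewWorkloadIdentityCredential(nil)
	if err != nil {
		return nil, err
	}
	mi, err := azidentity.NewManagedIdentityCredential(nil)
	if err != nil {
		return nil, err
	}
	chain, err := azidentity.NewChainedTokenCredential([]azcore.TokenCredential{wi, mi}, nil)
	if err != nil {
		return nil, err
	}
	return admin.NewClient(namespace, chain, nil)
}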
I'm deploying that change into my cluster right now, let's wait until tomorrow to check how it looks. What are you using? Workload Identity or AAD-Pod-Identity?
AAD Pod Identity, as Workload Identity had a bug related to their webhook a few months ago.
I don't see any error for the moment (76 min):
I'll keep both apps (aad-pod-identity and wi) deployed some days to check if there is any error
Nothing yet (125 min), let's see after 12-24 hours.
What I did in my test was to apply only the fixed code (saving the admin client) onto v2.9.3; that could be insufficient if more fixes have been merged to master.
Nice tool! Is this Lens?
Yes, it's OpenLens (an open source fork of Lens). It isn't perfect and sometimes I still need kubectl, but for the majority of cases it works really nicely.
Same behavior after 11 hours. I guess we need to wait 12-24 hours to ensure that the token is regenerated.
BTW, do you see any error in the aad-pod-identity pods? As it's an authentication proxy, maybe there is an error there that gives more info about the problem.
Is it 2.9.3 with the fix, or is it the 2.9.4 candidate? What I did is apply the fix for this bug to 2.9.3, which may be incorrect. Can I get an image tag to test it?
I am using the 'main' tag because it's the only one that contains the fix for the moment (the 'main' tag contains the next version's changes).
Be careful trying that image because it contains the certificate management (a new feature introduced to improve security), so you can't just update the tag; there are required changes in the RBAC to allow this management. Another option is to manage the certificates from your side, but that's more complicated. The next version's docs explain it: https://keda.sh/docs/2.10/operate/security/#use-your-own-tls-certificates And we will release a blog post explaining how to do it with cert-manager after the release.
OK, that is too much for now indeed. I will just save the auth client on my 2.9.3 tag and rebuild with 'make container-build'.
I don't want you to spend time on this if it works after 12 hours - most likely I didn't build correctly.
will update shortly
21H without any error:
Tomorrow morning I'll check if there is any error after 24h. I guess that if tomorrow this still works, the access token renewal is done correctly and I can close the issue again 😄
I created 2 ScaledObjects, one for aad-pod-identity and the other one for aad-wi, so both are covered by this test.
@JorTurFer: I conducted my tests as well and confirmed it is working correctly. The problem was running against "workload-identity", which is still not fixed completely by the workload-identity team regarding the webhook and its policy.
Thanks for all your efforts and clarifications.
Nice!
Keda Service Bus Scaler keeps asking for ARM CheckAccess on Service Bus -> no reuse of the AdminClient
We saw a significant increase in CheckAccess calls to the Azure ARM endpoint, which verify that the request has access to the Service Bus.
azure_servicebus_scaler.go
A Service Bus admin client is created, but it is not assigned to any field on "s" (azureServiceBusScaler), so each time a check is made to see if s.client is initialized, it is always nil, and hence a new client is created over and over again.
Expected Behavior
Actual Behavior
The admin.Client is created on every "getQueueLength" request, hence causing Azure ARM calls to be throttled with "429 - Too many requests".
When we queried the Azure ARM endpoint for investigation, we found the following:
HttpIncomingRequests
| where TIMESTAMP > ago(1d)
| where subscriptionId == "24237b72-8546-4da5-b204-8c3cb76dd930"
| summarize count() by operationName
| order by count_ desc
POST/SUBSCRIPTIONS/RESOURCEGROUPS/PROVIDERS/MICROSOFT.SERVICEBUS/NAMESPACES/QUEUES/PROVIDERS/MICROSOFT.AUTHORIZATION/CHECKACCESS - 709361 calls
So the CheckAccess API receives 709K calls per day, even when no pods are running.
Steps to Reproduce the Problem
Image 1: queue length is empty.
Image 2: number of calls is 250-300 per minute.
Logs from KEDA operator
KEDA Version
2.9.3
Kubernetes Version
1.24
Platform
Microsoft Azure
Scaler Details
Azure Service Bus
Anything else?