kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Azure Service Bus Scaler Sends GetAccess Requests on every getQueueLength message #4262

Closed tshaiman closed 1 year ago

tshaiman commented 1 year ago

The KEDA Service Bus scaler keeps issuing ARM GetAccess requests for the Service Bus -> the AdminClient is not reused.

We saw a significant increase in GetAccess calls to the Azure ARM endpoint, which verify that the caller has access to the Service Bus.

  1. This happens even when all the pods are scaled to zero.
  2. We have 16 queues across 8 environments with a polling interval of 30 seconds: 16 × 8 = 128 scalers, each polling twice per minute, which is around 256 calls per minute.
  3. Looking at the KEDA Service Bus scaler code, we see the following:

azure_servicebus_scaler.go

func (s *azureServiceBusScaler) getAzureServiceBusLength(ctx context.Context) (int64, error) {
    // get adminClient
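    // NOTE: this returns a brand-new client on every poll, because the
    // created client is never stored back on s (see below)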
    adminClient, err := s.getServiceBusAdminClient()
    if err != nil {
        return -1, err
    }
    // switch case for queue vs topic here
    switch s.metadata.entityType {
    case queue:
        return getQueueLength(ctx, adminClient, s.metadata)
    case subscription:
        return getSubscriptionLength(ctx, adminClient, s.metadata)
    default:
        return -1, fmt.Errorf("no entity type")
    }
}

A Service Bus admin client is created, but it is never assigned to any field on "s" (azureServiceBusScaler):

func (s *azureServiceBusScaler) getServiceBusAdminClient() (*admin.Client, error) {
    if s.client != nil {
        return s.client, nil
    }
    // ... (client creation elided)
}

So each time the code checks whether s.client is initialized, it is always nil, and hence a new client is created over and over again.

Expected Behavior

The admin client is created once per scaler and reused across polling intervals, so ARM CheckAccess is not called on every poll.

Actual Behavior

HttpIncomingRequests
| where TIMESTAMP > ago(1d)
| where subscriptionId == "24237b72-8546-4da5-b204-8c3cb76dd930"
| summarize count() by operationName
| order by count_ desc

POST/SUBSCRIPTIONS/RESOURCEGROUPS/PROVIDERS/MICROSOFT.SERVICEBUS/NAMESPACES/QUEUES/PROVIDERS/MICROSOFT.AUTHORIZATION/CHECKACCESS - 709361 calls

So the CheckAccess API receives ~709K calls per day, even when no pods are running.

Steps to Reproduce the Problem

  1. Run KEDA with at least 160 queues and a polling interval of 30 seconds, or use fewer queues with a lower polling interval.
  2. Go to the Azure Portal and view the number of requests on the Service Bus queue:

Image 1: the queue length is empty. (screenshot)

Image 2: the number of calls is 250-300 per minute. (screenshot)

  3. Ensure all calls are POST/SUBSCRIPTIONS/RESOURCEGROUPS/PROVIDERS/MICROSOFT.SERVICEBUS/NAMESPACES/QUEUES/PROVIDERS/MICROSOFT.AUTHORIZATION/CHECKACCESS.

Logs from KEDA operator

No errors.

KEDA Version

2.9.3

Kubernetes Version

1.24

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

JorTurFer commented 1 year ago

Hello! Nice catch! KEDA queries the queues even if there isn't any pod, but the client should definitely be reused in the scaler.

Are you willing to contribute the fix?

I think that a small refactor of this function could be enough. Something like this:

func (s *azureServiceBusScaler) getServiceBusAdminClient() (*admin.Client, error) {
    if s.client != nil {
        return s.client, nil
    }
    var err error
    var client *admin.Client
    switch s.podIdentity.Provider {
    case "", kedav1alpha1.PodIdentityProviderNone:
        client, err = admin.NewClientFromConnectionString(s.metadata.connection, nil)
    case kedav1alpha1.PodIdentityProviderAzure, kedav1alpha1.PodIdentityProviderAzureWorkload:
        creds, chainedErr := azure.NewChainedCredential(s.podIdentity.IdentityID, s.podIdentity.Provider)
        if chainedErr != nil {
            return nil, chainedErr
        }
        client, err = admin.NewClient(s.metadata.fullyQualifiedNamespace, creds, nil)
    default:
        err = fmt.Errorf("incorrect podIdentity type")
    }

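    // cache the client on the scaler so subsequent polls reuse it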
    s.client = client
    return client, err
}

zroubalik commented 1 year ago

Great find!

xoanmm commented 1 year ago

Hey,

I want to tackle this issue

tshaiman commented 1 year ago

oh man, you are quick, I was just about to submit a PR 😂

you snooze, you lose

tshaiman commented 1 year ago

@JorTurFer / @tomkerkhove / @xoanmm: I'm afraid that the fix in https://github.com/kedacore/keda/pull/4273 is now causing another issue.

I took the fix for a test drive (built a custom Docker image), and after 1 hour I started getting a pod identity exception. I think this is because Azure tokens are valid for 1 hour, and the code does not take this into account but keeps using the cached admin client.
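For context, a token-aware credential avoids exactly this problem: it hands out the cached token until shortly before expiry and then fetches a fresh one. A minimal sketch, assuming the azcore.TokenCredential interface; the wrapper type and the 5-minute margin are illustrative, not KEDA's actual implementation:

import (
    "context"
    "sync"
    "time"

    "github.com/Azure/azure-sdk-for-go/sdk/azcore"
    "github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
)

// refreshingCredential wraps another credential, caching its token and
// requesting a new one shortly before the cached token expires.
type refreshingCredential struct {
    inner azcore.TokenCredential
    mu    sync.Mutex
    tok   azcore.AccessToken
}

func (c *refreshingCredential) GetToken(ctx context.Context, opts policy.TokenRequestOptions) (azcore.AccessToken, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    // reuse the cached token while it is still valid for at least 5 minutes
    if time.Until(c.tok.ExpiresOn) > 5*time.Minute {
        return c.tok, nil
    }
    tok, err := c.inner.GetToken(ctx, opts)
    if err != nil {
        return azcore.AccessToken{}, err
    }
    c.tok = tok
    return tok, nil
}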

Peeking at the identity code in KEDA, and knowing how the Azure Identity library works, I have to say this entire identity issue needs to be discussed.

Having said that, the fix is not working for a pod that is long-lived.

JorTurFer commented 1 year ago

What error do you see? In theory, the Service Bus SDK should call GetToken when the token has expired. I mean, maybe there is another error, but I think that this implementation is correct, as we should cache the client instead of creating a new one each time. KEDA also recreates the scaler when there is any error getting the metrics, so even if the client is cached in the scaler, the whole scaler is regenerated on any error. I think that the error must be somewhere else.
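To illustrate the intended pattern: the client is built once against a credential, and the SDK's bearer-token policy calls GetToken again whenever the cached token expires, so a long-lived client keeps authenticating without being recreated. A minimal sketch, assuming azidentity and the azservicebus admin package; the namespace and queue names are placeholders:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/Azure/azure-sdk-for-go/sdk/azidentity"
    "github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus/admin"
)

func main() {
    cred, err := azidentity.NewDefaultAzureCredential(nil)
    if err != nil {
        log.Fatal(err)
    }
    // build the client once; its token policy refreshes the token as needed
    client, err := admin.NewClient("my-namespace.servicebus.windows.net", cred, nil)
    if err != nil {
        log.Fatal(err)
    }
    // reuse the same client across polling intervals, even past token expiry
    for {
        props, err := client.GetQueueRuntimeProperties(context.Background(), "my-queue", nil)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("active messages:", props.ActiveMessageCount)
        time.Sleep(30 * time.Second)
    }
}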

tshaiman commented 1 year ago

I see the same old, most common error: KEDA does not have a valid token to use. I do agree that we should cache the client, but I was looking at how tokens are requested and used in KEDA, which made me rethink the entire issue we see here (and the infamous workload identity bugs we saw a few months ago). I started a discussion here: https://github.com/kedacore/keda/discussions/4279

I agree that the SDK needs to call GetToken behind the scenes, but having looked at the custom identity implementation/wrapper in KEDA, as described in the discussion, I'm not sure this will actually happen, since the well-defined contract of the Azure Identity library is not being met in this case.

JorTurFer commented 1 year ago

I'm deploying that change into my cluster right now; let's wait until tomorrow to check how it looks. What are you using, Workload Identity or AAD Pod Identity?

tshaiman commented 1 year ago

pod-identity, as workload identity had a bug related to its webhook a few months ago

JorTurFer commented 1 year ago

I don't see any error for the moment (76 min). (screenshot)

I'll keep both apps (aad-pod-identity and workload identity) deployed for some days to check if there is any error.

JorTurFer commented 1 year ago

Nothing yet (125 min). Let's see after 12-24 hours. (screenshot)

tshaiman commented 1 year ago

What I did in my test was to apply only the fixed code (saving the admin client) onto v2.9.3; that could be insufficient, given more fixes merged to master.

tshaiman commented 1 year ago

Nice tool!

Is this Lens?

JorTurFer commented 1 year ago

is this Lens?

Yes, it's OpenLens (an open-source fork of Lens). It isn't perfect and sometimes I still need kubectl, but for the majority of cases it works really nicely.

JorTurFer commented 1 year ago

Same behavior after 11 hours. I guess that we need to wait 12-24 hours to ensure that the token is regenerated. (screenshot)

BTW, do you see any error in the aad-pod-identity pods? As it's an authentication proxy, maybe there is an error there that gives more info about the problem.

tshaiman commented 1 year ago

Is it 2.9.3 with the fix, or is it a 2.9.4 candidate? What I did is apply the fix for this bug to 2.9.3, which may be incorrect. Can I get an image tag to test it?

JorTurFer commented 1 year ago

I am using the 'main' tag because it's the only one that contains the fix for the moment (the 'main' tag contains the next version's changes).

Be careful trying that image, because it contains the certificate management (a new feature introduced to improve security), so you can't just update the tag; there are required changes in the RBAC to allow this management. The other option is to manage the certificates on your side, but that's more complicated. The next version's docs explain it: https://keda.sh/docs/2.10/operate/security/#use-your-own-tls-certificates. We will also release a blog post explaining how to do it with cert-manager after the release.

tshaiman commented 1 year ago

OK, that is too much for now indeed. I will just save the auth client on my 2.9.3 tag and rebuild with 'make container-build'.

I don't want you to spend time on this; if it works after 12 hours, most likely I didn't build correctly.

Will update shortly.

JorTurFer commented 1 year ago

21 hours without any error. (screenshot)

Tomorrow morning I'll check if there is any error after 24 hours. I guess that if this still works tomorrow, the access token renewal is done correctly and I can close the issue again 😄 I created 2 ScaledObjects, one for aad-pod-identity and another for aad-workload-identity, so both are covered by this test.

tshaiman commented 1 year ago

@JorTurFer: I conducted my tests as well and confirmed it is working correctly. The problem was running against "workload-identity", which is still not completely fixed by the workload identity team with regard to the webhook and its policy.

Thanks for all your efforts and clarifications!

JorTurFer commented 1 year ago

Nice!