kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

KEDA operator fails with "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)" in v2.15.1 using Azure Event Hub trigger #6084

Open chamindac opened 4 weeks ago

chamindac commented 4 weeks ago

I have KEDA v2.15.1 deployed on AKS using workload identity. The AKS Kubernetes version is 1.29.7. My scaled job triggers based on Azure Event Hub. The KEDA operator shows the error "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)"

The setup was working fine with KEDA v2.14.2 on AKS using workload identity, on the same AKS Kubernetes version (1.29.7).

The scaled job shows the issues below:

Status:
  Conditions:
    Message:  Some triggers defined in ScaledJob are not working correctly
    Reason:   PartialTriggerError
    Status:   Unknown
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Status:   Unknown
    Type:     Fallback
    Status:   Unknown
    Type:     Paused
Events:
  Type     Reason              Age                     From           Message
  ----     ------              ----                    ----           -------
  Normal   KEDAScalersStarted  7m16s (x4 over 7m16s)   scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  7m16s                   scale-handler  Started scalers watch
  Normal   ScaledJobReady      7m16s                   keda-operator  ScaledJob is ready for scaling
  Warning  KEDAScalerFailed    7m16s (x2 over 7m16s)   scale-handler  unable to get runtimeInfo for metrics: context canceled
  Warning  KEDAScalerFailed    2m16s (x61 over 7m14s)  scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)

The KEDA operator pod log shows the following:

2024-08-16T12:57:17Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "mydemo-scaledjob", "scaledJob.Namespace": "avalanche", "Number of running Jobs": 0}
2024-08-16T12:57:17Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "mydemo-scaledjob", "scaledJob.Namespace": "avalanche", "Number of pending Jobs": 0}
2024-08-16T12:57:22Z    ERROR   scale_handler   Error getting scaler metrics and activity, but continue {"scaledJob.Name": "mydemo-scaledjob", "Scaler": "*scalers.azureEventHubScaler:", "error": "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:853
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182

If I deploy KEDA v2.14.2 or v2.14.3 on top of v2.15.1 without changing anything else in my setup, everything starts to work fine, and the status of my scaled job returns to normal, as shown below.

Status:
  Conditions:
    Message:  ScaledJob is defined correctly and is ready to scaling
    Reason:   ScaledJobReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Status:   Unknown
    Type:     Fallback
    Status:   Unknown
    Type:     Paused
Events:
  Type     Reason              Age                 From           Message
  ----     ------              ----                ----           -------
  Normal   KEDAScalersStarted  20m (x4 over 20m)   scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  20m                 scale-handler  Started scalers watch
  Normal   ScaledJobReady      20m                 keda-operator  ScaledJob is ready for scaling
  Warning  KEDAScalerFailed    20m (x2 over 20m)   scale-handler  unable to get runtimeInfo for metrics: context canceled
  Warning  KEDAScalerFailed    19m (x18 over 20m)  scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)
  Normal   KEDAScalersStarted  16m (x2 over 16m)   scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  16m                 scale-handler  Started scalers watch
  Normal   ScaledJobReady      16m                 keda-operator  ScaledJob is ready for scaling
  Normal   KEDAJobsCreated     16m                 scale-handler  Created 1 jobs
  Normal   KEDAScalersStarted  14m (x2 over 14m)   scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  14m                 scale-handler  Started scalers watch
  Normal   KEDAJobsCreated     12m (x22 over 14m)  scale-handler  Created 0 jobs

Below is more information on my setup.

I deployed KEDA using the following:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm upgrade keda kedacore/keda --install `
  --namespace keda `
  --version 2.15.1 `
  --set serviceAccount.operator.create=true `
  --set serviceAccount.operator.name=keda-operator `
  --set podIdentity.azureWorkload.enabled=true `
  --set podIdentity.azureWorkload.clientId=$(sys_aks_uai_client_id) `
  --set podIdentity.azureWorkload.tenantId=$(tenantid)

The KEDA trigger authentication is set up as follows:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: av-keda-trigger-auth
  namespace: mynamespace
spec:
  podIdentity:
    provider: azure-workload

My scaled job triggers

triggers:
    - type: azure-eventhub
      metadata:
        consumerGroup: largevideogenerator
        unprocessedEventThreshold: "1"
        activationUnprocessedEventThreshold: "0"
        blobContainer: largevideogenerator-largevideogenerationrequired
        eventHubNamespace: myeventhubnamespace
        eventHubName: largevideogenerationrequired
        storageAccountName: mystoragename
        checkpointStrategy: blobMetadata
      authenticationRef:
        name: av-keda-trigger-auth
    - type: azure-eventhub
      metadata:
        consumerGroup: largevideogenerator
        unprocessedEventThreshold: "1"
        activationUnprocessedEventThreshold: "0"
        blobContainer: largevideogenerator-regeneratelargevideo
        eventHubNamespace: myeventhubnamespace
        eventHubName: regeneratelargevideo
        storageAccountName: mystoragename
        checkpointStrategy: blobMetadata
      authenticationRef:
        name: av-keda-trigger-auth
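
For context on what these triggers ask the scaler to compute: with checkpointStrategy: blobMetadata, each partition's checkpoint is expected to live in the metadata of a blob rather than in the blob body, and the backlog is roughly the last enqueued sequence number minus the checkpointed one. The sketch below is a minimal illustration of that arithmetic, not KEDA's actual code. It assumes the metadata key is named sequencenumber (the convention used by the newer Azure SDK blob checkpoint stores), it ignores sequence-number wraparound, and every type and function name in it is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// checkpoint is a hypothetical, simplified view of a partition checkpoint.
type checkpoint struct {
	SequenceNumber int64
}

// checkpointFromMetadata parses a checkpoint out of blob metadata.
// Assumption: the sequence number is stored under the "sequencenumber" key.
func checkpointFromMetadata(md map[string]string) (checkpoint, error) {
	raw, ok := md["sequencenumber"]
	if !ok {
		return checkpoint{}, errors.New("blob metadata has no sequencenumber key")
	}
	seq, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return checkpoint{}, fmt.Errorf("invalid sequencenumber %q: %w", raw, err)
	}
	return checkpoint{SequenceNumber: seq}, nil
}

// unprocessedEvents estimates the backlog of one partition.
// It returns 0 when the checkpoint is ahead of the partition (no wraparound handling).
func unprocessedEvents(lastEnqueued int64, cp checkpoint) int64 {
	if lastEnqueued < cp.SequenceNumber {
		return 0
	}
	return lastEnqueued - cp.SequenceNumber
}

func main() {
	// Example metadata values are made up for illustration.
	cp, err := checkpointFromMetadata(map[string]string{"sequencenumber": "40", "offset": "1024"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(unprocessedEvents(42, cp)) // prints 2
}
```

If the metadata cannot be read or parsed, a scaler built this way has no backlog to report, which is consistent with the "unable to get checkpoint from storage" failure described above.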

I can provide more information and logs if required.

In summary this is what happens

chamindac commented 3 weeks ago

Is this due to a compatibility issue between the AKS Kubernetes version and the KEDA version? From the documentation here, it seems the KEDA add-on uses KEDA 2.14 with AKS Kubernetes 1.30, and KEDA 2.15 is to be used with AKS Kubernetes 1.31.

So, when we deploy KEDA to AKS without using the AKS add-on for KEDA, should we use the same KEDA version that the add-on uses for a given AKS Kubernetes version?

For now, as a workaround for my problem, I am going to stay with KEDA 2.14 until I upgrade my AKS cluster to Kubernetes 1.31, and then retry KEDA 2.15.

JorTurFer commented 3 weeks ago

Hello, I can't reproduce the issue. I've included a specific e2e test case to cover it, but it passes. This is the trigger configuration: https://github.com/kedacore/keda/blob/fc002f0739b2b7d41160c5192abd6a7fcb1db28c/tests/scalers/azure/azure_event_hub_blob_metadata_wi/azure_event_hub_blob_metadata_wi_test.go#L135-L143

Could you share the blob metadata? [image]

chamindac commented 3 weeks ago

Hi, below is the checkpoint blob metadata. [image]

JorTurFer commented 3 weeks ago

I've found that the error is handled incorrectly, and that's why you see it without any extra info. I've created a PR fixing the error handling. Are you willing to try the fixed tag? It's ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152. It's built from main, so it's almost v2.15.1. This is the change to improve the info: [image]

I think that with this change we will see extra info about the error.

chamindac commented 2 weeks ago

@JorTurFer thank you for the response. I have moved on to using the managed KEDA add-on for AKS, so I am currently on AKS with Kubernetes 1.30.3 and KEDA 2.14.

However, I will try to create a test environment, test the fixed version of KEDA, and get back to you.

chamindac commented 1 week ago

@JorTurFer I tried deploying ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152 using keda-2.15.1.yaml, changing the KEDA operator image as shown below:

image: ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152 # ghcr.io/kedacore/keda:2.15.1 # chaminda
imagePullPolicy: Always

The KEDA operator goes into CrashLoopBackOff with the following in the keda-operator pod logs:

2024/09/02 09:07:25 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
2024-09-02T09:07:25Z    INFO    setup   Starting manager
2024-09-02T09:07:25Z    INFO    setup   KEDA Version: pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152
2024-09-02T09:07:25Z    INFO    setup   Git Commit: 4776d09c8fd761814c1eb9ba7e964ceace651152
2024-09-02T09:07:25Z    INFO    setup   Go Version: go1.22.5
2024-09-02T09:07:25Z    INFO    setup   Go OS/Arch: linux/amd64
2024-09-02T09:07:25Z    INFO    setup   Running on Kubernetes 1.30      {"version": "v1.30.3"}
2024-09-02T09:07:26Z    INFO    controller-runtime.metrics      Starting metrics server
2024-09-02T09:07:26Z    INFO    controller-runtime.metrics      Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-09-02T09:07:26Z    INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
I0902 09:07:26.032037       1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
I0902 09:07:41.268027       1 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "source": "kind source: *v1alpha1.CloudEventSource"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "clustercloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "ClusterCloudEventSource", "source": "kind source: *v1alpha1.ClusterCloudEventSource"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "clustercloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "ClusterCloudEventSource"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "cert-rotator"}
2024-09-02T09:07:41Z    INFO    cert-rotation   starting cert rotator controller
2024-09-02T09:07:41Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:07:41Z    INFO    cert-rotation   no cert refresh needed
2024-09-02T09:07:41Z    INFO    cert-rotation   certs are ready in /certs
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "cert-rotator", "worker count": 1}
2024-09-02T09:07:41Z    INFO    cert-rotation   no cert refresh needed
2024-09-02T09:07:41Z    INFO    cert-rotation   Ensuring CA cert        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-02T09:07:41Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-02T09:07:41Z    INFO    cert-rotation   no cert refresh needed
2024-09-02T09:07:41Z    INFO    cert-rotation   Ensuring CA cert        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-02T09:07:41Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "worker count": 1}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
2024-09-02T09:07:42Z    INFO    cert-rotation   CA certs are injected to webhooks
2024-09-02T09:07:42Z    INFO    grpc_server     Starting Metrics Service gRPC Server    {"address": ":9666"}
2024-09-02T09:07:51Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:01Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:11Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:21Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:31Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:41Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:51Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:01Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:11Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:21Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:31Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:41Z    ERROR   Could not wait for Cache to sync        {"controller": "clustercloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "ClusterCloudEventSource", "error": "failed to wait for clustercloudeventsource caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.ClusterCloudEventSource"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:203
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
2024-09-02T09:09:41Z    INFO    Stopping and waiting for non leader election runnables
2024-09-02T09:09:41Z    INFO    Stopping and waiting for leader election runnables
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "cert-rotator"}
2024-09-02T09:09:41Z    INFO    cert-rotation   stopping cert rotator controller
W0902 09:09:41.269909       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.269969       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of apiregistration.k8s.io/v1, Kind=APIService ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270024       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "cert-rotator"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-09-02T09:09:41Z    INFO    Stopping and waiting for caches
W0902 09:09:41.270185       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ClusterTriggerAuthentication ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270224       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.CloudEventSource ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270262       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ScaledJob ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270297       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ScaledObject ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270339       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.TriggerAuthentication ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270400       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v2.HorizontalPodAutoscaler ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2024-09-02T09:09:41Z    INFO    Stopping and waiting for webhooks
2024-09-02T09:09:41Z    INFO    Stopping and waiting for HTTP servers
2024-09-02T09:09:41Z    INFO    shutting down server    {"kind": "health probe", "addr": "[::]:8081"}
2024-09-02T09:09:41Z    INFO    controller-runtime.metrics      Shutting down metrics server with timeout of 1 minute
2024-09-02T09:09:41Z    INFO    Wait completed, proceeding to shutdown the manager
2024-09-02T09:09:41Z    ERROR   setup   problem running manager {"error": "failed to wait for clustercloudeventsource caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.ClusterCloudEventSource"}
main.main
        /workspace/cmd/operator/main.go:329
runtime.main
        /usr/local/go/src/runtime/proc.go:271
JorTurFer commented 1 week ago

Oh, sorry, we introduced a new CRD (it will ship with v2.16), and you need to deploy it into the cluster too -> https://github.com/kedacore/keda/blob/main/config/crd/bases/eventing.keda.sh_clustercloudeventsources.yaml It's for the CloudEvent integration, so it probably doesn't matter in your case xD

chamindac commented 1 week ago

@JorTurFer with the CRD deployed, the KEDA operator now seems to need some additional permissions.

"system:serviceaccount:keda:keda-operator" is the service account I am using for enabling workload identity. With this CRD, does the workload identity require any additional permissions on Azure resources or on the AKS cluster?


2024-09-03T09:51:19Z    INFO    cert-rotation   Ensuring CA cert        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-03T09:51:19Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-03T09:51:19Z    INFO    cert-rotation   no cert refresh needed
2024-09-03T09:51:19Z    INFO    cert-rotation   Ensuring CA cert        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-03T09:51:19Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-03T09:51:20Z    INFO    cert-rotation   CA certs are injected to webhooks
2024-09-03T09:51:20Z    INFO    grpc_server     Starting Metrics Service gRPC Server    {"address": ":9666"}
W0903 09:51:20.604804       1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
E0903 09:51:20.604845       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch *v1alpha1.ClusterCloudEventSource: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
W0903 09:51:22.747786       1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
E0903 09:51:22.747830       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch *v1alpha1.ClusterCloudEventSource: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
W0903 09:51:26.531065       1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
E0903 09:51:26.531111       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch *v1alpha1.ClusterCloudEventSource: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
JorTurFer commented 1 week ago

Yes, I forgot it, sorry. These permissions have to be added to KEDA's ClusterRole, as it needs to read the new CRD.

image
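
For readers following along, here is a minimal sketch of the kind of ClusterRole rule this refers to. It is based only on the denied request in the operator log above (`list` on `clustercloudeventsources` in API group `eventing.keda.sh`, at cluster scope). The `get` and `watch` verbs are an assumption: informers normally list and then watch a resource. The exact rule shipped with KEDA may differ.

```yaml
# Assumed addition to the rules of the ClusterRole bound to
# system:serviceaccount:keda:keda-operator (not copied from KEDA's manifests).
- apiGroups:
  - eventing.keda.sh
  resources:
  - clustercloudeventsources
  verbs:
  - get    # assumption
  - list   # the verb the log shows as forbidden
  - watch  # assumption: needed by the informer after the initial list
```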

chamindac commented 1 week ago

@JorTurFer With the changes you mentioned above, I managed to run the KEDA operator with your tag ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152 using keda-2.15.1.yaml

The issue seems to be that, with 2.15.1, the keda-operator and the Event Hub trigger look for a non-existent checkpoint blob.

For example, here are the current checkpoint blobs for my two scaled jobs.

largepreview-scaledjob: there is no checkpoint/7 blob, but the scaled job trigger and the KEDA operator are looking for such a blob.

image

As per the scaled job's events, it is looking for checkpoint blob 7:

Events:
  Type     Reason              Age                From           Message
  ----     ------              ----               ----           -------
  Normal   KEDAScalersStarted  25m (x4 over 25m)  scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  25m                scale-handler  Started scalers watch
  Normal   ScaledJobReady      25m                keda-operator  ScaledJob is ready for scaling
  Warning  KEDAScalerFailed    25m (x2 over 25m)  scale-handler  unable to get runtimeInfo for metrics: context canceled
  Warning  KEDAScalerFailed    19m                scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largepreviewgenerator-largepreviewrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largepreviewrequired/largepreviewgenerator/checkpoint/7
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:56c49771-901e-0026-5c78-ff2f40000000
Time:2024-09-05T09:44:10.8104812Z</Message></Error>
--------------------------------------------------------------------------------
  Warning  KEDAScalerFailed  19m  scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largepreviewgenerator-largepreviewrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largepreviewrequired/largepreviewgenerator/checkpoint/7

largevideo-scaledjob: there is no checkpoint/0 blob, but the scaled job trigger and the KEDA operator are looking for such a blob.

image

As per the scaled job's events, it is looking for checkpoint blob 0:

Events:
  Type     Reason              Age                From           Message
  ----     ------              ----               ----           -------
  Normal   KEDAScalersStarted  34m (x4 over 34m)  scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  34m                scale-handler  Started scalers watch
  Normal   ScaledJobReady      34m                keda-operator  ScaledJob is ready for scaling
  Warning  KEDAScalerFailed    34m (x2 over 34m)  scale-handler  unable to get runtimeInfo for metrics: context canceled
  Warning  KEDAScalerFailed    27m                scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largevideogenerator-largevideogenerationrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largevideogenerationrequired/largevideogenerator/checkpoint/0
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:36a080bb-701e-0063-5178-fffaa3000000
Time:2024-09-05T09:44:47.6370308Z</Message></Error>
--------------------------------------------------------------------------------
  Warning  KEDAScalerFailed  26m  scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largevideogenerator-largevideogenerationrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largevideogenerationrequired/largevideogenerator/checkpoint/0
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
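
To make the failing requests easier to read, here is a small Go sketch that rebuilds the blob URL from its parts. The layout (container, then namespace FQDN, event hub, consumer group, `checkpoint`, partition ID) is inferred only from the URLs in the events above. Which segment is the event hub and which is the consumer group is an assumption, and `checkpointBlobURL` is a hypothetical helper, not a KEDA function.

```go
package main

import (
	"fmt"
	"strings"
)

// checkpointBlobURL rebuilds the checkpoint blob URL seen in the events above.
// The segment order is inferred from the logged URL only; it is not taken
// from KEDA's source code.
func checkpointBlobURL(account, container, namespaceFQDN, eventHub, consumerGroup, partitionID string) string {
	blobPath := strings.Join(
		[]string{namespaceFQDN, eventHub, consumerGroup, "checkpoint", partitionID},
		"/",
	)
	return fmt.Sprintf("https://%s.blob.core.windows.net/%s/%s", account, container, blobPath)
}

func main() {
	// Values copied from the largepreview-scaledjob event (partition 7).
	fmt.Println(checkpointBlobURL(
		"myehnstoragename",
		"largepreviewgenerator-largepreviewrequired",
		"ch-eh-dev-euw-001-2-green.servicebus.windows.net",
		"largepreviewrequired",
		"largepreviewgenerator",
		"7",
	))
}
```

Read this way, the last path segment is the partition ID, so each 404 means that no checkpoint blob has been written yet for that particular partition (7 and 0 here).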

Both scaled jobs show the same symptoms only with 2.15.1, failing while looking for a non-existent checkpoint blob. The KEDA operator (with tag ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152) shows the logs below for the scaled jobs, again looking for non-existent blobs.

2024-09-05T09:50:22Z    ERROR   scale_handler   Error getting scaler metrics and activity, but continue {"scaledJob.Name": "largevideo-scaledjob", "Scaler": "*scalers.azureEventHubScaler:", "error": "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largevideogenerator-largevideogenerationrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largevideogenerationrequired/largevideogenerator/checkpoint/0\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 The specified blob does not exist.\nERROR CODE: BlobNotFound\n--------------------------------------------------------------------------------\n<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.\nRequestId:dfd7075d-101e-0007-1779-ff0b3b000000\nTime:2024-09-05T09:50:22.6388386Z</Message></Error>\n--------------------------------------------------------------------------------\n"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:853
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
2024-09-05T09:50:22Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "largevideo-scaledjob", "scaledJob.Namespace": "mynamespace", "Number of running Jobs": 0}
2024-09-05T09:50:22Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "largevideo-scaledjob", "scaledJob.Namespace": "mynamespace", "Number of pending Jobs": 0}
2024-09-05T09:50:25Z    ERROR   scale_handler   Error getting scaler metrics and activity, but continue {"scaledJob.Name": "largepreview-scaledjob", "Scaler": "*scalers.azureEventHubScaler:", "error": "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largepreviewgenerator-largepreviewrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largepreviewrequired/largepreviewgenerator/checkpoint/7\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 The specified blob does not exist.\nERROR CODE: BlobNotFound\n--------------------------------------------------------------------------------\n<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.\nRequestId:7f71945f-f01e-0052-6279-ff1bb0000000\nTime:2024-09-05T09:50:25.6750071Z</Message></Error>\n--------------------------------------------------------------------------------\n"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:853
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
2024-09-05T09:50:25Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "largepreview-scaledjob", "scaledJob.Namespace": "mynamespace", "Number of running Jobs": 0}
2024-09-05T09:50:25Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "largepreview-scaledjob", "scaledJob.Namespace": "mynamespace", "Number of pending Jobs": 0}
2024-09-05T09:50:27Z    ERROR   scale_handler   Error getting scaler metrics and activity, but continue {"scaledJob.Name": "largevideo-scaledjob", "Scaler": "*scalers.azureEventHubScaler:", "error": "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largevideogenerator-largevideogenerationrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largevideogenerationrequired/largevideogenerator/checkpoint/0\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 The specified blob does not exist.\nERROR CODE: BlobNotFound\n--------------------------------------------------------------------------------\n<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.\nRequestId:ba2246b8-801e-0005-6179-ffb583000000\nTime:2024-09-05T09:50:27.6339979Z</Message></Error>\n--------------------------------------------------------------------------------\n"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:853
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182

When I deploy KEDA with keda-2.14.1.yaml (or the 2.14.2 or 2.14.3 Helm chart) and everything else (triggers, scaled job settings) configured exactly the same way, there are no such log entries about looking for unavailable checkpoint blobs. My scaled jobs work as expected, without any issues, with the exact same setup on KEDA 2.14.x.

The issue above only happens with 2.15.1.

I suspect KEDA 2.15.1 is not refreshing the storage checkpoint blob list correctly before checking the checkpoint blob metadata, while KEDA 2.14.x does not have this problem.

JorTurFer commented 2 days ago

We introduced a bug when we upgraded the SDK, but I think this PR will solve the issue -> https://github.com/kedacore/keda/pull/6096

Are you willing to test the fix? This is the tag with the fix -> ghcr.io/kedacore/keda-test:pr-6096-9b8be4868a27c304646cf8cb0735357eb272bd38
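
As background on the original symptom in this issue's title: `%!w(<nil>)` is what Go's `fmt.Errorf` prints when the value passed to `%w` is a nil error. The sketch below only reproduces that message shape. `wrapCheckpointErr` and `errBlobNotFound` are hypothetical names for illustration, not KEDA's code, and this does not describe what PR #6096 changes.

```go
package main

import (
	"errors"
	"fmt"
)

// errBlobNotFound is a stand-in cause used only for this illustration.
var errBlobNotFound = errors.New("404 BlobNotFound")

// wrapCheckpointErr mimics the message shape from this issue.
// Hypothetical helper; not taken from KEDA.
func wrapCheckpointErr(err error) error {
	return fmt.Errorf("unable to get checkpoint from storage: %w", err)
}

func main() {
	// Wrapping a nil error still yields a non-nil error whose text ends in
	// "%!w(<nil>)", so the real cause is lost.
	fmt.Println(wrapCheckpointErr(nil))

	// Wrapping a real error keeps the cause in the message.
	fmt.Println(wrapCheckpointErr(errBlobNotFound))
}
```

This is consistent with the test image reporting the full 404 BlobNotFound response in the same place where v2.15.1 printed `%!w(<nil>)`.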

chamindac commented 1 day ago

@JorTurFer PR #6096 seems to have fixed the issue. I deployed ghcr.io/kedacore/keda-test:pr-6096-9b8be4868a27c304646cf8cb0735357eb272bd38 to my environment with keda-2.15.1.yaml, and the Event Hub scaler has worked as expected for the last 24 hours.

Will you release a fixed version for 2.15.x, or will this issue be fixed only in 2.16.x? I just want to know whether we have to skip 2.15.x (it is impossible to use with this issue) and wait for 2.16.x.