kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Fallback doesn't work in case of RabbitMQ connection failure #5525

Closed s-shirayama closed 1 month ago

s-shirayama commented 8 months ago

Report

If KEDA failed to connect to RabbitMQ, we expected fallback to work and for KEDA to scale the deployment to the number of pods defined in .spec.fallback.replicas, but it didn't.

Expected Behavior

I'd expect the deployment to be scaled to the number of pods defined in .spec.fallback.replicas

Actual Behavior

The number of pods was not scaled.

Steps to Reproduce the Problem

  1. Set up ScaledObject for RabbitMQ trigger with fallback configuration
  2. Update the RabbitMQ user's password on the RabbitMQ server to cause a connection failure
  3. Restart keda-operator deployment to refresh the connection

This is the ScaledObject spec:

spec:
  minReplicaCount: 1
  maxReplicaCount: 3
  fallback:
    failureThreshold: 3
    replicas: 2
  scaleTargetRef:
    name: nginx
  triggers:
  - type: rabbitmq
    metadata:
      host: amqp://{user}:{pass}@{host}:{port}/
      protocol: amqp
      queueName: hello
      mode: QueueLength
      value: "1"
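
For readers unfamiliar with the fallback block, here is a minimal Python sketch of the behaviour the report expects from this spec. It is a toy model, not KEDA's code: the class `FallbackTracker` and its methods are hypothetical, and the rule "fallback activates once consecutive failures reach failureThreshold" is an assumption for illustration; the exact comparison KEDA uses may differ.

```python
from dataclasses import dataclass


@dataclass
class FallbackTracker:
    """Toy model of the fallback block in the spec above (not KEDA's implementation)."""

    failure_threshold: int  # spec.fallback.failureThreshold
    fallback_replicas: int  # spec.fallback.replicas
    failures: int = 0       # consecutive failed metric reads

    def record(self, metric_read_ok: bool) -> None:
        # A successful read resets the counter; a failed read increments it.
        self.failures = 0 if metric_read_ok else self.failures + 1

    def in_fallback(self) -> bool:
        # Assumption: fallback activates once the threshold is reached.
        return self.failures >= self.failure_threshold

    def replicas(self, normal_replicas: int) -> int:
        # Use the fallback replica count only while fallback is active.
        return self.fallback_replicas if self.in_fallback() else normal_replicas


tracker = FallbackTracker(failure_threshold=3, fallback_replicas=2)
history = []
for ok in [True, False, False, False]:
    tracker.record(ok)
    history.append(tracker.replicas(normal_replicas=1))
print(history)
```

With the values from the spec, the model stays at the normal replica count for the first two failures and switches to 2 replicas on the third. The report is that the failure counter never moved from 0, so this switch never happened.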

It worked as expected after Step 1.

❯ k get so
NAME                    SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS   AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
rabbitmq-scaledobject   apps/v1.Deployment   nginx             1     3     rabbitmq                    True    True     False      Unknown   2m23s

❯ k get hpa
NAME                             REFERENCE          TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-rabbitmq-scaledobject   Deployment/nginx   1/1 (avg)   1         3         1          2m27s
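
The `1/1 (avg)` target above corresponds to the trigger's `value: "1"` in `QueueLength` mode. As a rough sketch, assuming the metric reaches the HPA as an average-value target, the desired replica count is approximately the queue length divided by the target, rounded up and clamped to the min and max replica counts. The function name `desired_replicas` is hypothetical, and the real HPA controller also applies tolerance and stabilization rules that are not modelled here.

```python
import math


def desired_replicas(queue_length: int, target_per_pod: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate replica count for an average-value target (simplified)."""
    raw = math.ceil(queue_length / target_per_pod)
    # Clamp to the bounds given by minReplicaCount / maxReplicaCount.
    return max(min_replicas, min(max_replicas, raw))


for length in (0, 1, 2, 10):
    print(length, desired_replicas(length, 1, min_replicas=1, max_replicas=3))
```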

After Step 3, we can see the connection error in the Kubernetes events for the ScaledObject, but "Number Of Failures" remains 0 and fallback didn't work (i.e. it didn't scale the deployment to 2 pods).

❯ k rollout restart deploy -n keda keda-operator
deployment.apps/keda-operator restarted

❯ k describe so
:
Spec:
  Fallback:
    Failure Threshold:  3
    Replicas:           2
  Max Replica Count:    3
  Min Replica Count:    1
Status:
  Conditions:
    Message:  failed to ensure HPA is correctly created for ScaledObject
    Reason:   ScaledObjectCheckFailed
    Status:   False
    Type:     Ready
    Message:  ScaledObject check failed
    Reason:   UnknownState
    Status:   Unknown
    Type:     Active
    Message:  No fallbacks are active on this scaled object
    Reason:   NoFallbackFound
    Status:   False
    Type:     Fallback
    Status:   Unknown
    Type:     Paused
  Health:
    s0-rabbitmq-hello:
      Number Of Failures:  0
      Status:              Happy
:
Events:
  Type     Reason                   Age                    From           Message
  ----     ------                   ----                   ----           -------
  Normal   KEDAScalersStarted       6m56s                  keda-operator  Started scalers watch
  Normal   ScaledObjectReady        6m56s                  keda-operator  ScaledObject is ready for scaling
  Normal   KEDAScalersStarted       4m38s (x7 over 6m56s)  keda-operator  Scaler rabbitmq is built.
  Warning  ScaledObjectCheckFailed  2s (x4 over 12s)       keda-operator  failed to ensure HPA is correctly created for ScaledObject
  Warning  KEDAScalerFailed         1s (x6 over 12s)       keda-operator  error establishing rabbitmq connection: Exception (403) Reason: "username or password not allowed"

❯ k get hpa
NAME                             REFERENCE          TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-rabbitmq-scaledobject   Deployment/nginx   1/1 (avg)   1         3         1          7m23s  

Logs from KEDA operator

2024-02-21T07:17:10Z    ERROR   Error getting scalers   {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"rabbitmq-scaledobject","namespace":"default"}, "namespace": "default", "name": "rabbitmq-scaledobject", "reconcileID": "e47e4e27-ab59-4700-b2b4-9edbb94dee7d", "error": "error establishing rabbitmq connection: Exception (403) Reason: \"username or password not allowed\""}
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).getScaledObjectMetricSpecs
    /workspace/controllers/keda/hpa.go:219
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).newHPAForScaledObject
    /workspace/controllers/keda/hpa.go:72
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).updateHPAIfNeeded
    /workspace/controllers/keda/hpa.go:150
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).ensureHPAForScaledObjectExists
    /workspace/controllers/keda/scaledobject_controller.go:464
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).reconcileScaledObject
    /workspace/controllers/keda/scaledobject_controller.go:280
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).Reconcile
    /workspace/controllers/keda/scaledobject_controller.go:191
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2024-02-21T07:17:10Z    ERROR   Failed to create new HPA resource   {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"rabbitmq-scaledobject","namespace":"default"}, "namespace": "default", "name": "rabbitmq-scaledobject", "reconcileID": "e47e4e27-ab59-4700-b2b4-9edbb94dee7d", "HPA.Namespace": "default", "HPA.Name": "keda-hpa-rabbitmq-scaledobject", "error": "error establishing rabbitmq connection: Exception (403) Reason: \"username or password not allowed\""}
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).updateHPAIfNeeded
    /workspace/controllers/keda/hpa.go:152
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).ensureHPAForScaledObjectExists
    /workspace/controllers/keda/scaledobject_controller.go:464
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).reconcileScaledObject
    /workspace/controllers/keda/scaledobject_controller.go:280
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).Reconcile
    /workspace/controllers/keda/scaledobject_controller.go:191
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2024-02-21T07:17:10Z    ERROR   failed to check HPA for possible update {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"rabbitmq-scaledobject","namespace":"default"}, "namespace": "default", "name": "rabbitmq-scaledobject", "reconcileID": "e47e4e27-ab59-4700-b2b4-9edbb94dee7d", "error": "error establishing rabbitmq connection: Exception (403) Reason: \"username or password not allowed\""}
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).ensureHPAForScaledObjectExists
    /workspace/controllers/keda/scaledobject_controller.go:466
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).reconcileScaledObject
    /workspace/controllers/keda/scaledobject_controller.go:280
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).Reconcile
    /workspace/controllers/keda/scaledobject_controller.go:191
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2024-02-21T07:17:10Z    ERROR   failed to ensure HPA is correctly created for ScaledObject  {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"rabbitmq-scaledobject","namespace":"default"}, "namespace": "default", "name": "rabbitmq-scaledobject", "reconcileID": "e47e4e27-ab59-4700-b2b4-9edbb94dee7d", "error": "error establishing rabbitmq connection: Exception (403) Reason: \"username or password not allowed\""}
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).Reconcile
    /workspace/controllers/keda/scaledobject_controller.go:193
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2024-02-21T07:17:10Z    ERROR   Reconciler error    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"rabbitmq-scaledobject","namespace":"default"}, "namespace": "default", "name": "rabbitmq-scaledobject", "reconcileID": "e47e4e27-ab59-4700-b2b4-9edbb94dee7d", "error": "error establishing rabbitmq connection: Exception (403) Reason: \"username or password not allowed\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2024-02-21T07:17:23Z    ERROR   scale_handler   error resolving auth params {"type": "ScaledObject", "namespace": "default", "name": "rabbitmq-scaledobject", "triggerIndex": 0, "error": "error establishing rabbitmq connection: Exception (403) Reason: \"username or password not allowed\""}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).buildScalers
    /workspace/pkg/scaling/scalers_builder.go:99
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).performGetScalersCache
    /workspace/pkg/scaling/scale_handler.go:357
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalersCacheForScaledObject
    /workspace/pkg/scaling/scale_handler.go:290
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).GetScaledObjectMetrics
    /workspace/pkg/scaling/scale_handler.go:429
github.com/kedacore/keda/v2/pkg/metricsservice.(*GrpcServer).GetMetrics
    /workspace/pkg/metricsservice/server.go:47
github.com/kedacore/keda/v2/pkg/metricsservice/api._MetricsService_GetMetrics_Handler
    /workspace/pkg/metricsservice/api/metrics_grpc.pb.go:99
google.golang.org/grpc.(*Server).processUnaryRPC
    /workspace/vendor/google.golang.org/grpc/server.go:1372
google.golang.org/grpc.(*Server).handleStream
    /workspace/vendor/google.golang.org/grpc/server.go:1783
google.golang.org/grpc.(*Server).serveStreams.func2.1
    /workspace/vendor/google.golang.org/grpc/server.go:1016

KEDA Version

2.13.0

Kubernetes Version

1.27

Platform

Other

Scaler Details

RabbitMQ Queue

Anything else?

No response

JorTurFer commented 7 months ago

Hello, could you confirm which KEDA version you're using? I'm trying to reproduce the issue with v2.13 and I can't:

2024-02-23T21:30:46Z    INFO    scaleexecutor   Successfully set ScaleTarget replicas count to ScaledObject fallback.replicas   {"scaledobject.Name": "rabbitmq-consumer", "scaledObject.Namespace": "default", "scaleTarget.Name": "rabbitmq-consumer", "Original Replicas Count": 1, "New Replicas Count": 2}
2024-02-23T21:30:51Z    ERROR   scale_handler   error resolving auth params {"type": "ScaledObject", "namespace": "default", "name": "rabbitmq-consumer", "triggerIndex": 0, "error": "error establishing rabbitmq connection: Exception (403) Reason: \"username or password not allowed\""}

Basically, after changing the user, the connection is closed and the fallback modifies the replica count

JorTurFer commented 7 months ago

The logs you sent show basically 2 failures (based on the timestamps), so it's normal that, with the threshold set to 3, the fallback hasn't been triggered.
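
One way to check this reading of the logs is to count the distinct timestamps on the ERROR lines. The sketch below uses a hypothetical helper (`distinct_error_timestamps` is not part of KEDA) and abbreviated copies of the ERROR lines from the report. Distinct timestamps are only a proxy for failed attempts, not the counter KEDA keeps internally.

```python
def distinct_error_timestamps(log_lines):
    """Return the sorted set of timestamps that appear on ERROR lines."""
    stamps = set()
    for line in log_lines:
        fields = line.split()
        # Operator log lines start with "<timestamp> <LEVEL> ...";
        # stack-trace lines have no level field and are skipped.
        if len(fields) >= 2 and fields[1] == "ERROR":
            stamps.add(fields[0])
    return sorted(stamps)


# Abbreviated ERROR lines from the operator logs in the report.
sample = [
    "2024-02-21T07:17:10Z    ERROR   Error getting scalers",
    "2024-02-21T07:17:10Z    ERROR   Failed to create new HPA resource",
    "2024-02-21T07:17:10Z    ERROR   Reconciler error",
    "2024-02-21T07:17:23Z    ERROR   scale_handler   error resolving auth params",
    "github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).buildScalers",
]
print(distinct_error_timestamps(sample))
```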

s-shirayama commented 7 months ago

Hi, I used v2.13 as described in the description of this issue.

> Basically, after changing the user, the connection is closed and the fallback modifies the replica count

This could be the difference from my environment.

> so it's normal that having the threshold set on 3 the fallback hasn't been triggered

Even after waiting several minutes, the status was the same, which means that "Number Of Failures" remained 0 and fallback didn't work.


As mentioned in the RabbitMQ docs,

> RabbitMQ may cache the results of access control checks on a per-connection or per-channel basis. Hence changes to user permissions may only take effect when the user reconnects.

the existing connection was kept and continued to work even after I updated the RabbitMQ user's password in the RabbitMQ admin UI. So I needed to restart the keda-operator deployment to refresh the connection.

I was running a simple rabbitmq container without a special configuration to check the behavior.

kubectl run rabbitmq --image=rabbitmq:management

I'm not so familiar with RabbitMQ yet, but do you think there is any RabbitMQ configuration that makes this difference in the behavior?

JorTurFer commented 7 months ago

> I'm not so familiar with RabbitMQ yet

Me neither, sorry :( Tomorrow I'll try again restarting KEDA as you described in the reproduction path:

1. Set up ScaledObject for RabbitMQ trigger with fallback configuration
2. Update RabbitMQ user's password on RabbitMQ Server to make connection failure
3. Restart keda-operator deployment to refresh the connection

It's true that I didn't restart KEDA and maybe the difference could be there. Thanks for confirming the version :)

s-shirayama commented 7 months ago

Hi @JorTurFer , is there any update on this?

I can provide more info if you need it, so please let me know in that case.

JorTurFer commented 7 months ago

> Hi @JorTurFer , is there any update on this?

Sorry, I tested the scenario but I didn't answer 🤦 You are right: errors during init don't trigger the fallback. I don't remember if we intentionally didn't count these errors or if it's a mistake. @zroubalik ?

s-shirayama commented 6 months ago

Hi @JorTurFer and @zroubalik, I found a similar issue with the Kafka scaler (v2.13.0) as well, so this may not be a RabbitMQ scaler-specific issue.

I expected fallback to work with these steps for the Kafka scaler, but it didn't (the same situation as described in this issue).

  1. Set up ScaledObject for Kafka trigger with fallback configuration
  2. (After HPA begins to work normally) Delete the Kafka cluster to cause a connection failure

JorTurFer commented 6 months ago

Yeah, that's what I said xD

> You are right: errors during init don't trigger the fallback.

@zroubalik , is this intended for any reason that I don't remember?

zroubalik commented 6 months ago

I think that this is a corner case we didn't expect. Normally, if a user tries to create a ScaledObject with incorrect information, it won't proceed (it won't create a scale loop) and it fails as expected. If there's a failure while KEDA is operating, the fallback should happen (we are already in the scale loop context).

But in your particular use case, where you restarted the KEDA operator during the failure, I think we end up in a situation where the KEDA operator tries to reconcile what it sees as a new ScaledObject (and start a new scale loop), so it doesn't get a chance to go into fallback (because internally we haven't started a scale loop).
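
The corner case described here can be illustrated with a toy model. This is only a sketch of the explanation above, not KEDA's actual control flow: `ToyOperator`, `build_scaler`, `reconcile`, and `poll` are hypothetical names. The assumptions are that failures are only counted inside a running scale loop, and that the loop is only started after the scaler has been built successfully.

```python
class ToyOperator:
    """Toy model: failures count only inside a started scale loop."""

    def __init__(self, failure_threshold, broker_reachable):
        self.failure_threshold = failure_threshold
        self.broker_reachable = broker_reachable
        self.loop_started = False
        self.failures = 0

    def build_scaler(self):
        # Building the scaler opens the connection, so it can fail at init.
        if not self.broker_reachable:
            raise ConnectionError("username or password not allowed")

    def reconcile(self):
        # Runs when the operator (re)starts or the ScaledObject changes.
        try:
            self.build_scaler()
        except ConnectionError:
            return False  # the error is reported, but no loop and no counter
        self.loop_started = True
        return True

    def poll(self):
        # One scale-loop tick; failures are counted only here.
        if not self.loop_started:
            return
        self.failures = 0 if self.broker_reachable else self.failures + 1

    def in_fallback(self):
        return self.failures >= self.failure_threshold


# Case 1: the broker fails while the loop is already running -> fallback.
running = ToyOperator(failure_threshold=3, broker_reachable=True)
running.reconcile()
running.broker_reachable = False
for _ in range(3):
    running.poll()

# Case 2: the operator is restarted while the broker is failing -> no fallback.
restarted = ToyOperator(failure_threshold=3, broker_reachable=False)
for _ in range(10):
    restarted.reconcile()
    restarted.poll()

print(running.in_fallback(), restarted.failures, restarted.in_fallback())
```

In the second case the failure count stays at 0 no matter how long the model runs, which matches the "Number Of Failures: 0" shown in the report.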

JorTurFer commented 6 months ago

so, should we cover it?

zroubalik commented 6 months ago

yes

stale[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

zroubalik commented 4 months ago

We should probably check the status of ScaledObject - if it has been set to fallback, we should probably proceed and create the connection.

marandalucas commented 4 months ago

@zroubalik Thank you so much for this amazing product

I just came along to say we are having the same issue. 👍 It would be awesome to have this feature working.

Is there an estimate for when it will be fixed?

thanks!

zroubalik commented 3 months ago

@marandalucas unfortunately at the moment there's nobody assigned to work on this, are you up for contributing this?

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 1 month ago

This issue has been automatically closed due to inactivity.