Keda Operator - invalid memory address or nil pointer dereference during large scaler failures

jark-AB commented 1 month ago

Report

KEDA Version: 2.15.1 Running on Kubernetes 1.29+ {"version": "v1.29.7-eks-a18cd3a"}

I am running into a transient issue with a large number of ScaledObjects (464) spread across many namespaces. We use the metricsAPI scaler to poll an API pod in each namespace and scale dynamically off its endpoint.

This works extremely well, even at scale. However, we noticed that when many ScaledObjects fail at once (for example during a networking or control plane issue), the operator hits a nil pointer dereference after logging many variations of:

 keda-operator-548d6df695-qfzgn keda-operator ERROR scale_handler   error getting scale decision    {"scaledObject.Namespace": "se-mcondon", "scaledObject.Name": "worker-powerbi-scaler", "scaler": "metricsAPIScaler", "error": "error requesting metrics endpoint: Get \"http://api.se-mcondon.svc.cluster.local:9001/api/v1/metrics\": dial tcp 10.100.28.160:9001: connect: connection refused"}

The namespace can be any of the 150 namespaces. The KEDA operator then restarts, and just before the restart it logs the nil pointer dereference shown under "Logs from KEDA operator" below.

Expected Behavior

I'd expect the operator not to hit a nil pointer dereference when a large number of scalers fail or hit transient issues.

Actual Behavior

The operator cannot get a scale decision for scalers spread across 500 ScaledObjects, hits a nil pointer panic, and restarts.

Steps to Reproduce the Problem

Deploy Keda Operator
Launch 500 scaledObjects
Create a situation in which the API endpoint that the scaler polls refuses connections or otherwise causes errors getting the scale decision (for example, by mass-deleting the API pods).
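
For anyone trying to reproduce the per-scaler error outside a cluster, here is a minimal Go sketch of the failure condition in the last step: an HTTP poll against an address where nothing is listening. It does not reproduce the operator panic itself. The helper names `closedEndpoint` and `pollOnce` are made up for this illustration and are not KEDA code; only the `/api/v1/metrics` path and the error prefix are taken from the log line above.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// closedEndpoint (hypothetical helper) reserves a local port, closes the
// listener so nothing is listening there any more, and returns a URL for it.
func closedEndpoint() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	addr := ln.Addr().String()
	if err := ln.Close(); err != nil {
		return "", err
	}
	return "http://" + addr + "/api/v1/metrics", nil
}

// pollOnce (hypothetical helper) performs a single GET with a short timeout,
// standing in for one polling attempt against the metrics endpoint.
func pollOnce(url string) error {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("error requesting metrics endpoint: %w", err)
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	url, err := closedEndpoint()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	// The dial fails because no process is accepting connections on the port.
	fmt.Println(pollOnce(url))
}
```

In a real cluster the equivalent condition is the API pods being deleted or unreachable while the ScaledObjects keep polling them.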

Logs from KEDA operator

Sep 18 14:25:26 keda-operator-548d6df695-qfzgn keda-operator error panic: runtime error: invalid memory address or nil pointer dereference
Sep 18 14:25:26 keda-operator-548d6df695-qfzgn keda-operator [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1efa392]
Sep 18 14:25:26 keda-operator-548d6df695-qfzgn keda-operator goroutine 3144 [running]:
Sep 18 14:25:26 keda-operator-548d6df695-qfzgn keda-operator github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).updateScaleOnScaleTarget(0xc000c3c190?, {0x4a92e40, 0xc00a599c20}, 0xc00aa2d8c8, 0xc01b37fd36?, 0x1?)
    /workspace/pkg/scaling/executor/scale_scaledobjects.go:349 +0xb2
Sep 18 14:25:26 keda-operator-548d6df695-qfzgn keda-operator github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).doFallbackScaling(0xc000c3c190, {0x4a92e40, 0xc00a599c20}, 0xc00aa2d8c8, 0x4a92e40?, {{0x4a9cdf8?, 0xc00ac34ae0?}, 0x43ec18c?}, 0x1)
    /workspace/pkg/scaling/executor/scale_scaledobjects.go:234 +0x76
Sep 18 14:25:26 keda-operator-548d6df695-qfzgn keda-operator github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).RequestScale(0xc000c3c190, {0x4a92e40, 0xc00a599c20}, 0xc00aa2d8c8, 0x0, 0x1, 0xc00a9f4090)
    /workspace/pkg/scaling/executor/scale_scaledobjects.go:169 +0xee5
Sep 18 14:25:26 keda-operator-548d6df695-qfzgn keda-operator github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers(0xc000661340, {0x4a92e40, 0xc00a599c20}, {0x4337560, 0xc00aa2d8c8}, {0x4a67b08, 0xc00aa467c0})
    /workspace/pkg/scaling/scale_handler.go:249 +0x455
Sep 18 14:25:26 keda-operator-548d6df695-qfzgn keda-operator github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop(0xc000661340, {0x4a92e40, 0xc00a599c20}, 0xc00aa243c0, {0x4337560, 0xc00aa2d8c8}, {0x4a67b08, 0xc00aa467c0}, 0x1)
    /workspace/pkg/scaling/scale_handler.go:182 +0x3eb
Sep 18 14:25:26 keda-operator-548d6df695-qfzgn keda-operator created by github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).HandleScalableObject in goroutine 413
    /workspace/pkg/scaling/scale_handler.go:128 +0x4ce
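
The trace above ends in a nil pointer dereference on an error-handling path. As a generic illustration of that class of bug (not the actual KEDA code, whose internals are not shown in this report), the Go sketch below has a lookup that returns a nil pointer together with an error, and a consumer that checks for nil before dereferencing. The type `scale` and the functions `getScale` and `updateScale` are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// scale is a hypothetical stand-in for a workload's scale information.
type scale struct {
	Replicas int32
}

// getScale is a hypothetical lookup that returns a nil pointer on failure.
func getScale(fail bool) (*scale, error) {
	if fail {
		return nil, errors.New("connection refused")
	}
	return &scale{Replicas: 1}, nil
}

// updateScale sets the replica count but refuses to touch a nil pointer.
func updateScale(s *scale, replicas int32) error {
	if s == nil {
		return errors.New("scale is nil, skipping update")
	}
	s.Replicas = replicas
	return nil
}

func main() {
	s, err := getScale(true)
	if err != nil {
		fmt.Println("lookup failed:", err)
	}
	// Without the nil check inside updateScale, writing s.Replicas here would
	// panic with "invalid memory address or nil pointer dereference".
	if err := updateScale(s, 3); err != nil {
		fmt.Println("update skipped:", err)
	}
}
```

Whether the operator's failure follows this exact pattern is an assumption. The sketch only shows why a value obtained on a failing path needs a nil check before use.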

KEDA Version

2.15.1

Kubernetes Version

1.29

Platform

Amazon Web Services

Scaler Details

MetricsAPI

Anything else?

No response

wozniakjan commented 3 weeks ago

Today is the bi-weekly community call. I will add this to the agenda; you are welcome to attend as well :)

tonylee-shopback commented 3 weeks ago

I encountered a similar issue in 2.13.1:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x14a5b20]

goroutine 1133 [running]:
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).RequestScale(0x4000a3b400, {0x41d8d78, 0x401c3c0cd0}, 0x401c390e00, 0x1, 0x0)
    /workspace/pkg/scaling/executor/scale_scaledobjects.go:39 +0x140
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers(0x40009705b0, {0x41d8d78, 0x401c3c0cd0}, {0x3ab0040?, 0x401c390e00?}, {0x41ae5d0, 0x401c7ce110})
    /workspace/pkg/scaling/scale_handler.go:249 +0x334
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop(0x4015aaecc0?, {0x41d8d78, 0x401c3c0cd0}, 0x401c7de000, {0x3ab0040, 0x401c390e00}, {0x41ae5d0, 0x401c7ce110}, 0xa0?)
    /workspace/pkg/scaling/scale_handler.go:182 +0x374
created by github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).HandleScalableObject in goroutine 341
    /workspace/pkg/scaling/scale_handler.go:128 +0x468

wozniakjan commented 2 weeks ago

This was briefly discussed last week; it is possibly an issue with trigger cache invalidation. It may take some time before it gets fixed.
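
To make the "cache invalidation" hypothesis concrete, here is a purely illustrative Go sketch of how a reader can cope with a cache entry that another goroutine has dropped. Nothing here is KEDA's actual cache: the `cache` and `entry` types and the `scalerCount` function are assumptions made up for the example, and the key simply reuses the namespace and name from the log line in the report.

```go
package main

import (
	"fmt"
	"sync"
)

// entry is a hypothetical cached value built for one scaled object.
type entry struct {
	Scalers []string
}

// cache is a hypothetical mutex-guarded map keyed by "namespace/name".
type cache struct {
	mu      sync.RWMutex
	entries map[string]*entry
}

func newCache() *cache {
	return &cache{entries: make(map[string]*entry)}
}

// Put stores or replaces the entry for key.
func (c *cache) Put(key string, e *entry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = e
}

// Invalidate drops the entry for key.
func (c *cache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, key)
}

// Get reports whether the entry exists, so callers can tell a miss from a hit.
func (c *cache) Get(key string) (*entry, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[key]
	return e, ok
}

// scalerCount tolerates an entry that was invalidated by another goroutine.
func scalerCount(c *cache, key string) (int, error) {
	e, ok := c.Get(key)
	if !ok || e == nil {
		return 0, fmt.Errorf("no cache entry for %s", key)
	}
	return len(e.Scalers), nil
}

func main() {
	c := newCache()
	key := "se-mcondon/worker-powerbi-scaler"
	c.Put(key, &entry{Scalers: []string{"metricsAPIScaler"}})
	fmt.Println(scalerCount(c, key))
	c.Invalidate(key)
	// After invalidation the reader gets an error instead of dereferencing nil.
	fmt.Println(scalerCount(c, key))
}
```

The design point is that the lookup returns an explicit "found" flag and the reader turns a miss into an error, rather than assuming the entry is still present.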
