kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

keda operator pod crashes daily once with an error code 2 #5549

Closed mustaFAB53 closed 1 month ago

mustaFAB53 commented 6 months ago

Keda operator pod crashes once daily with error code 2, even when kept idle (whether or not autoscaling is triggered). The previous container's logs showed the following different errors:

Expected Behavior

keda-operator should not crash

Actual Behavior

keda operator pod crashes daily once with an error code 2

Steps to Reproduce the Problem

  1. Install keda helm chart version 2.13.0 on GKE 1.27 (a sketch of the install commands follows this list)
  2. Wait for a day with or without any load / autoscaling
  3. Keda operator pod will show restart(s)
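
For reference, a minimal sketch of the installation step above (the release name and the dedicated keda namespace are assumptions, not taken from the report):

    # Add the official KEDA chart repository and install chart version 2.13.0
    helm repo add kedacore https://kedacore.github.io/charts
    helm repo update
    helm install keda kedacore/keda --namespace keda --create-namespace --version 2.13.0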

Specifications

Keda Operator Pod Status: Screenshot from 2024-02-28 16-13-17

Attaching the complete keda-operator stack trace of the previous container run

PS: Autoscaling is not affected significantly (even though we get Prometheus query timeouts at random intervals, KEDA does get the metric on retries), but we would like to find the root cause of the keda pod crashing.
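
For anyone reproducing this, the restart count and a previous-container stack trace like the one attached above can be collected along these lines (a sketch, assuming the chart's default keda namespace and app=keda-operator pod label; the pod name is a placeholder):

    # Check the restart count of the operator pod
    kubectl get pods -n keda -l app=keda-operator

    # Capture the stack trace of the previously crashed container
    kubectl logs -n keda <keda-operator-pod-name> --previous > keda-operator-previous.log

    # The exit code (2 here) shows up under "Last State" in the pod description
    kubectl describe pod -n keda <keda-operator-pod-name>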

JorTurFer commented 6 months ago

PS: Autoscaling is not affected significantly (even though we get Prometheus query timeouts at random intervals, KEDA does get the metric on retries), but we would like to find the root cause of the keda pod crashing.

I've not checked it yet, but it looks like an issue with the internal cache. WDYT @zroubalik ?

zroubalik commented 6 months ago

@mustaFAB53 thanks for reporting. Could you please also share the ScaledObject that causes this?

mustaFAB53 commented 6 months ago

Hi @zroubalik,

Attaching the ScaledObject Kubernetes manifest being applied: scaledobject.zip

zroubalik commented 6 months ago

Hi, the polling interval set to 1s is too aggressive. Your Prometheus Server instance is not able to respond in time. I would definitely recommend extending the polling interval to at least 30s, and then trying to find a lower value that's reasonable for you and doesn't produce the following errors in the output:

    {"type": "ScaledObject", "namespace": "app1", "name": "myapp", "error": "Get \"http://prometheus_frontend:9090/api/v1/query?query=truncated_query&time=2024-02-28T09:59:41Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
    /workspace/pkg/scalers/prometheus_scaler.go:391
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
    /workspace/pkg/scaling/cache/scalers_cache.go:130
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
    /workspace/pkg/scaling/scale_handler.go:743
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
    /workspace/pkg/scaling/scale_handler.go:628
2024-02-28T10:00:48Z    ERROR   prometheus_scaler   error executing prometheus query    {"type": "ScaledObject", "namespace": "app1", "name": "myapp", "error": "Get \"http://prometheus_frontend:9090/api/v1/query?query=truncated_query&time=2024-02-28T10:00:45Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
    /workspace/pkg/scalers/prometheus_scaler.go:391
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
    /workspace/pkg/scaling/cache/scalers_cache.go:130
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
    /workspace/pkg/scaling/scale_handler.go:743
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
    /workspace/pkg/scaling/scale_handler.go:628
2024-02-28T10:02:53Z    ERROR   prometheus_scaler   error executing prometheus query
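
For illustration, a minimal sketch of where pollingInterval sits in a Prometheus-based ScaledObject; the name, namespace, and serverAddress below are taken from the log output above, while the target, query, threshold, and replica bounds are placeholders (the attached manifest is not reproduced here):

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: myapp
      namespace: app1
    spec:
      scaleTargetRef:
        name: myapp                 # placeholder target Deployment
      pollingInterval: 30           # was 1; 30s is the suggested starting point
      minReplicaCount: 1            # placeholder
      maxReplicaCount: 10           # placeholder
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://prometheus_frontend:9090
            query: <truncated_query>    # actual query not shown in the logs
            threshold: "100"            # placeholder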

You can also try to tweak HTTP related settings: https://keda.sh/docs/2.13/operate/cluster/#http-timeouts
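
The linked setting is exposed as an environment variable (in milliseconds) on the KEDA Deployments; a minimal sketch of the keda-operator container snippet, where the 5000 value is only an illustration:

    # keda-operator Deployment, container env (value in milliseconds)
    env:
      - name: KEDA_HTTP_DEFAULT_TIMEOUT
        value: "5000"    # example only; mustaFAB53 reports trying 20000 below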

mustaFAB53 commented 6 months ago

Hi @zroubalik,

We have kept the polling interval this aggressive because we wanted scale-up to happen immediately in case of traffic spikes. I will try increasing it to check whether the keda pod stops crashing.

Regarding the timeout settings, I had already tried setting it to 20000 (20s) but could not see any improvement.

Pixis-Akshay-Gopani commented 5 months ago

@zroubalik I am also facing this issue in keda version 2.11.0

zroubalik commented 4 months ago

@mustaFAB53 I understand, but in this case you should also boost your Prometheus, as it is the origin of the problem - it is not able to respond in time.

shmuelarditi commented 3 months ago

+1 - panic: runtime error: invalid memory address or nil pointer dereference. Is anyone working on a fix? Is there something we can do to avoid this?

KEDA 2.11, K8s 1.27

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 1 month ago

This issue has been automatically closed due to inactivity.