Hello! Do you see any error in keda operator logs?
@JorTurFer The keda-operator is going into CrashLoopBackOff, i.e. restarting due to OOM. I don't see any error logs there, but in the keda-operator-metrics-apiserver I see the logs below:
> created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP in goroutine 407499
> /workspace/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:101 +0x1b2
> W0408 05:23:19.046791 1 logging.go:59] [core] [Channel #1 SubChannel #85] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.125.254:9666: connect: connection refused"
> I0408 05:24:04.961431 1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server"
> I0408 05:24:25.534490 1 client.go:88] keda_metrics_adapter/provider
The problem is the KEDA operator pod in this case, so we need those logs. What about the previous container's logs?
How many ScaledObjects do you have?
Currently we have only 2 ScaledObjects: one is for CPU- and memory-based scaling, and the other has the SQS scaler added.
> k get scaledobject -A
> NAMESPACE NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK PAUSED AGE
> staging dtl-dlm-pythonb apps/v1.Deployment dtl-dlm-pythonb 1 5 cpu True False False Unknown 2d21h
> staging ntf-svc-nh-c apps/v1.Deployment ntf-svc-nh-c 1 2 memory keda-trigger-auth-aws-credentials-irsa Unknown Unknown Unknown Unknown 27s
Here is the KEDA pod status:
k get pods -n keda
NAME READY STATUS RESTARTS AGE
keda-admission-webhooks-595bf74cdf-swjdr 1/1 Running 0 2d21h
keda-operator-865595df44-r4p8h 0/1 CrashLoopBackOff 6 (119s ago) 94m
keda-operator-865595df44-xjk47 0/1 CrashLoopBackOff 6 (2m39s ago) 3h39m
keda-operator-metrics-apiserver-84f9498d4d-6h26r 1/1 Running 0 176m
keda-operator-metrics-apiserver-84f9498d4d-k5xqs 1/1 Running 0 27h
Here are the keda-operator's previous container logs:
> k logs keda-operator-865595df44-r4p8h -n keda -p keda-operator
> 2024/04/11 07:23:08 maxprocs: Updating GOMAXPROCS=2: determined from CPU quota
> 2024-04-11T07:23:08Z INFO setup Starting manager
> 2024-04-11T07:23:08Z INFO setup KEDA Version: 2.13.0
> 2024-04-11T07:23:08Z INFO setup Git Commit: 367fcd32f1e72be504ada53165d07c33d75fc0f7
> 2024-04-11T07:23:08Z INFO setup Go Version: go1.21.6
> 2024-04-11T07:23:08Z INFO setup Go OS/Arch: linux/amd64
> 2024-04-11T07:23:08Z INFO setup Running on Kubernetes 1.28+ {"version": "v1.28.7-eks-b9c9ed7"}
> 2024-04-11T07:23:08Z INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
> I0411 07:23:08.977079 1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
> I0411 07:23:25.828622 1 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
> 2024-04-11T07:23:25Z INFO Starting EventSource {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
> 2024-04-11T07:23:25Z INFO Starting EventSource {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
> 2024-04-11T07:23:25Z INFO Starting Controller {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
> 2024-04-11T07:23:25Z INFO Starting EventSource {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
> 2024-04-11T07:23:25Z INFO Starting Controller {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
> 2024-04-11T07:23:25Z INFO Starting EventSource {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
> 2024-04-11T07:23:25Z INFO Starting Controller {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
> 2024-04-11T07:23:25Z INFO Starting EventSource {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "source": "kind source: *v1alpha1.CloudEventSource"}
> 2024-04-11T07:23:25Z INFO Starting Controller {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
> 2024-04-11T07:23:25Z INFO Starting EventSource {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
> 2024-04-11T07:23:25Z INFO cert-rotation starting cert rotator controller
> 2024-04-11T07:23:25Z INFO Starting Controller {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
> 2024-04-11T07:23:25Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
> 2024-04-11T07:23:25Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
> 2024-04-11T07:23:25Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
> 2024-04-11T07:23:25Z INFO Starting Controller {"controller": "cert-rotator"}
> 2024-04-11T07:23:25Z INFO cert-rotation no cert refresh needed
> 2024-04-11T07:23:25Z INFO Starting workers {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
> 2024-04-11T07:23:25Z INFO Reconciling ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "07b413e1-63a5-44aa-bbff-c4117d7b8be8"}
> 2024-04-11T07:23:25Z INFO Starting workers {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
> 2024-04-11T07:23:25Z INFO Starting workers {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
> 2024-04-11T07:23:25Z INFO cert-rotation certs are ready in /certs
> 2024-04-11T07:23:25Z INFO Starting workers {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
> 2024-04-11T07:23:25Z INFO Reconciling ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2e5589f4-e024-4cdb-b398-b14fa69cde73"}
> 2024-04-11T07:23:25Z INFO Starting workers {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "worker count": 1}
> 2024-04-11T07:23:25Z INFO Detected resource targeted for scaling {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "07b413e1-63a5-44aa-bbff-c4117d7b8be8", "resource": "apps/v1.Deployment", "name": "dtl-dlm-pythonb"}
> 2024-04-11T07:23:25Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
> 2024-04-11T07:23:25Z INFO cert-rotation no cert refresh needed
> 2024-04-11T07:23:25Z INFO cert-rotation Ensuring CA cert {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
> 2024-04-11T07:23:25Z INFO Detected resource targeted for scaling {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2e5589f4-e024-4cdb-b398-b14fa69cde73", "resource": "apps/v1.Deployment", "name": "ntf-svc-nh-c"}
> 2024-04-11T07:23:25Z INFO Creating a new HPA {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2e5589f4-e024-4cdb-b398-b14fa69cde73", "HPA.Namespace": "staging", "HPA.Name": "keda-hpa-ntf-svc-nh-c"}
> 2024-04-11T07:23:25Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
> 2024-04-11T07:23:25Z INFO cert-rotation no cert refresh needed
> 2024-04-11T07:23:25Z INFO cert-rotation Ensuring CA cert {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
> 2024-04-11T07:23:26Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
> 2024-04-11T07:23:27Z INFO cert-rotation CA certs are injected to webhooks
> 2024-04-11T07:23:27Z INFO grpc_server Starting Metrics Service gRPC Server {"address": ":9666"}
> 2024-04-11T07:23:27Z INFO Initializing Scaling logic according to ScaledObject Specification {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "07b413e1-63a5-44aa-bbff-c4117d7b8be8"}
> 2024-04-11T07:23:27Z INFO Reconciling ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "943c5385-e3b4-4740-9d95-95ba12c15c8a"}
> 2024-04-11T07:23:27Z INFO Detected resource targeted for scaling {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "943c5385-e3b4-4740-9d95-95ba12c15c8a", "resource": "apps/v1.Deployment", "name": "dtl-dlm-pythonb"}
KEDA metrics apiserver logs:
> k logs keda-operator-metrics-apiserver-84f9498d4d-6h26r -n keda
> 2024/04/11 04:35:21 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
> I0411 04:35:22.152840 1 welcome.go:34] keda_metrics_adapter "msg"="Starting metrics server"
> I0411 04:35:22.152863 1 welcome.go:35] keda_metrics_adapter "msg"="KEDA Version: 2.13.0"
> I0411 04:35:22.152875 1 welcome.go:36] keda_metrics_adapter "msg"="Git Commit: 367fcd32f1e72be504ada53165d07c33d75fc0f7"
> I0411 04:35:22.152888 1 welcome.go:37] keda_metrics_adapter "msg"="Go Version: go1.21.6"
> I0411 04:35:22.152900 1 welcome.go:38] keda_metrics_adapter "msg"="Go OS/Arch: linux/amd64"
> I0411 04:35:22.152962 1 welcome.go:39] keda_metrics_adapter "msg"="Running on Kubernetes 1.28+" "version"={"major":"1","minor":"28+","gitVersion":"v1.28.7-eks-b9c9ed7","gitCommit":"217b17b5aed5ddc370672ff6c12d78037cbb8e7a","gitTreeState":"clean","buildDate":"2024-03-02T03:47:14Z","goVersion":"go1.21.7","compiler":"gc","platform":"linux/amd64"}
> I0411 04:35:22.153683 1 main.go:125] keda_metrics_adapter "msg"="Connecting Metrics Service gRPC client to the server" "address"="keda-operator.keda.svc.cluster.local:9666"
> I0411 04:35:22.177466 1 provider.go:58] keda_metrics_adapter/provider "msg"="starting"
> I0411 04:35:22.177522 1 main.go:259] keda_metrics_adapter "msg"="starting adapter..."
> I0411 04:35:22.179088 1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server"
> I0411 04:35:22.179128 1 main.go:176] keda_metrics_adapter "msg"="starting /metrics server endpoint"
> W0411 04:35:22.436600 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> I0411 04:35:22.933054 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/certs/tls.crt::/certs/tls.key"
> I0411 04:35:22.933336 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
> I0411 04:35:22.933358 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
> I0411 04:35:22.934083 1 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/certs/ca.crt"
> I0411 04:35:22.934213 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
> I0411 04:35:22.934227 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
> I0411 04:35:22.935197 1 secure_serving.go:213] Serving securely on [::]:6443
> I0411 04:35:22.935257 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
> I0411 04:35:23.033444 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
> I0411 04:35:23.034803 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
> W0411 04:35:23.444580 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 04:35:25.230358 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> I0411 04:35:28.697084 1 provider.go:66] keda_metrics_adapter/provider "msg"="Connection to KEDA Metrics Service gRPC server has been successfully established" "server"="keda-operator.keda.svc.cluster.local:9666"
> I0411 04:45:15.807582 1 trace.go:236] Trace[643673315]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:49d6f012-da42-466f-819c-269e60d8a349,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 04:45:15.199) (total time: 608ms):
> Trace[643673315]: ---"Listing from storage done" 607ms (04:45:15.807)
> Trace[643673315]: [608.170519ms] [608.170519ms] END
> I0411 05:16:56.619705 1 trace.go:236] Trace[644596009]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:0c46820e-d630-421c-b20d-f79061b2e28e,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:16:54.699) (total time: 1919ms):
> Trace[644596009]: ---"Listing from storage done" 1919ms (05:16:56.619)
> Trace[644596009]: [1.919699701s] [1.919699701s] END
> I0411 05:26:32.863772 1 trace.go:236] Trace[1355079054]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:f281c6b5-69ad-4456-a9c2-4fba9f70c0c2,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:26:32.350) (total time: 513ms):
> Trace[1355079054]: ---"Listing from storage done" 513ms (05:26:32.863)
> Trace[1355079054]: [513.637067ms] [513.637067ms] END
> I0411 05:44:01.477798 1 trace.go:236] Trace[922117958]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:87df784c-3234-416c-97bd-17824c9cf0ab,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:44:00.907) (total time: 570ms):
> Trace[922117958]: ---"Listing from storage done" 570ms (05:44:01.477)
> Trace[922117958]: [570.608793ms] [570.608793ms] END
> I0411 05:49:36.374413 1 trace.go:236] Trace[1657932732]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:359dce89-e345-4722-adbc-7d77117fade3,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:49:35.792) (total time: 581ms):
> Trace[1657932732]: ---"Listing from storage done" 581ms (05:49:36.374)
> Trace[1657932732]: [581.756559ms] [581.756559ms] END
> I0411 05:51:07.990366 1 trace.go:236] Trace[239221352]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:4c292ca4-0fba-420f-b448-fa1bde5c05ff,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:51:07.410) (total time: 580ms):
> Trace[239221352]: ---"Listing from storage done" 579ms (05:51:07.990)
> Trace[239221352]: [580.10165ms] [580.10165ms] END
> I0411 05:57:19.533902 1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server"
> I0411 05:57:20.236923 1 trace.go:236] Trace[1925121293]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:f9575a56-81a5-47de-921e-b400aa30e159,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:57:19.533) (total time: 703ms):
> Trace[1925121293]: ---"Listing from storage done" 702ms (05:57:20.236)
> Trace[1925121293]: [703.04749ms] [703.04749ms] END
> I0411 06:11:02.377385 1 trace.go:236] Trace[1044216767]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:779a2f79-2716-44d8-9e5c-a9826c31e329,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:11:01.794) (total time: 582ms):
> Trace[1044216767]: ---"Listing from storage done" 581ms (06:11:02.375)
> Trace[1044216767]: [582.313639ms] [582.313639ms] END
> I0411 06:22:10.802200 1 trace.go:236] Trace[793805765]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:09c2c32c-ea08-44bc-8c2b-236fd83169ac,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:22:10.189) (total time: 612ms):
> Trace[793805765]: ---"Listing from storage done" 612ms (06:22:10.802)
> Trace[793805765]: [612.320822ms] [612.320822ms] END
> I0411 06:30:17.878940 1 trace.go:236] Trace[1231255657]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:fa2585f1-15b9-4b60-a255-36b1ba92c92b,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:30:15.815) (total time: 2063ms):
> Trace[1231255657]: ---"Listing from storage done" 2063ms (06:30:17.878)
> Trace[1231255657]: [2.06330352s] [2.06330352s] END
> I0411 06:37:25.510212 1 trace.go:236] Trace[805102546]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:ee38a2f0-35dc-4e14-8d5b-1f33ea94de6e,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:37:24.949) (total time: 560ms):
> Trace[805102546]: ---"Listing from storage done" 560ms (06:37:25.509)
> Trace[805102546]: [560.520059ms] [560.520059ms] END
> I0411 06:52:36.208355 1 trace.go:236] Trace[1665426851]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:e4487d9f-7d98-4f19-9559-3c1d783c66fe,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:52:35.640) (total time: 567ms):
> Trace[1665426851]: ---"Listing from storage done" 567ms (06:52:36.208)
> Trace[1665426851]: [567.6487ms] [567.6487ms] END
> I0411 07:19:56.653991 1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server"
> I0411 07:19:57.344252 1 trace.go:236] Trace[1664014102]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:26bfc4dd-bd9f-4b5e-94d4-23ff2cdc04a0,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 07:19:56.653) (total time: 690ms):
> Trace[1664014102]: ---"Listing from storage done" 690ms (07:19:57.344)
> Trace[1664014102]: [690.301606ms] [690.301606ms] END
> I0411 07:20:56.166708 1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server"
> I0411 07:20:56.838948 1 trace.go:236] Trace[271217008]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:f78e8767-fa07-4dbd-a504-344560dadfc1,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 07:20:56.166) (total time: 672ms):
> Trace[271217008]: ---"Listing from storage done" 672ms (07:20:56.838)
> Trace[271217008]: [672.262607ms] [672.262607ms] END
> I0411 07:21:36.861992 1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server"
> W0411 07:21:36.868593 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:21:37.871959 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:21:39.519084 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:21:41.619342 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:21:46.149051 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> I0411 07:21:52.561806 1 trace.go:236] Trace[346857927]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:9fd2f6c1-cff8-41c9-872b-4b5657db0a4e,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 07:21:36.861) (total time: 15699ms):
> Trace[346857927]: ---"Listing from storage done" 15699ms (07:21:52.561)
> Trace[346857927]: [15.699852644s] [15.699852644s] END
> I0411 07:22:10.441735 1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server"
> W0411 07:22:10.453372 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:11.458402 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:12.849287 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:14.909921 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:19.246492 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:27.023611 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:36.183544 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:56.137941 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> E0411 07:23:10.506869 1 provider.go:91] keda_metrics_adapter/provider "msg"="timeout" "error"="timeout while waiting to establish gRPC connection to KEDA Metrics Service server" "server"="keda-operator.keda.svc.cluster.local:9666"
> I0411 07:23:10.506985 1 trace.go:236] Trace[1755005721]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:04f1837a-d0db-4408-86ae-97ae6306d089,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 07:22:10.441) (total time: 60065ms):
> Trace[1755005721]: [1m0.065285782s] [1m0.065285782s] END
> E0411 07:23:10.507391 1 timeout.go:142] post-timeout activity - time-elapsed: 71.754312ms, GET "/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus" result: runtime error: invalid memory address or nil pointer dereference
> goroutine 61560 [running]:
> k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
> /workspace/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:110 +0x9c
> panic({0x26453c0?, 0x476d830?})
> /usr/local/go/src/runtime/panic.go:914 +0x21f
> sigs.k8s.io/custom-metrics-apiserver/pkg/registry/external_metrics.(*REST).List(0xc000860e10, {0x2fa2630, 0xc0014bec90}, 0x0?)
> /workspace/vendor/sigs.k8s.io/custom-metrics-apiserver/pkg/registry/external_metrics/reststorage.go:92 +0x112
> k8s.io/apiserver/pkg/endpoints/handlers.ListResource.func1({0x2f93740?, 0xc001223fc0}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/handlers/get.go:278 +0xe9c
> sigs.k8s.io/custom-metrics-apiserver/pkg/apiserver/installer.(*EMHandlers).registerResourceHandlers.restfulListResource.func1(0xc001223fa0, 0xc00061f570)
> /workspace/vendor/sigs.k8s.io/custom-metrics-apiserver/pkg/apiserver/installer/installer.go:274 +0x6a
> sigs.k8s.io/custom-metrics-apiserver/pkg/apiserver/installer.(*EMHandlers).registerResourceHandlers.InstrumentRouteFunc.func2(0xc001223fa0, 0xc00061f570)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go:571 +0x1ce
> github.com/emicklei/go-restful/v3.(*Container).dispatch(0xc000b6ec60, {0x2f93740, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/github.com/emicklei/go-restful/v3/container.go:299 +0x5f0
> github.com/emicklei/go-restful/v3.(*Container).Dispatch(...)
> /workspace/vendor/github.com/emicklei/go-restful/v3/container.go:204
> k8s.io/apiserver/pkg/server.director.ServeHTTP({{0x2ae7a39?, 0x2976b00?}, 0xc000b6ec60?, 0xc0004f0d20?}, {0x2f93740, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/server/handler.go:146 +0x56c
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func21({0x2f93740?, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
> net/http.HandlerFunc.ServeHTTP(0x2fa2630?, {0x2f93740?, 0xc001223d60?}, 0x4?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filters.withAuthorization.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filters/authorization.go:78 +0x639
> net/http.HandlerFunc.ServeHTTP(0xc17e01c89a01ca85?, {0x2f93740?, 0xc001223d60?}, 0xc0006c58b8?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192
> net/http.HandlerFunc.ServeHTTP(0x47875e0?, {0x2f93740?, 0xc001223d60?}, 0x4?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/server/filters/maxinflight.go:196 +0x262
> net/http.HandlerFunc.ServeHTTP(0x192bf2f?, {0x2f93740?, 0xc001223d60?}, 0x2f6b130?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func23({0x2f93740?, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
> net/http.HandlerFunc.ServeHTTP(0x1?, {0x2f93740?, 0xc001223d60?}, 0x1?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithImpersonation.func4({0x2f93740, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filters/impersonation.go:50 +0x1c3
> net/http.HandlerFunc.ServeHTTP(0x812080?, {0x2f93740?, 0xc001223d60?}, 0x0?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192
> net/http.HandlerFunc.ServeHTTP(0x192bf2f?, {0x2f93740?, 0xc001223d60?}, 0x2f6b130?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func24({0x2f93740?, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
> net/http.HandlerFunc.ServeHTTP(0x812080?, {0x2f93740?, 0xc001223d60?}, 0x2?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192
> net/http.HandlerFunc.ServeHTTP(0x192bf2f?, {0x2f93740?, 0xc001223d60?}, 0x2f6b130?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func26({0x2f93740?, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
> net/http.HandlerFunc.ServeHTTP(0x2fa2630?, {0x2f93740?, 0xc001223d60?}, 0x2f651f8?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filters.withAuthentication.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filters/authentication.go:120 +0x7e5
> net/http.HandlerFunc.ServeHTTP(0x2fa2630?, {0x2f93740?, 0xc001223d60?}, 0x2f6b130?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x2f93740, 0xc001223d60}, 0xc000d85200)
> /workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:94 +0x37a
> net/http.HandlerFunc.ServeHTTP(0x0?, {0x2f93740?, 0xc001223d60?}, 0x476d6b0?)
> /usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1()
> /workspace/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:115 +0x62
> created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP in goroutine 61559
> /workspace/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:101 +0x1b2
> W0411 07:23:22.835977 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:24:02.456999 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:25:21.600341 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
Pod description and events:
> State: Waiting
> Reason: CrashLoopBackOff
> Last State: Terminated
> Reason: OOMKilled
> Exit Code: 137
> Started: Thu, 11 Apr 2024 12:55:10 +0530
> Finished: Thu, 11 Apr 2024 12:55:36 +0530
> Ready: False
> Restart Count: 5
> Limits:
> cpu: 2
> memory: 4Gi
> Requests:
> cpu: 500m
> memory: 1Gi
> Liveness: http-get http://:8081/healthz delay=25s timeout=1s period=10s #success=1 #failure=3
> Readiness: http-get http://:8081/readyz delay=20s timeout=1s period=3s #success=1 #failure=3
> Environment:
> WATCH_NAMESPACE:
> POD_NAME: keda-operator-865595df44-r4p8h (v1:metadata.name)
> POD_NAMESPACE: keda (v1:metadata.namespace)
> OPERATOR_NAME: keda-operator
> KEDA_HTTP_DEFAULT_TIMEOUT: 3000
> KEDA_HTTP_MIN_TLS_VERSION: TLS12
> AWS_STS_REGIONAL_ENDPOINTS: regional
> AWS_DEFAULT_REGION: ap-south-1
> AWS_REGION: ap-south-1
> AWS_ROLE_ARN: arn:aws:iam::845925808232:role/mtDevopsKedaReadRole
> AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
> Mounts:
> /certs from certificates (ro)
> /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
> /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m2s7l (ro)
> Conditions:
> Type Status
> Initialized True
> Ready False
> ContainersReady False
> PodScheduled True
> Volumes:
> aws-iam-token:
> Type: Projected (a volume that contains injected data from multiple sources)
> TokenExpirationSeconds: 86400
> certificates:
> Type: Secret (a volume populated by a Secret)
> SecretName: kedaorg-certs
> Optional: true
> kube-api-access-m2s7l:
> Type: Projected (a volume that contains injected data from multiple sources)
> TokenExpirationSeconds: 3607
> ConfigMapName: kube-root-ca.crt
> ConfigMapOptional: <nil>
> DownwardAPI: true
> QoS Class: Burstable
> Node-Selectors: kubernetes.io/os=linux
> Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
> node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
> Events:
> Type Reason Age From Message
> ---- ------ ---- ---- -------
> Warning Unhealthy 6m40s kubelet Readiness probe failed: Get "http://10.83.52.218:8081/readyz": dial tcp 10.83.52.218:8081: connect: connection refused
> Normal Pulled 6m39s kubelet Successfully pulled image "ghcr.io/kedacore/keda:2.13.0" in 745ms (745ms including waiting)
> Warning Unhealthy 5m40s kubelet Readiness probe failed: Get "http://10.83.52.218:8081/readyz": read tcp 10.83.63.130:59102->10.83.52.218:8081: read: connection reset by peer
> Normal Pulled 5m22s kubelet Successfully pulled image "ghcr.io/kedacore/keda:2.13.0" in 752ms (752ms including waiting)
> Warning Unhealthy 4m51s kubelet Readiness probe failed: Get "http://10.83.52.218:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
> Normal Pulling 4m18s (x4 over 89m) kubelet Pulling image "ghcr.io/kedacore/keda:2.13.0"
> Normal Created 4m17s (x4 over 88m) kubelet Created container keda-operator
> Normal Started 4m17s (x4 over 88m) kubelet Started container keda-operator
> Normal Pulled 4m17s kubelet Successfully pulled image "ghcr.io/kedacore/keda:2.13.0" in 748ms (748ms including waiting)
> Warning BackOff 93s (x19 over 5m39s) kubelet Back-off restarting failed container keda-operator in pod keda-operator-865595df44-r4p8h_keda(4d548af6-80fe-43c5-806f-b8f80eed33f7)
Let me know if you need any more details. Thanks @JorTurFer for looking into it.
You are using the latest version of KEDA, right? Could you extract a memory profile and share it, please? https://keda.sh/docs/2.13/troubleshooting/#troubleshoot-keda-errors-using-profiling
Yes, it's on the latest version, i.e. 2.13.1:
> helm history keda -n keda
> REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
> 1 Mon Apr 8 15:04:10 2024 superseded keda-2.13.1 2.13.0 Install complete
> 2 Thu Apr 11 13:10:31 2024 deployed keda-2.13.1 2.13.0 Upgrade complete
Could you extract a memory profile and share it please? https://keda.sh/docs/2.13/troubleshooting/#troubleshoot-keda-errors-using-profiling
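For reference, a minimal sketch of what extracting a profile can look like, assuming profiling has been enabled on the operator and exposed on port 8082 (the port used later in this thread; see the linked guide for the exact setup):

```shell
# Forward the operator's (assumed) profiling port to the local machine.
kubectl port-forward -n keda deploy/keda-operator 8082:8082 &

# Take an instantaneous heap snapshot and save it locally,
# so the data lives outside the pod.
curl -s http://localhost:8082/debug/pprof/heap -o keda-heap.pb.gz

# Inspect the snapshot; `top` and `web` are useful pprof commands.
go tool pprof keda-heap.pb.gz
```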
I have an ARM machine:

> uname -m
> arm64

@JorTurFer It looks like this tool (https://go.dev/blog/pprof) runs only on amd64 machines. Is there any other way/tool for profiling on ARM?
I think it'll work with arm64 too; the post is quite old. Can you give it a try? pprof is a standard in Go.
As the keda-operator pods are restarting very frequently, I'm not able to start profiling on them. Getting the error below:
```
go tool pprof http://localhost:8082/debug/pprof/profile
Fetching profile over HTTP from http://localhost:8082/debug/pprof/profile
http://localhost:8082/debug/pprof/profile: Get "http://localhost:8082/debug/pprof/profile": EOF
failed to fetch any source profiles
```

And the port-forward also exits after the above command:
```
kubectl port-forward keda-operator-8888c8544-qk8wn -n keda 8082:8082
Forwarding from 127.0.0.1:8082 -> 8082
Forwarding from [::1]:8082 -> 8082
Handling connection for 8082
E0411 14:02:25.170512 38035 portforward.go:406] an error occurred forwarding 8082 -> 8082: error forwarding port 8082 to pod 339e41bc9574f9fa109a350c1efe3f3bdbdbc416338031e55f399b3a999cf672, uid : failed to execute portforward in network namespace "/var/run/netns/cni-a9b64bd0-a4db-c87e-373e-ba510cba82c3": failed to connect to localhost:8082 inside namespace "339e41bc9574f9fa109a350c1efe3f3bdbdbc416338031e55f399b3a999cf672", IPv4: dial tcp4 127.0.0.1:8082: connect: connection refused IPv6 dial tcp6: address localhost: no suitable address found
E0411 14:02:25.171839 38035 portforward.go:234] lost connection to pod
```
Could you share your TriggerAuthentication too? Are you using `aws-eks` or `aws`?
We're using `aws-eks`:
k get scaledobject -A
NAMESPACE NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK PAUSED AGE
staging dtl-dlm-pythonb apps/v1.Deployment dtl-dlm-pythonb 1 5 cpu True False False Unknown 3d
staging ntf-svc-nh-c apps/v1.Deployment ntf-svc-nh-c 1 2 memory keda-trigger-auth-aws-credentials-irsa Unknown Unknown Unknown Unknown 177m
k get triggerauth -A
NAMESPACE NAME PODIDENTITY SECRET ENV VAULTADDRESS
staging keda-trigger-auth-aws-credentials-irsa aws-eks
Okay, if you remove the SQS ScaledObject temporarily, does KEDA become stable? I mean, we are sure that it's the SQS scaler, right?
We detected some problems with `aws-eks` authentication and we developed a new one, `aws`. From the AWS point of view it's the same, but we have implemented some caches and other improvements.
If you are using the scaled workload's role, you have to configure the TriggerAuthentication like:

```yaml
spec:
  podIdentity:
    provider: aws
    identityOwner: workload
```
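For context, a ScaledObject wired to a TriggerAuthentication like this might look like the sketch below; the queue URL is a placeholder, and only the auth wiring is the point here:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ntf-svc-nh-c
  namespace: staging
spec:
  scaleTargetRef:
    name: ntf-svc-nh-c
  minReplicaCount: 1
  maxReplicaCount: 2
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-trigger-auth-aws-credentials-irsa
      metadata:
        # Placeholder values, not taken from this thread:
        queueURL: https://sqs.ap-south-1.amazonaws.com/<account-id>/<queue-name>
        queueLength: "5"
        awsRegion: ap-south-1
```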
Yes, when we remove the SQS ScaledObject, KEDA becomes stable. Not a single restart on the KEDA operator pods.
@JorTurFer Just now tried with `aws` with identityOwner set to workload; still getting OOM issues on keda-operator:
k get scaledobject -A
NAMESPACE NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK PAUSED AGE
staging dtl-dlm-pythonb apps/v1.Deployment dtl-dlm-pythonb 1 5 cpu True False False Unknown 3d2h
staging ntf-svc-nh-c apps/v1.Deployment ntf-svc-nh-c 1 2 memory keda-trigger-auth-aws-credentials-irsa Unknown Unknown Unknown Unknown 5h21m
k get triggerauth -A
NAMESPACE NAME PODIDENTITY SECRET ENV VAULTADDRESS
staging keda-trigger-auth-aws-credentials-irsa aws
k get pods -n keda
NAME READY STATUS RESTARTS AGE
keda-admission-webhooks-595bf74cdf-swjdr 1/1 Running 0 3d3h
keda-operator-768cf8cd46-pcjs9 0/1 CrashLoopBackOff 4 (13s ago) 7m48s
keda-operator-768cf8cd46-z6j9b 0/1 CrashLoopBackOff 4 (45s ago) 7m48s
keda-operator-metrics-apiserver-84f9498d4d-k5xqs 1/1 Running 0 32h
keda-operator-metrics-apiserver-84f9498d4d-nlx6g 1/1 Running 0 4h59m
Adding the TriggerAuthentication definition here:
```
k get triggerauth keda-trigger-auth-aws-credentials-irsa -n staging -o yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
annotations:
meta.helm.sh/release-name: devops--configs-staging
meta.helm.sh/release-namespace: staging
creationTimestamp: "2024-04-08T09:35:55Z"
finalizers:
- finalizer.keda.sh
generation: 4
labels:
app.kubernetes.io/managed-by: Helm
name: keda-trigger-auth-aws-credentials-irsa
namespace: staging
resourceVersion: "3074490872"
uid: da7e8702-058e-400a-99c4-ae47c7adda75
spec:
podIdentity:
identityOwner: workload
provider: aws
roleArn: ""
```
@zroubalik , any idea?
@JorTurFer @zroubalik Any update on this? We are blocked on this to onboard some apps to prod with the KEDA SQS scaler.
This is strange. Also hard to say without details. Could you please try a different AWS scaler with the same style of auth? And maybe start profiling on a running KEDA and then add the SQS ScaledObject?
We observed that KEDA is stable after a fresh install. Once we add any ScaledObject, even a CPU/memory one, it starts going OOM, and it continues in an OOM loop.
Is there any way to write the profiling data outside the pod, so we can visualize it even after a pod restart?
> Is there any way to write the profiling data outside the pod, so we can visualize it even after a pod restart?
It's not possible to profile KEDA if it dies automatically just after start. Do you have a huge amount of resources in your cluster or so? I mean, are you facing this on a huge cluster? Maybe there is an issue with the operator caches trying to pull a lot of resources?
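(For reference, one workaround sketch if the profiling endpoint is reachable for even a few seconds after startup, assuming port 8082 as used earlier in this thread: snapshot the heap in a loop so the most recent capture survives the OOM kill.)

```shell
kubectl port-forward -n keda deploy/keda-operator 8082:8082 &

# Write timestamped heap snapshots until the connection drops; heap
# snapshots are instantaneous, unlike the 30s CPU profile that failed above.
while curl -sf http://localhost:8082/debug/pprof/heap -o "heap-$(date +%s).pb.gz"; do
  sleep 2
done

# Analyze the newest snapshot (if pprof rejects it as truncated, use the previous one).
go tool pprof "$(ls -t heap-*.pb.gz | head -n 1)"
```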
@JorTurFer Yes, our cluster is big, but we are facing this issue after upgrading KEDA from 2.11.* to 2.12 and then to 2.13.
We even tried to uninstall KEDA and re-install it. But as soon as we add an SQS or Prometheus scaler, the keda-operator pods go OOM.
We have been blocked on this for quite long. This is impacting the scalability of some apps in the prod environment.
As we can't do profiling due to the pod crashing, are there any other ways or flags that we can test out?
@JorTurFer Did some more analysis. Operator usage readings over time, ending in an OOM kill:

cpu: 0.0007, 0.0112
mem: 16MB, 38MB

cpu: 0.0007, 0.0184
mem: 16MB, 256MB

terminated
Reason: OOMKilled - exit code: 137
k get pods -n keda
NAME READY STATUS RESTARTS AGE
keda-admission-webhooks-579c8f6d95-tcbl7 1/1 Running 0 119m
keda-operator-75f5dd7c96-n7tnd 0/1 CrashLoopBackOff 7 (2m15s ago) 119m
keda-operator-75f5dd7c96-nsw52 0/1 CrashLoopBackOff 6 (4m11s ago) 15m
keda-operator-75f5dd7c96-q8snr 0/1 CrashLoopBackOff 7 (2m46s ago) 119m
keda-operator-metrics-apiserver-55b44f9b88-972cs 1/1 Running 0 119m
keda-operator-metrics-apiserver-55b44f9b88-t2jds 1/1 Running 0 119m
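(A simple way to capture a usage trace like this across restarts, assuming metrics-server is available for `kubectl top`:)

```shell
# Poll per-container usage every 15s; the log persists outside the pod,
# so the growth curve up to each OOM kill is preserved.
while true; do
  echo "--- $(date -u +%FT%TZ)" >> keda-usage.log
  kubectl top pod -n keda --containers --no-headers >> keda-usage.log
  sleep 15
done
```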
When we check usage in Grafana, we see it's within the specified limits, as we have allocated the resources below to the KEDA operator:

```yaml
resources:
  limits:
    cpu: '1'
    memory: 4Gi
  requests:
    cpu: 100m
    memory: 2Gi
```
Attaching the Grafana screenshot here.

![Grafana screenshot of keda-operator resource usage (2024-04-30)](https://github.com/kedacore/keda/assets/114404258/992ce86b-3751-4960-8677-e7d0fae41644)
I can see a lot of network I/O operations on keda-operator as soon as the SQS scaler is added; please refer to the screenshot above. Can this be the issue?
I'd appreciate it if you could look into this again.
Is the network activity between the operator and the internet, or between the KEDA components? The problem isn't v2.13 but v2.12, as far as I understand, right?
I'm checking, and there is one big change between 2.11 and 2.12 related to the SDK (https://github.com/kedacore/keda/commit/791c895bf53984220bef9981e161cf00d5fcc9ee), because we changed from SDK v1 to SDK v2, but it's the only AWS-related change that I can see, and it doesn't justify the issue with Prometheus.
When you are getting the pod logs, you are getting the previous container logs too, right? (the `-p` modifier)
@JorTurFer The network activity is between the operator and the internet. Until KEDA v2.11 everything was working fine; after upgrading to 2.12, 2.13, and 2.14 we are seeing OOM issues. Yes, I can see the previous OOM-killed container's logs too.
k get pods -n keda
NAME READY STATUS RESTARTS AGE
keda-admission-webhooks-579c8f6d95-tcbl7 1/1 Running 0 44h
keda-operator-75f5dd7c96-krghr 0/1 CrashLoopBackOff 5 (80s ago) 18m
keda-operator-75f5dd7c96-tgzr4 0/1 CrashLoopBackOff 5 (24s ago) 7m56s
keda-operator-metrics-apiserver-55b44f9b88-252qk 1/1 Running 0 19h
keda-operator-metrics-apiserver-55b44f9b88-972cs 1/1 Running 0 44h
sushamakothawale@Sushama-Kothawale DOPS-6666 % k logs keda-operator-75f5dd7c96-krghr -n keda -p
2024/05/02 06:24:51 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
2024-05-02T06:24:51Z INFO setup Starting manager
2024-05-02T06:24:51Z INFO setup KEDA Version: 2.14.0
2024-05-02T06:24:51Z INFO setup Git Commit: 6681d5ecbc82293626cab693625c9cc9fb453381
2024-05-02T06:24:51Z INFO setup Go Version: go1.21.9
2024-05-02T06:24:51Z INFO setup Go OS/Arch: linux/amd64
2024-05-02T06:24:51Z INFO setup Running on Kubernetes 1.28+ {"version": "v1.28.7-eks-b9c9ed7"}
2024-05-02T06:24:51Z INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
I0502 06:24:51.761839 1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
I0502 06:25:08.793430 1 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
2024-05-02T06:25:08Z INFO Starting EventSource {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2024-05-02T06:25:08Z INFO Starting EventSource {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
2024-05-02T06:25:08Z INFO Starting Controller {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-05-02T06:25:08Z INFO Starting EventSource {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2024-05-02T06:25:08Z INFO Starting Controller {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-05-02T06:25:08Z INFO Starting EventSource {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2024-05-02T06:25:08Z INFO Starting Controller {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-05-02T06:25:08Z INFO Starting EventSource {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "source": "kind source: *v1alpha1.CloudEventSource"}
2024-05-02T06:25:08Z INFO Starting Controller {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-05-02T06:25:08Z INFO Starting EventSource {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2024-05-02T06:25:08Z INFO Starting Controller {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-05-02T06:25:08Z INFO cert-rotation starting cert rotator controller
2024-05-02T06:25:08Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-05-02T06:25:08Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-05-02T06:25:08Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-05-02T06:25:08Z INFO Starting Controller {"controller": "cert-rotator"}
2024-05-02T06:25:08Z INFO cert-rotation no cert refresh needed
2024-05-02T06:25:08Z INFO cert-rotation certs are ready in /certs
2024-05-02T06:25:08Z INFO Starting workers {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
2024-05-02T06:25:08Z INFO Starting workers {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
2024-05-02T06:25:08Z INFO Reconciling ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2d7f0ebe-8be2-482f-a9a1-57fc7988071b"}
2024-05-02T06:25:08Z INFO Starting workers {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
2024-05-02T06:25:08Z INFO Reconciling ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "fa96a24f-b9b3-43a7-ab06-120fe844d5a3"}
2024-05-02T06:25:08Z INFO Detected resource targeted for scaling {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "fa96a24f-b9b3-43a7-ab06-120fe844d5a3", "resource": "apps/v1.Deployment", "name": "dtl-dlm-pythonb"}
2024-05-02T06:25:08Z INFO Detected resource targeted for scaling {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2d7f0ebe-8be2-482f-a9a1-57fc7988071b", "resource": "apps/v1.Deployment", "name": "ntf-svc-nh-c"}
2024-05-02T06:25:08Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
2024-05-02T06:25:08Z INFO cert-rotation no cert refresh needed
2024-05-02T06:25:08Z INFO cert-rotation Ensuring CA cert {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-02T06:25:09Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-05-02T06:25:09Z INFO cert-rotation no cert refresh needed
2024-05-02T06:25:09Z INFO cert-rotation Ensuring CA cert {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-02T06:25:09Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-05-02T06:25:09Z INFO Starting workers {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "worker count": 1}
2024-05-02T06:25:09Z INFO Starting workers {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
2024-05-02T06:25:10Z INFO cert-rotation CA certs are injected to webhooks
2024-05-02T06:25:10Z INFO grpc_server Starting Metrics Service gRPC Server {"address": ":9666"}
2024-05-02T06:25:10Z INFO Initializing Scaling logic according to ScaledObject Specification {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "fa96a24f-b9b3-43a7-ab06-120fe844d5a3"}
We are using only 2 external scalers as of now, i.e. SQS and Prometheus, and for both of these scalers we see OOM issues.
At this point, not having any way to extract the pprof info, the only option that I can offer is debugging it locally.
You can use your current cluster configuration, just by having the cluster selected in your kubeconfig. From local, you can check whether you have the same issue or not.
To debug the code locally, you just need to follow this step-by-step guide and you will be able to run the KEDA operator on your local machine: https://github.com/kedacore/keda/blob/main/BUILD.md#debugging-with-vs-code (just ignore the `make deploy` part, because you already have KEDA installed on your cluster).
There you can see if the memory grows or not, and if it does, you can use pprof with your local process.
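(Roughly, that flow looks like the sketch below; `make run` is assumed here as the usual kubebuilder-style target, so follow BUILD.md for the authoritative steps.)

```shell
git clone https://github.com/kedacore/keda.git
cd keda
# Runs the operator as a local process against the cluster currently
# selected in your kubeconfig; it can then be profiled locally.
make run
```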
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity.
KEDA operator pod memory increasing indefinitely until OOM when using the SQS scaler.
Expected Behavior
KEDA should work properly.
Actual Behavior
KEDA operator pods going OOM, causing multiple pod restarts of keda-operator.
Steps to Reproduce the Problem
Specifications
Attaching the memory usage snapshot from Grafana; memory spikes as soon as the SQS scaler is added.
@JorTurFer Can you please check this? Due to this issue we are blocked on integrating KEDA for a few apps' deployments in the prod environment.