kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Getting the error: error when patching "obj.yaml": Timeout: request did not complete within requested timeout - context deadline exceeded #5700

Closed ghostx31 closed 1 month ago

ghostx31 commented 4 months ago

Report

We manage our deployments using ArgoCD. We recently upgraded KEDA from 2.6 to 2.11.2.

We have this issue on only two specific apps after the upgrade. The exact error message is:

Error from server (Timeout): error when applying patch:
...
[{\"metadata\":{\"metricName\":\"rabbitmq_queue_messages\",\"query\":\"sum(rabbitmq_queue_messages{queue=\\\"<redacted>\\\"}) + sum(rabbitmq_queue_messages{queue=\\\"<redacted>\\\"})\",\"serverAddress\":\"http://prometheus-main.prometheus.svc.cluster.local:9090/\",\"threshold\":\"10\"},\"type\":\"prometheus\"}]}}\n"}},"spec":{"minReplicaCount":1}}
to:
Resource: "keda.sh/v1alpha1, Resource=scaledobjects", GroupVersionKind: "keda.sh/v1alpha1, Kind=ScaledObject"
Name: "<redacted>", Namespace: "<redacted>"
for: "obj.yaml": error when patching "obj.yaml": Timeout: request did not complete within requested timeout - context deadline exceeded

This issue occurs both when syncing the ScaledObject from ArgoCD and when applying it directly with kubectl. We have 44 ScaledObjects in this application, of which ~40 are synced. The error appears only when we try to sync this specific application.

Our environment: KEDA version v2.11.2, GKE version 1.25.16-gke.1460000

I found another issue which resembles this but felt we should open a new issue due to the difference in environment: https://github.com/kedacore/keda/issues/5487

Expected Behavior

The scaled object should sync without issues.

Actual Behavior

The scaled object does not sync and fails.

Steps to Reproduce the Problem

  1. Upgrade KEDA from 2.6 to 2.11.2
  2. Sync the ScaledObject for an application from ArgoCD, or try applying the ScaledObject with kubectl.

KEDA Version

2.11.2

Kubernetes Version

< 1.26

Platform

Google Cloud

Scaler Details

External scaler - prometheus

AleksanderBrzozowski commented 4 months ago

@ghostx31 Have you figured out what can be causing this issue?

We observe similar behavior, and to be honest it is not clear to me which component throws the timeout error. Is it because Helm applies the change, but the validation webhook doesn't respond quickly enough? Or is it something different?

JorTurFer commented 4 months ago

Hello, did you try removing the SO and adding it again? I'm not really sure about the reason behind this, as the timeout is given by the cluster and not by KEDA.

AleksanderBrzozowski commented 4 months ago

@JorTurFer

Removing the SO and adding it again solves the issue, but it is not convenient to delete and re-add it every time we want to make a change 🙁

> as the timeout is given by the cluster and not by KEDA

This is the part that I don't understand. I am assuming that the timeout is given by the Kubernetes API Server, but what is the root cause? Is it the keda-admission webhook causing issues? I don't think so, the message would be different in case of webhook failure, something like this:

Internal error occurred: failed calling webhook ...

Any clues? 🙂
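To make the distinction above concrete, here is a minimal Python sketch that buckets an apply error by the two message shapes quoted in this thread. The function name and the bucket labels are made up for illustration, and matching on text only hints at which component produced the message; it does not identify a root cause.

```python
def classify_apply_error(message: str) -> str:
    """Bucket a kubectl/ArgoCD apply error by its text (labels are illustrative)."""
    text = message.lower()
    # Message shape quoted above for a failed admission webhook call.
    if "failed calling webhook" in text:
        return "webhook-call-failed"
    # Message shape reported in this issue.
    if "request did not complete within requested timeout" in text:
        return "request-timeout"
    return "other"


if __name__ == "__main__":
    reported = (
        'error when patching "obj.yaml": Timeout: request did not complete '
        "within requested timeout - context deadline exceeded"
    )
    print(classify_apply_error(reported))
```

A "request-timeout" result does not rule out an admission webhook being involved; it only says the message did not come in the "failed calling webhook" form.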

JorTurFer commented 4 months ago

Yeah, it's not a solution at all if you have to delete it all the time. After removing it, are you still unable to modify it? I mean, you've deleted it and it has worked, so now, can you update it or still not?

AleksanderBrzozowski commented 4 months ago

> Yeah, it's not a solution at all if you have to delete it all the time. After removing it, are you still unable to modify it? I mean, you've deleted it and it has worked, so now, can you update it or still not?

Even after deleting and adding it again, I am not able to update it. The same error is returned 🙂

ghostx31 commented 4 months ago

Hello @AleksanderBrzozowski, deleting the SO and then re-syncing it from Argo seems to solve it for us, but this is a bit of a hassle and not really a solution, since we need to delete and re-sync it every time we need to make a change.

AleksanderBrzozowski commented 4 months ago

@ghostx31 Yeah, so we have the same situation, and we are trying to find the root cause. Any clues as to what might be causing it? 🙂

JorTurFer commented 4 months ago

Could you share the ScaledObject that produces conflicts?

AleksanderBrzozowski commented 4 months ago

Yeah, here it is:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service
  namespace: my-namespace
spec:
  maxReplicaCount: 60
  minReplicaCount: 2
  pollingInterval: 10
  scaleTargetRef:
    name: my-service
  triggers:
    - metadata:
        metricName: RPS
        query: sum(rate(istio_requests_total{destination_workload_namespace="my-namespace",destination_workload="my-service",
          reporter="destination"}[1m]))
        serverAddress: http://prometheus-svc.prometheus-ns:9090
        threshold: "500"
      type: prometheus
    - metadata:
        metricName: Latency
        metricType: Value
        query: histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{kubernetes_namespace="my-namespace",
          app="my-service", reporter="destination"}[1m])) by (le))
        serverAddress: http://prometheus-svc.prometheus-ns:9090
        threshold: "50"
      type: prometheus

JorTurFer commented 3 months ago

Sorry for the delay, I've been quite busy these weeks.

Returning to your case, could there be an issue with the webhooks? Thinking about this, the control plane calls all the admission webhooks registered in the cluster (if they have registered for the resource). KEDA has its own admission webhook for validating the ScaledObject; do you see any error in it?

You can try disabling the admission webhook temporarily by removing the ValidatingWebhookConfiguration. If you remove it, does it work?
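Before removing anything, it may help to see which validating webhooks the control plane would call for a ScaledObject. The following is a rough Python sketch, not an official tool: it assumes its input is the parsed JSON of `kubectl get validatingwebhookconfigurations -o json`, it only looks at the `apiGroups` and `resources` fields of each rule, and the function name and the sample data are hypothetical.

```python
def webhooks_covering_scaledobjects(config_list: dict) -> list[dict]:
    """Return webhooks whose rules match ScaledObjects in the keda.sh API group.

    `config_list` is assumed to be the parsed JSON list returned by
    `kubectl get validatingwebhookconfigurations -o json`.
    """
    matches = []
    for config in config_list.get("items", []):
        for webhook in config.get("webhooks", []):
            for rule in webhook.get("rules", []):
                groups = rule.get("apiGroups", [])
                resources = rule.get("resources", [])
                group_ok = "*" in groups or "keda.sh" in groups
                # Simplification: only whole-resource patterns are checked;
                # operations, scope, and selectors are ignored.
                resource_ok = any(
                    r in ("*", "*/*", "scaledobjects") for r in resources
                )
                if group_ok and resource_ok:
                    matches.append({
                        "configuration": config["metadata"]["name"],
                        "webhook": webhook["name"],
                        "failurePolicy": webhook.get("failurePolicy"),
                        "timeoutSeconds": webhook.get("timeoutSeconds"),
                    })
                    break  # one matching rule is enough for this webhook
    return matches


if __name__ == "__main__":
    # Hypothetical sample; real data would come from kubectl.
    sample = {"items": [{
        "metadata": {"name": "example-admission"},
        "webhooks": [{
            "name": "validate.example.io",
            "failurePolicy": "Ignore",
            "timeoutSeconds": 10,
            "rules": [{"apiGroups": ["keda.sh"], "resources": ["scaledobjects"]}],
        }],
    }]}
    for entry in webhooks_covering_scaledobjects(sample):
        print(entry)
```

Each returned entry shows the webhook's `failurePolicy` and `timeoutSeconds`, which are the settings that govern how a slow or failing webhook affects the request. Mutating webhooks live in a separate resource type and are not covered by this sketch.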

AleksanderBrzozowski commented 3 months ago

@JorTurFer

> Sorry for the delay, I've been quite busy these weeks.

No worries 🙂

> You can try disabling the admission webhook temporarily by removing the ValidatingWebhookConfiguration. If you remove it, does it work?

Yeah, we are aware of the webhook; we should try disabling it to see if it helps. What does the webhook do under the hood?

JorTurFer commented 3 months ago

Basically, a few calls to the control plane to get some extra info, like other HPAs and the workload manifest, to validate the ScaledObject information (preventing collisions on HPAs, wrong CPU/memory config, etc.).
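As an illustration of what an HPA collision check could look like, here is a small Python sketch. It is not KEDA's actual implementation. It assumes an omitted `scaleTargetRef.kind` means a Deployment, that the HPA managed for a ScaledObject is named `keda-hpa-<name>`, and that comparing kind and name in the same namespace is enough; the function name is hypothetical.

```python
def find_hpa_collisions(scaled_object: dict, hpas: list[dict]) -> list[str]:
    """Return names of HPAs that target the same workload as the ScaledObject."""
    so_meta = scaled_object["metadata"]
    target = scaled_object["spec"]["scaleTargetRef"]
    # Assumption: an omitted kind defaults to Deployment.
    target_kind = target.get("kind", "Deployment")
    # Assumption: the HPA managed for this ScaledObject uses the default name.
    managed_hpa = f"keda-hpa-{so_meta['name']}"

    collisions = []
    for hpa in hpas:
        meta = hpa["metadata"]
        ref = hpa["spec"]["scaleTargetRef"]
        if meta.get("namespace") != so_meta.get("namespace"):
            continue  # HPAs only act on workloads in their own namespace
        if meta["name"] == managed_hpa:
            continue  # the ScaledObject's own HPA is not a collision
        if ref.get("kind") == target_kind and ref.get("name") == target["name"]:
            collisions.append(meta["name"])
    return collisions
```

With the ScaledObject shared earlier in this thread (`my-service` in `my-namespace`), any HPA in that namespace other than `keda-hpa-my-service` that targets the `my-service` Deployment would be reported. A check like this needs the list of HPAs in the namespace, which is one of the control plane calls mentioned above.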

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 1 month ago

This issue has been automatically closed due to inactivity.