If those changed limits solve the restart issues, then they need to be persisted by changing the configuration of the knative-eventing-kafka-channel-controller in the knative-eventing namespace:
> kubectl edit deployment -n knative-eventing knative-eventing-kafka-channel-controller
spec:
  ...
  template:
    ...
    spec:
      containers:
      - env:
        ...
        - name: DISPATCHER_CPU_REQUEST
          value: 300m
        - name: DISPATCHER_CPU_LIMIT
          value: 500m
        - name: DISPATCHER_MEMORY_REQUEST
          value: 50Mi
        - name: DISPATCHER_MEMORY_LIMIT
          value: 128Mi
128Mi is the current default value of DISPATCHER_MEMORY_LIMIT. Please adapt DISPATCHER_MEMORY_LIMIT to the values that worked for your workload.
Please keep in mind: if you set the DISPATCHER_MEMORY_REQUEST value too high, you can run into Kubernetes scheduling issues; if DISPATCHER_MEMORY_LIMIT is too low, the dispatcher can be OOMKilled again. For the current setup, values of ~600Mi should suffice.
Workaround delivered (updated channel-controller with configurable forced reconcile, plus updated limits as shown above).
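If you prefer not to edit the deployment interactively, the same environment variable can be set directly; a minimal sketch using kubectl set env, assuming the controller name and namespace above and an example limit of 600Mi (adjust to your workload):
> kubectl set env deployment/knative-eventing-kafka-channel-controller -n knative-eventing DISPATCHER_MEMORY_LIMIT=600Mi
The channel controller should then apply the updated limit when it next reconciles the dispatcher deployment.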
Description
Kyma version: 1.13.0 in Azure K8s
Eventing config: Kafka-based with Azure Event Hub
Eventing source: CCv2
Events (order.created etc.) were dispatched really slowly to the subscriber (Order Data Service)
the pod default-kne-trigger-dgl-p1-dispatcher in the knative-eventing namespace was in CrashLoopBackOff with many restarts (every couple of seconds)
pod status: Readiness probe failed: HTTP probe failed with statuscode: 500
after a manual restart, we noticed an OOMKill
image version: knative-kafka-dispatcher:v0.12.2
see the memory usage of the affected pod (consistently over the limit, most likely causing the restarts)
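The crash loop and the OOMKills described above can be confirmed with standard kubectl commands; a minimal sketch, assuming metrics-server is available for kubectl top and that the affected pod (possibly with a replica-set hash suffix) is the one named in the description:
> kubectl get pods -n knative-eventing
> kubectl describe pod -n knative-eventing default-kne-trigger-dgl-p1-dispatcher
> kubectl top pod -n knative-eventing default-kne-trigger-dgl-p1-dispatcher
kubectl describe shows the restart count and a Last State of Terminated with reason OOMKilled, while kubectl top shows the current memory usage relative to the 128Mi limit.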
Expected result
Steps to reproduce
Troubleshooting
Solution so far
After changing the dispatcher's memory limit as described above, the subscriber received traffic at a reasonable rate again.
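To verify that the change took effect, you can check that the new resource values were propagated and that the restart count stops growing; a minimal sketch, assuming the dispatcher Deployment carries the same name as the pod in the description:
> kubectl get deployment -n knative-eventing default-kne-trigger-dgl-p1-dispatcher -o jsonpath='{.spec.template.spec.containers[0].resources}'
> kubectl get pods -n knative-eventing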