strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: Reconciliation occurs repeatedly #10576

Closed: wSedlacek closed this issue 4 weeks ago

wSedlacek commented 4 weeks ago

Bug Description

This appears to be a regression between 0.41.0 and 0.43.0: after upgrading, we noticed API server calls from Strimzi spike drastically, with get, watch, and patch methods on the Kafka resource.

Steps to reproduce

  1. Deploy a Kafka resource
  2. Monitor the operator logs or API server audit logs for repeated reconciliation (see the sketch below)
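
A hedged way to watch for the loop, assuming the default operator Deployment name strimzi-cluster-operator and that it runs in a namespace called strimzi (adjust both for your installation):

# Stream the operator log and keep only reconciliation lines; in the broken
# state this prints a new "will be checked" line every second or so.
kubectl logs deployment/strimzi-cluster-operator -n strimzi -f | grep Reconciliation

# Count reconciliations over the last minute; a handful is normal for a
# single resource, hundreds is not.
kubectl logs deployment/strimzi-cluster-operator -n strimzi --since=1m | grep -c 'will be checked'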

Expected behavior

Reconciliation should not occur repeatedly for an unchanged resource.

Strimzi version

0.43.0

Kubernetes version

1.29.4

Installation method

Helm Chart

Infrastructure

EKS

Configuration files and logs

2024-09-12 19:26:29 INFO  AbstractOperator:264 - Reconciliation #4622(watch) Kafka(prod/eventing): Kafka eventing will be checked for creation or modification
2024-09-12 19:26:30 INFO  AbstractOperator:510 - Reconciliation #4630(watch) Kafka(prod/eventing): Kafka eventing in namespace prod was MODIFIED
2024-09-12 19:26:30 INFO  CrdOperator:123 - Reconciliation #4622(watch) Kafka(prod/eventing): Status of Kafka eventing in namespace prod has been updated
2024-09-12 19:26:30 INFO  AbstractOperator:264 - Reconciliation #4623(watch) Kafka(prod/eventing): Kafka eventing will be checked for creation or modification
2024-09-12 19:26:30 INFO  AbstractOperator:536 - Reconciliation #4622(watch) Kafka(prod/eventing): reconciled
2024-09-12 19:26:31 INFO  CrdOperator:123 - Reconciliation #4623(watch) Kafka(prod/eventing): Status of Kafka eventing in namespace prod has been updated
2024-09-12 19:26:31 INFO  AbstractOperator:510 - Reconciliation #4631(watch) Kafka(prod/eventing): Kafka eventing in namespace prod was MODIFIED
2024-09-12 19:26:31 INFO  AbstractOperator:536 - Reconciliation #4623(watch) Kafka(prod/eventing): reconciled
2024-09-12 19:26:31 INFO  AbstractOperator:264 - Reconciliation #4624(watch) Kafka(prod/eventing): Kafka eventing will be checked for creation or modification
2024-09-12 19:26:32 INFO  CrdOperator:123 - Reconciliation #4624(watch) Kafka(prod/eventing): Status of Kafka eventing in namespace prod has been updated
2024-09-12 19:26:32 INFO  AbstractOperator:510 - Reconciliation #4632(watch) Kafka(prod/eventing): Kafka eventing in namespace prod was MODIFIED
2024-09-12 19:26:32 INFO  AbstractOperator:536 - Reconciliation #4624(watch) Kafka(prod/eventing): reconciled
2024-09-12 19:26:32 INFO  AbstractOperator:264 - Reconciliation #4625(watch) Kafka(prod/eventing): Kafka eventing will be checked for creation or modification
2024-09-12 19:26:33 INFO  AbstractOperator:510 - Reconciliation #4633(watch) Kafka(prod/eventing): Kafka eventing in namespace prod was MODIFIED
2024-09-12 19:26:33 INFO  CrdOperator:123 - Reconciliation #4625(watch) Kafka(prod/eventing): Status of Kafka eventing in namespace prod has been updated
2024-09-12 19:26:33 INFO  AbstractOperator:536 - Reconciliation #4625(watch) Kafka(prod/eventing): reconciled
2024-09-12 19:26:33 INFO  AbstractOperator:264 - Reconciliation #4626(watch) Kafka(prod/eventing): Kafka eventing will be checked for creation or modification

Additional context

This occurs in our live clusters but not locally when running with k3d. Perhaps it is latency related, or specific to certain cloud providers?

scholzj commented 4 weeks ago

This looks like something is triggering a constant change of the resource. We can have a look into it, but you would need to provide more information. Probably one of these:

wSedlacek commented 4 weeks ago

So the only things changing between the loops are the lastTransitionTime of the Ready condition and the resourceVersion.
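
One way to observe this churn from outside the operator is to watch the Kafka resource itself; a minimal sketch, assuming the resource name and namespace from this report (kubectl re-evaluates the jsonpath template on every watch event):

# Print the resourceVersion and the Ready condition's lastTransitionTime on
# each watch event; a fresh line every second or two confirms the loop.
kubectl get kafka eventing -n prod -w \
  -o jsonpath='{.metadata.resourceVersion} {.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'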

[Screenshot: side-by-side diff of the resource, showing only lastTransitionTime and resourceVersion differing between reconciliations]

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  annotations:
    meta.helm.sh/release-name: eventing
    meta.helm.sh/release-namespace: prod
    strimzi.io/kraft: enabled
    strimzi.io/node-pools: enabled
  creationTimestamp: '2024-09-10T19:50:40Z'
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: eventing
  namespace: prod
  resourceVersion: '130238924'
  uid: 63e43c07-4c6e-467c-bbf9-ca134bb4b57d
spec:
  cruiseControl: {}
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      default.replication.factor: 1
      message.max.bytes: 18874368
      min.insync.replicas: 1
      offsets.topic.replication.factor: 1
      replica.fetch.max.bytes: 18874368
      transaction.state.log.min.isr: 1
      transaction.state.log.replication.factor: 1
    listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
    version: 3.7.0
status:
  clusterId: isVkwSVdR7W2DtC0uuwAGQ
  conditions:
    - lastTransitionTime: '2024-09-12T19:40:23.620988196Z'
      status: 'True'
      type: Ready
  kafkaMetadataState: KRaft
  kafkaMetadataVersion: 3.7-IV4
  kafkaNodePools:
    - name: eventing
  kafkaVersion: 3.7.0
  listeners:
    - addresses:
        - host: eventing-kafka-bootstrap.prod.svc
          port: 9092
      bootstrapServers: eventing-kafka-bootstrap.prod.svc:9092
      name: plain
    - addresses:
        - host: eventing-kafka-bootstrap.prod.svc
          port: 9093
      bootstrapServers: eventing-kafka-bootstrap.prod.svc:9093
      certificates:
        - omitted
      name: tls
  observedGeneration: 1
  operatorLastSuccessfulVersion: 0.43.0

scholzj commented 4 weeks ago

Ahh ... did you install the new CRDs when you upgraded to 0.43.0?

wSedlacek commented 4 weeks ago

I think that was it! I had updated the Helm chart but forgot that the CRDs are not updated alongside the chart. Things seem much more stable now! Thank you so much!

muriloecfaria commented 3 weeks ago

Ahh ... did you install the new CRDs when you upgraded to 0.43.0?

How do I install the new CRDs? I installed via Helm, and I have the same issue.

scholzj commented 3 weeks ago

@muriloecfaria You can get the YAML with the CRDs from the GitHub release page. For 0.43, that would be, for example, here: https://github.com/strimzi/strimzi-kafka-operator/releases/tag/0.43.0. They are also in the GitHub repo itself. You can just run kubectl apply -f on the YAML file.
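
A minimal sketch of that workflow; the exact file names are assumptions, so check the release page or the repo (the CRDs live under install/cluster-operator/, in files named like 040-Crd-kafka.yaml):

# Apply the Kafka CRD straight from the 0.43.0 tag of the repo; repeat for
# the other Crd files for the resource kinds you use.
kubectl apply -f https://raw.githubusercontent.com/strimzi/strimzi-kafka-operator/0.43.0/install/cluster-operator/040-Crd-kafka.yaml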

They are also in the Helm chart, but Helm does not upgrade the CRDs, and I do not use Helm myself, so I am not sure if there is some simple way to extract them from the Helm chart archive.
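
For Helm specifically: assuming the chart keeps its CRDs in the standard crds/ directory (which would explain why Helm installs them on first install but never upgrades them), a hedged way to pull them out and apply them is:

# Print the CRDs bundled with the chart and apply them; assumes the chart
# repository is already added as "strimzi" and that helm show crds is
# available in your Helm version.
helm show crds strimzi/strimzi-kafka-operator --version 0.43.0 | kubectl apply -f -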