strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: Connect Pods Keeps Getting Restarted #9361

Closed erezblm closed 11 months ago

erezblm commented 11 months ago

Bug Description

Hi,

My 2 Connect cluster pods keep getting restarted every 10-20 minutes.

After I managed to store the logs locally, I was able to see the logs of the previously terminated pods, but they don't seem to contain any errors, so I assume that the operator just keeps restarting them for some reason.

I'm running a Connect cluster with 2 replicas and multiple connectors of 2 kinds (MQTT source and ClickHouse sink), each with multiple tasks.

I would appreciate some help figuring out how to even debug this, because I couldn't tell from the operator logs exactly where the restart occurred.

Attached are the operator debug logs (the Connect logs didn't seem to have anything interesting, but I can upload them as well).

Steps to reproduce

No response

Expected behavior

Connect pods keep running without being restarted

Strimzi version

0.37

Kubernetes version

1.26.6

Installation method

Terraform Provider

Infrastructure

No response

Configuration files and logs

Connect spec + status (taken from edit):

Name:         connect-cluster
Namespace:    kafka
Labels:       <none>
Annotations:  strimzi.io/use-connector-resources: true
API Version:  kafka.strimzi.io/v1beta2
Kind:         KafkaConnect
Metadata:
  Creation Timestamp:  2023-11-05T07:13:45Z
  Generation:          2
  Resource Version:    25965037
  UID:                 6e4fd6ef-248c-4795-9cde-dc496819e54f
Spec:
  Bootstrap Servers:  kafka-cluster-kafka-bootstrap:9092
  Config:
    config.storage.replication.factor:        3
    connector.client.config.override.policy:  All
    group.id:                                 kafka-cluster
    key.converter:                            org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable:             true
    offset.storage.replication.factor:        3
    status.storage.replication.factor:        3
    value.converter:                          org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable:           true
  Image:                                      gl.tigoenergy.com:5050/tigo/core/strimzi-kafka-connectors:latest
  Liveness Probe:
    Failure Threshold:  15
    Period Seconds:     10
  Logging:
    Loggers:
      log4j.logger.com.datamountaineer.streamreactor:  INFO
      log4j.logger.io.lenses.streamreactor:            INFO
    Type:                                              inline
  Metrics Config:
    Type:  jmxPrometheusExporter
    Value From:
      Config Map Key Ref:
        Key:   connect-metrics-config.yaml
        Name:  connect-metrics
  Rack:
    Topology Key:  topology.kubernetes.io/zone
  Readiness Probe:
    Failure Threshold:  15
    Period Seconds:     10
  Replicas:             2
  Resources:
    Limits:
      Memory:  6Gi
    Requests:
      Cpu:     200m
      Memory:  2Gi
  Template:
    Pod:
      Affinity:
        Node Affinity:
          Required During Scheduling Ignored During Execution:
            Node Selector Terms:
              Match Expressions:
                Key:       app
                Operator:  In
                Values:
                  kafka-consumers
        Pod Anti Affinity:
          Required During Scheduling Ignored During Execution:
            Label Selector:
              Match Expressions:
                Key:       strimzi.io/name
                Operator:  In
                Values:
                  connect-cluster-connect
            Topology Key:  kubernetes.io/hostname
      Image Pull Secrets:
        Name:  docker-registry
      Tolerations:
        Effect:    NoSchedule
        Key:       kubernetes.azure.com/scalesetpriority
        Operator:  Equal
        Value:     spot
  Tls:
    Trusted Certificates:
      Certificate:  ca.crt
      Secret Name:  kafka-cluster-cluster-ca-cert
Status:
  Conditions:
    Last Transition Time:  2023-11-17T20:39:40.245803762Z
    Status:                True
    Type:                  Ready
  Connector Plugins:
    Class:              com.clickhouse.kafka.connect.ClickHouseSinkConnector
    Type:               sink
    Version:            0.0.1
    Class:              com.datamountaineer.streamreactor.connect.mqtt.sink.MqttSinkConnector
    Type:               sink
    Version:            4.2.0-6-ged76a238
    Class:              com.datamountaineer.streamreactor.connect.mqtt.source.MqttSourceConnector
    Type:               source
    Version:            4.2.0-6-ged76a238
    Class:              org.apache.kafka.connect.mirror.MirrorCheckpointConnector
    Type:               source
    Version:            3.4.0
    Class:              org.apache.kafka.connect.mirror.MirrorHeartbeatConnector
    Type:               source
    Version:            3.4.0
    Class:              org.apache.kafka.connect.mirror.MirrorSourceConnector
    Type:               source
    Version:            3.4.0
  Label Selector:       strimzi.io/cluster=connect-cluster,strimzi.io/name=connect-cluster-connect,strimzi.io/kind=KafkaConnect
  Observed Generation:  2
  Replicas:             2
  URL:                  http://connect-cluster-connect-api.kafka.svc:8083
Events:                 <none>

Example of one of the connectors:

Name:         datasource-panels-min
Namespace:    kafka
Labels:       k8slens-edit-resource-version=v1beta2
              strimzi.io/cluster=connect-cluster
Annotations:  <none>
API Version:  kafka.strimzi.io/v1beta2
Kind:         KafkaConnector
Metadata:
  Creation Timestamp:  2023-11-05T07:14:59Z
  Generation:          4
  Resource Version:    27009030
  UID:                 c808a8ca-7503-4b11-9c23-81ecb6caf420
Spec:
  Auto Restart:
    Enabled:  true
  Class:      com.clickhouse.kafka.connect.ClickHouseSinkConnector
  Config:
    consumer.override.allow.auto.create.topics:   false
    consumer.override.fetch.max.wait.ms:          8000
    consumer.override.fetch.min.bytes:            3000000
    consumer.override.max.partition.fetch.bytes:  50000000
    consumer.override.max.poll.records:           100000
    Database:                                     #HIDDEN#
    errors.retry.timeout:                         30
    Exactly Once:                                 true
    Hostname:                                     #HIDDEN#
    key.converter:                                org.apache.kafka.connect.storage.StringConverter
    Password:                                     #HIDDEN#
    Port:                                         8123
    schemas.enable:                               false
    Ssl:                                          false
    Topics:                                       #HIDDEN#
    Username:                                     #HIDDEN#
    value.converter:                              org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable:               false
  Tasks Max:                                      5
Status:
  Conditions:
    Last Transition Time:  2023-11-19T11:37:54.589582430Z
    Status:                True
    Type:                  Ready
  Connector Status:
    Connector:
      State:      RUNNING
      worker_id:  connect-cluster-connect-1.connect-cluster-connect.kafka.svc:8083
    Name:         datasource-panels-min
    Tasks:
      Id:               0
      State:            RUNNING
      worker_id:        connect-cluster-connect-1.connect-cluster-connect.kafka.svc:8083
      Id:               1
      State:            RUNNING
      worker_id:        connect-cluster-connect-1.connect-cluster-connect.kafka.svc:8083
      Id:               2
      State:            RUNNING
      worker_id:        connect-cluster-connect-1.connect-cluster-connect.kafka.svc:8083
      Id:               3
      State:            RUNNING
      worker_id:        connect-cluster-connect-1.connect-cluster-connect.kafka.svc:8083
      Id:               4
      State:            RUNNING
      worker_id:        connect-cluster-connect-1.connect-cluster-connect.kafka.svc:8083
    Type:               sink
  Observed Generation:  4
  Tasks Max:            5
  Topics:
    datasource_panels_min
Events:  <none>

Operator debug logs (restart occurred around 10:40): operatorlogs.txt

Connect pods' last logs before termination: connect-1.log connect-0.log

Additional context

No response

scholzj commented 11 months ago

The operator log seems to not contain any rolling of the Kafka Connect pods ... if you look for KafkaConnectRoller logs, it always seems to be saying the pods do not need to be rolled.

The Connect logs seem to suggest the pods were stopped. But since their logs do not seem to overlap with the operator log in terms of time, it is hard to say whether it was the operator or not. It could also have been something else. If it is really the operator doing this, you would need to provide logs that overlap and cover the situation from both ends. You can also check the Kubernetes events, which might suggest that something else stopped the Connect pods.
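Besides events, another place to look is the pod status itself. The following is only a minimal sketch in Python, assuming the pod object has been exported as JSON (for example with `kubectl get pod connect-cluster-connect-0 -n kafka -o json`). The helper name `last_terminations` and the sample values in `SAMPLE_POD` are hypothetical and are not taken from this issue; only the field names (`containerStatuses`, `lastState.terminated`, `reason`, `exitCode`, `finishedAt`, `restartCount`) are standard Kubernetes pod status fields.

```python
import json


def last_terminations(pod: dict) -> list[dict]:
    """Collect the last terminated state of each container in a Pod object.

    Containers that have never been restarted inside this pod have an
    empty lastState and are skipped.
    """
    results = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated:
            results.append({
                "container": cs.get("name"),
                "restartCount": cs.get("restartCount"),
                "reason": terminated.get("reason"),
                "exitCode": terminated.get("exitCode"),
                "finishedAt": terminated.get("finishedAt"),
            })
    return results


# Hypothetical sample for illustration only; these values are invented
# and do not come from the logs attached to this issue.
SAMPLE_POD = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "connect-cluster-connect",
        "restartCount": 3,
        "lastState": {
          "terminated": {
            "reason": "OOMKilled",
            "exitCode": 137,
            "finishedAt": "2023-11-19T10:40:12Z"
          }
        }
      }
    ]
  }
}
""")

terminations = last_terminations(SAMPLE_POD)
```

If the list comes back empty and `restartCount` is 0 even though the pod is young, the pod was probably deleted and recreated rather than having its container restarted in place, which points to a different cause than a failing probe or a memory limit.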

erezblm commented 11 months ago

Thanks, I'll try to add the overlapping logs tomorrow. I don't think it's a rollout, because the pods are terminated separately and not immediately one after the other. I thought it might be related to 'enableRestart', but I couldn't find any errors, and I would expect it to restart just the tasks and not the whole pod.

scholzj commented 11 months ago

What do you mean by enableRestart?

scholzj commented 11 months ago

Discussed on the community call on 30.11.2023: Can you please clarify what exactly you meant by the enableRestart reference? Otherwise, there does not seem to be much more we can do about this based on the information we have, and we will close it.

scholzj commented 11 months ago

Discussed in the community call on 14.12.: No more information received since last time. We are going to close it. Feel free to reopen it or start a discussion if you can provide more details.