strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0
4.84k stars 1.29k forks source link

Kafka-operator watch for failed "reconciliation" #8166

Closed SagdeevRR closed 1 year ago

SagdeevRR commented 1 year ago

Describe the bug Kafka-operator waiting for the end failed reconciliation MirrorMaker2 task.

This problem occurred when the node with the mirrormaker pod was restarted. The reconciliation continues indefinitely until it is restarted kafka-operator pod. And until this time mirrormaker doesn't work.

Expected behavior I think it's better to stop the running task and create a new one.

Environment:

YAML files and logs

Strimzi operator logs:

Feb 27, 2023 @ 09:19:48.163 | 2023-02-27 06:19:41 INFO  AbstractOperator:510 - Reconciliation #1805(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): reconciled
-- | --
  | Feb 27, 2023 @ 09:19:48.163 | 2023-02-27 06:19:41 INFO  AbstractOperator:239 - Reconciliation #1806(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): KafkaMirrorMaker2 mirrormaker2-kafka-mirrormaker-mirrormaker2 will be checked for creation or modification

  | Feb 27, 2023 @ 09:20:48.185 | 2023-02-27 06:20:41 INFO  AbstractOperator:380 - Reconciliation #1806(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): Reconciliation is in progress

  | Feb 27, 2023 @ 09:21:08.192 | 2023-02-27 06:21:07 WARN  KafkaMirrorMaker2AssemblyOperator:521 - Reconciliation #1806(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): Error reconciling MirrorMaker 2.0 mirrormaker2-kafka-mirrormaker-mirrormaker2

  | Feb 27, 2023 @ 09:21:08.192 | io.netty.channel.ConnectTimeoutException: connection timed out: mirrormaker2-kafka-mirrormaker-mirrormaker2-mirrormaker2-api.kafka-mirror.svc.cluster.local/10.98.130.219:8083

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]

  | Feb 27, 2023 @ 09:21:08.192 | at java.lang.Thread.run(Thread.java:833) ~[?:?]

  | Feb 27, 2023 @ 09:21:08.192 | 2023-02-27 06:21:07 INFO  CrdOperator:133 - Reconciliation #1806(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): Status of KafkaMirrorMaker2 mirrormaker2-kafka-mirrormaker-mirrormaker2 in namespace kafka-mirror has been updated

  | Feb 27, 2023 @ 09:21:08.192 | 2023-02-27 06:21:07 INFO  OperatorWatcher:38 - Reconciliation #1807(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): KafkaMirrorMaker2 mirrormaker2-kafka-mirrormaker-mirrormaker2 in namespace kafka-mirror was MODIFIED

  | Feb 27, 2023 @ 09:21:18.195 | 2023-02-27 06:21:12 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace *

  | Feb 27, 2023 @ 09:21:48.205 | 2023-02-27 06:21:41 INFO  AbstractOperator:380 - Reconciliation #1806(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): Reconciliation is in progress

  | Feb 27, 2023 @ 09:22:48.227 | 2023-02-27 06:22:41 INFO  AbstractOperator:380 - Reconciliation #1806(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): Reconciliation is in progress

Feb 27, 2023 @ 09:23:48.249 | 2023-02-27 06:23:41 INFO  AbstractOperator:380 - Reconciliation #1806(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): Reconciliation is in progress
-- | --
scholzj commented 1 year ago

What is the actual bug you want to report? You submitted just the template.

SagdeevRR commented 1 year ago

@scholzj Sorry, i added discription

scholzj commented 1 year ago

I think you would need to share the full log from the operator and MM2 and your configurations of Mirror Maker.

scholzj commented 1 year ago

Triaged on 9.3.2023: Not much to do about this without the logs and configuration. @SagdeevRR Please share the full logs and custom resources.

SagdeevRR commented 1 year ago

KafkaMirrorMaker2 configurations:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  annotations:
    meta.helm.sh/release-name: kafka-mirrormaker
    meta.helm.sh/release-namespace: kafka-mirror
    strimzi.io/pause-reconciliation: "false"
  creationTimestamp: "2023-02-20T16:12:26Z"
  generation: 12
  labels:
    app: mirrormaker2-kafka-mirrormaker-mirrormaker2
    app.kubernetes.io/component: mirrorMaker2
    app.kubernetes.io/instance: mirrormaker2-kafka-mirrormaker
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mirrormaker2
    app.kubernetes.io/repo-branch: unknown
    app.kubernetes.io/repo-last-commit: unknown
    app.kubernetes.io/repo-maintainer: unknown
    app.kubernetes.io/repo-name: unknown
    app.kubernetes.io/version: 0.1.0
    helm.sh/chart: mirrormaker2-0.1.0
  name: mirrormaker2-kafka-mirrormaker-mirrormaker2
  namespace: kafka-mirror
  resourceVersion: "575083077"
  uid: a4c1bdf9-0758-4ff9-8465-b08a3b9ec034
spec:
  clusters:
  - alias: mcs-kafka
    bootstrapServers: 10.0.0.10:6667
  - alias: yc-common-dev
    authentication:
      passwordSecret:
        password: YC_COMMON_DEV
        secretName: users-secret
      type: scram-sha-512
      username: admin-cloud
    bootstrapServers: rc1a-di2dbm26l5dcn99c.mdb.yandexcloud.net:9091
    config:
      config.storage.replication.factor: -1
      offset.storage.replication.factor: -1
      status.storage.replication.factor: -1
    tls:
      trustedCertificates:
      - certificate: YANDEX_TLS_CERT
        secretName: yandex-crt-secret
  connectCluster: yc-common-dev
  metricsConfig:
    type: jmxPrometheusExporter
    valueFrom:
      configMapKeyRef:
        key: metrics-config.yml
        name: mirrormaker2-kafka-mirrormaker-mirrormaker2-metrics
  mirrors:
  - checkpointConnector:
      autoRestart:
        enabled: true
      config:
        checkpoints.topic.replication.factor: 1
        refresh.groups.interval.seconds: 60
        replication.policy.class: io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy
        sync.group.offsets.enabled: "true"
        sync.group.offsets.interval.seconds: 1
    groupsPattern: ""
    heartbeatConnector:
      autoRestart:
        enabled: true
    sourceCluster: mcs-kafka
    sourceConnector:
      autoRestart:
        enabled: true
      config:
        offset-syncs.topic.replication.factor: 1
        replication.factor: 1
        replication.policy.class: io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy
        sync.topic.acls.enabled: "false"
      pause: false
      tasksMax: 10
    targetCluster: yc-common-dev
    topicsPattern: analytics-registration
  - checkpointConnector:
      autoRestart:
        enabled: true
      config:
        checkpoints.topic.replication.factor: 1
        refresh.groups.interval.seconds: 5
        replication.policy.class: io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy
        sync.group.offsets.enabled: "true"
        sync.group.offsets.interval.seconds: 1
    groupsPattern: dev-api-service-.*|payment-service-.*|np-consumer-group_test|qc-triggers-group_test|account-dev-.*|auction-service|buyer-reader|django-admin-panel-group|sku-sync-processor|wms.*|arka-integration-.*|feedback-automoderator-test|temp-identifier-consumer|finance-service|foodback-service|seller-notification-service|sku-sync-processor
    heartbeatConnector:
      autoRestart:
        enabled: true
    sourceCluster: mcs-kafka
    sourceConnector:
      autoRestart:
        enabled: true
      config:
        offset-syncs.topic.replication.factor: 1
        replication.factor: 1
        replication.policy.class: io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy
        sync.topic.acls.enabled: "false"
      pause: false
      tasksMax: 50
    targetCluster: yc-common-dev
    topicsPattern: accepted_invoice_test|analytics_buyers_internal_test|analytics-buyers_test|analytics_cart-update_internal_test|analytics_details-update_internal_test|analytics_new-buyers_internal_test|analytics_order-canceled_internal_test|analytics_order-completed_internal_test|analytics_order-created_internal_test|analytics_order-not-created_internal_test|analytics_order-ready_internal_test|analytics_purchases_internal_test|analytics_refunds_internal_test|analytics_registration_internal_test|analytics_settings-update_internal_test|analytics_wish-list_internal_test|bundle_complete_test_dev|bundle_complete_test_dev_test|calendar_event_activate_test_dev|calendar_event_activate_test_dev_test|calendat_event_complete_test_dev|calendat_event_complete_test_dev_test|cancel_order_test_dev|cancel_order_test_dev_test|category_stat_update_test|discount_test|dshop_offer_update_test|dshop_product_sync_test|edo_processing_test|email_test|finance-bonus-decrease-quartz-cron-triggers_test|finance-bonus-decrease-quartz-cron-triggers|finance-bonus-increase-quartz-cron-triggers_test|finance-bonus-increase-quartz-cron-triggers|foodback-quartz-cron-triggers_test|foodback_test|identifier_event_test|invoice_status_test|notification_test|order_promo_test|order_return_test|order_status_test|order_status_test.DLT|order_storage_date_test|payment_retry_test|payment-service-poll-operation_test|payment_status_test|permissions_test|product_field_test|product_stat_update_sync_test|psb_sync_test|push_fast_test|push_slow_test|photo_sku_reservation_test|schedule_auction_test|schedule_test|schedule_time_slot_test|seller_discount_activate_test_dev|seller_discount_activate_test_dev.DLT|seller_discount_activate_test_dev_test|seller_discount_complete_test_dev|seller_discount_complete_test_dev.DLT|seller_discount_complete_test_dev_test|seller_finance_notification_test|seller_leftover_notification_test|seller_new_notification_test|seller-notification-feedback-count-without-reply_test|seller-notification-new-reviews_test|seller_return_notification_test|seller_return_test|seller_sync_test|shop_employee_test|sku_stat_update_sync_test|sku_stat_update_test|sku_update_test|sku_wms_update_test|subscription_test|timeslot_reserved_capacity_test|timeslot_reserved_capacity_test.DLT|finance-debit-decrease-quartz-cron-triggers|finance-debit-decrease-quartz-cron-triggers_test|finance-debit-increase-quartz-cron-triggers|finance-debit-increase-quartz-cron-triggers_test|auction_winners_test|auction_winners_test.DLT|schedule_auction_ad_marker_test|schedule_auction_ad_marker_test.DLT|schedule_auction_end_test|schedule_auction_end_test.DLT|schedule_auction_payment_test|schedule_auction_payment_test.DLT|schedule_auction_promotion_end_test|schedule_auction_promotion_end_test.DLT|schedule_auction_promotion_start_test|schedule_auction_promotion_start_test.DLT|schedule_auction_winners_test|schedule_auction_winners_test.DLT|schedule_product_cache_evict_test|foodback-rating_test|foodback-rating_test.DLT|product_stat_update_test|auction_outbox_processing_test|product_cache_evict_test|sku_price_update_test|sku_price_update_sync_test|
  - checkpointConnector:
      autoRestart:
        enabled: true
      config:
        checkpoints.topic.replication.factor: 1
        refresh.groups.interval.seconds: 5
        replication.policy.class: io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy
        sync.group.offsets.enabled: "true"
        sync.group.offsets.interval.seconds: 1
    groupsPattern: wms-acceptance-.*|arka-integration-.*|wms-order-.*|wms-courier-.*|dev-api-service-.*|wms-postamat-.*|wms-assembly-.*|wms-delivery-.*
    heartbeatConnector:
      autoRestart:
        enabled: true
    sourceCluster: mcs-kafka
    sourceConnector:
      autoRestart:
        enabled: true
      config:
        offset-syncs.topic.replication.factor: 1
        replication.factor: 1
        replication.policy.class: io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy
        sync.topic.acls.enabled: "false"
      pause: false
      tasksMax: 50
    targetCluster: yc-common-dev
    topicsPattern: seller_test|seller_test.DLT|wms_order_status_test|wms_order_status_test.DLT|order_status_test|order_status_test.DLT|accepted_invoice_test|accepted_invoice_test.DLT|invoice_test|invoice_test.DLT|sku_test|sku_test.DLT|sku_update_test|order_return_test|order_return_test.DLT|seller_return_test|seller_return_test.DLT|assembly_status_test|assembly_status_test.DLT|wall_cell_status_test|wall_cell_status_test.DLT|sku_wms_update_test|sku_wms_update_test.DLT|identifier_event_test|identifier_event_test.DLT|reservation_update_test|reservation_update_test.DLT|missing_update_test|missing_update_test.DLT|assembly_group_test|assembly_group_test.DLT|incident_test|incident_test.DLT|invoice_status_test|invoice_status_test.DLT|photo_studio_order_status_test|photo_studio_order_status_test.DLT|dev.reserve.cancel|dev.reserve.cancel.DLT|dev.ship|dev.ship.DLT|dev.ship.response|dev.invoices.create|dev.invoices.create.DLT|accepted_shipping_box|accepted_shipping_box.DLT|photo_sku_reservation_test|dev.boxtransfer|dev.boxtransfer.DLT|ke_wms_create_3pl_order_test|ke_wms_create_3pl_order_test.DLT|ke_wms_ali_russia_3pl_order_test|ke_wms_ali_russia_3pl_order_test.DLT|ke_wms_delivery_order_test|ke_wms_payment_order_test|ke_wms_payment_order_test.DLT|ke_wms_refund_order_test|ke_wms_cancel_order_test
  replicas: 3
  resources:
    limits:
      cpu: "1"
      memory: 3000M
    requests:
      cpu: 500m
      memory: 2000M
  template:
    pod:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: app_type
                operator: In
                values:
                - infra_services
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                - mirrorMaker2
            topologyKey: kubernetes.io/hostname
      imagePullSecrets:
      - name: gcr-ro-secret
      metadata:
        labels:
          app: mirrormaker2-kafka-mirrormaker-mirrormaker2
          app.kubernetes.io/component: mirrorMaker2
          app.kubernetes.io/instance: mirrormaker2-kafka-mirrormaker
          app.kubernetes.io/name: mirrormaker2
          app.kubernetes.io/repo-branch: unknown
          app.kubernetes.io/repo-last-commit: unknown
          app.kubernetes.io/repo-maintainer: unknown
          app.kubernetes.io/repo-name: unknown
          app.kubernetes.io/version: 0.1.0
          helm.sh/chart: mirrormaker2-0.1.0
      tolerations:
      - effect: NoSchedule
        key: app_type
        operator: Equal
        value: infra_services
  version: 3.2.1
status:
  conditions:
  - lastTransitionTime: "2023-03-13T13:45:23.207363776Z"
    status: "True"
    type: Ready
  connectors:
  - connector:
      state: RUNNING
      worker_id: 10.114.166.141:8083
    name: mcs-kafka->yc-common-dev.MirrorCheckpointConnector
    tasks: []
    type: source
  - connector:
      state: RUNNING
      worker_id: 10.114.166.141:8083
    name: mcs-kafka->yc-common-dev.MirrorCheckpointConnector
    tasks: []
    type: source
  - connector:
      state: RUNNING
      worker_id: 10.114.166.141:8083
    name: mcs-kafka->yc-common-dev.MirrorCheckpointConnector
    tasks:
    - id: 0
      state: RUNNING
      worker_id: 10.114.166.141:8083
    type: source
  - connector:
      state: RUNNING
      worker_id: 10.114.254.55:8083
    name: mcs-kafka->yc-common-dev.MirrorHeartbeatConnector
    tasks:
    - id: 0
      state: RUNNING
      worker_id: 10.114.181.228:8083
    type: source
  - connector:
      state: RUNNING
      worker_id: 10.114.254.55:8083
    name: mcs-kafka->yc-common-dev.MirrorHeartbeatConnector
    tasks:
    - id: 0
      state: RUNNING
      worker_id: 10.114.181.228:8083
    type: source
  - connector:
      state: UNASSIGNED
      worker_id: 10.114.254.55:8083
    name: mcs-kafka->yc-common-dev.MirrorHeartbeatConnector
    tasks:
    - id: 0
      state: RUNNING
      worker_id: 10.114.181.228:8083
    type: source
  - connector:
      state: RUNNING
      worker_id: 10.114.181.228:8083
    name: mcs-kafka->yc-common-dev.MirrorSourceConnector
    tasks:
    - id: 0
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 1
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 2
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 3
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 4
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 5
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 6
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 7
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 8
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 9
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 10
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 11
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 12
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 13
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 14
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 15
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 16
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 17
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 18
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 19
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 20
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 21
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 22
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 23
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 24
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 25
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 26
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 27
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 28
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 29
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 30
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 31
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 32
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 33
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 34
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 35
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 36
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 37
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 38
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 39
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 40
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 41
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 42
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 43
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 44
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 45
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 46
      state: RUNNING
      worker_id: 10.114.166.141:8083
    - id: 47
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 48
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 49
      state: RUNNING
      worker_id: 10.114.166.141:8083
    type: source
  - connector:
      state: RUNNING
      worker_id: 10.114.181.228:8083
    name: mcs-kafka->yc-common-dev.MirrorSourceConnector
    tasks:
    - id: 0
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 1
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 2
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 3
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 4
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 5
      state: RUNNING
      worker_id: 10.114.181.228:8083
    type: source
  - connector:
      state: RUNNING
      worker_id: 10.114.181.228:8083
    name: mcs-kafka->yc-common-dev.MirrorSourceConnector
    tasks:
    - id: 0
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 1
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 2
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 3
      state: RUNNING
      worker_id: 10.114.254.55:8083
    - id: 4
      state: RUNNING
      worker_id: 10.114.181.228:8083
    - id: 5
      state: RUNNING
      worker_id: 10.114.181.228:8083
    type: source
  labelSelector: strimzi.io/cluster=mirrormaker2-kafka-mirrormaker-mirrormaker2,strimzi.io/name=mirrormaker2-kafka-mirrormaker-mirrormaker2-mirrormaker2,strimzi.io/kind=KafkaMirrorMaker2
  observedGeneration: 12
  replicas: 3
  url: http://mirrormaker2-kafka-mirrormaker-mirrormaker2-mirrormaker2-api.kafka-mirror.svc:8083

Strimzi opertaor deploy:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "13"
  generation: 20
  labels:
    app: strimzi
    argocd.argoproj.io/instance: strimzi-operator
    chart: strimzi-kafka-operator-0.33.0
    component: deployment
    heritage: Helm
    release: strimzi-operator
  name: strimzi-cluster-operator
  namespace: infra-strimzi-operator
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: strimzi-cluster-operator
      strimzi.io/kind: cluster-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2023-02-20T19:41:27+03:00"
      creationTimestamp: null
      labels:
        name: strimzi-cluster-operator
        strimzi.io/kind: cluster-operator
    spec:
      containers:
      - args:
        - /opt/strimzi/bin/cluster_operator_run.sh
        env:
        - name: STRIMZI_NAMESPACE
          value: '*'
        - name: STRIMZI_FULL_RECONCILIATION_INTERVAL_MS
          value: "120000"
        - name: STRIMZI_OPERATION_TIMEOUT_MS
          value: "300000"
        - name: STRIMZI_DEFAULT_TLS_SIDECAR_ENTITY_OPERATOR_IMAGE
          value: quay.io/strimzi/kafka:0.33.0-kafka-3.3.2
        - name: STRIMZI_DEFAULT_KAFKA_EXPORTER_IMAGE
          value: quay.io/strimzi/kafka:0.33.0-kafka-3.3.2
        - name: STRIMZI_DEFAULT_CRUISE_CONTROL_IMAGE
          value: quay.io/strimzi/kafka:0.33.0-kafka-3.3.2
        - name: STRIMZI_KAFKA_IMAGES
          value: |
            3.2.0=quay.io/strimzi/kafka:0.33.0-kafka-3.2.0
            3.2.1=quay.io/strimzi/kafka:0.33.0-kafka-3.2.1
            3.2.3=quay.io/strimzi/kafka:0.33.0-kafka-3.2.3
            3.3.1=quay.io/strimzi/kafka:0.33.0-kafka-3.3.1
            3.3.2=quay.io/strimzi/kafka:0.33.0-kafka-3.3.2
        - name: STRIMZI_KAFKA_CONNECT_IMAGES
          value: |
            3.2.0=quay.io/strimzi/kafka:0.33.0-kafka-3.2.0
            3.2.1=quay.io/strimzi/kafka:0.33.0-kafka-3.2.1
            3.2.3=quay.io/strimzi/kafka:0.33.0-kafka-3.2.3
            3.3.1=quay.io/strimzi/kafka:0.33.0-kafka-3.3.1
            3.3.2=quay.io/strimzi/kafka:0.33.0-kafka-3.3.2
        - name: STRIMZI_KAFKA_MIRROR_MAKER_IMAGES
          value: |
            3.2.0=quay.io/strimzi/kafka:0.33.0-kafka-3.2.0
            3.2.1=quay.io/strimzi/kafka:0.33.0-kafka-3.2.1
            3.2.3=quay.io/strimzi/kafka:0.33.0-kafka-3.2.3
            3.3.1=quay.io/strimzi/kafka:0.33.0-kafka-3.3.1
            3.3.2=quay.io/strimzi/kafka:0.33.0-kafka-3.3.2
        - name: STRIMZI_KAFKA_MIRROR_MAKER_2_IMAGES
          value: |
            3.2.0=quay.io/strimzi/kafka:0.33.0-kafka-3.2.0
            3.2.1=quay.io/strimzi/kafka:0.33.0-kafka-3.2.1
            3.2.3=quay.io/strimzi/kafka:0.33.0-kafka-3.2.3
            3.3.1=quay.io/strimzi/kafka:0.33.0-kafka-3.3.1
            3.3.2=quay.io/strimzi/kafka:0.33.0-kafka-3.3.2
        - name: STRIMZI_DEFAULT_TOPIC_OPERATOR_IMAGE
          value: quay.io/strimzi/operator:0.33.0
        - name: STRIMZI_DEFAULT_USER_OPERATOR_IMAGE
          value: quay.io/strimzi/operator:0.33.0
        - name: STRIMZI_DEFAULT_KAFKA_INIT_IMAGE
          value: quay.io/strimzi/operator:0.33.0
        - name: STRIMZI_DEFAULT_KAFKA_BRIDGE_IMAGE
          value: quay.io/strimzi/kafka-bridge:0.24.0
        - name: STRIMZI_DEFAULT_JMXTRANS_IMAGE
          value: quay.io/strimzi/jmxtrans:0.33.0
        - name: STRIMZI_DEFAULT_KANIKO_EXECUTOR_IMAGE
          value: quay.io/strimzi/kaniko-executor:0.33.0
        - name: STRIMZI_DEFAULT_MAVEN_BUILDER
          value: quay.io/strimzi/maven-builder:0.33.0
        - name: STRIMZI_OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: STRIMZI_FEATURE_GATES
        - name: STRIMZI_LEADER_ELECTION_ENABLED
          value: "true"
        - name: STRIMZI_LEADER_ELECTION_LEASE_NAME
          value: strimzi-cluster-operator
        - name: STRIMZI_LEADER_ELECTION_LEASE_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: STRIMZI_LEADER_ELECTION_IDENTITY
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: quay.io/strimzi/operator:0.33.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: strimzi-cluster-operator
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: strimzi-tmp
        - mountPath: /opt/strimzi/custom-config/
          name: co-config-volume
      dnsPolicy: ClusterFirst
      nodeSelector:
        app_type: infra_services
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: strimzi-cluster-operator
      serviceAccountName: strimzi-cluster-operator
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: app_type
        operator: Equal
        value: infra_services
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 1Mi
        name: strimzi-tmp
      - configMap:
          defaultMode: 420
          name: strimzi-cluster-operator
        name: co-config-volume
status:
  availableReplicas: 3
  conditions:
  - lastTransitionTime: "2022-10-05T16:10:32Z"
    lastUpdateTime: "2023-02-22T07:32:07Z"
    message: ReplicaSet "strimzi-cluster-operator-7f49b9df7f" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-03-14T23:44:24Z"
    lastUpdateTime: "2023-03-14T23:44:24Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 20
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3
SagdeevRR commented 1 year ago

Problem was observed at 06:18:31 in strimzi logs.

stimzi logs: strimzi-cluster-operator-7f49b9df7f-wznff.log kmm2 logs: KafkaMirrorMaker2.log

scholzj commented 1 year ago

Thanks for sharing the logs. Looks like at that time, your Mirror Maker cluster broken down - possibly because of an issue with one of the Kafka clusters. That probably caused some HTTP call to get stuck or exception was unhandled. The INFO level log does not give many details as to what could have been the issue. But we can have a look at it.

Your log seems to have also a lot of issues like this:

2023-03-14 03:38:37 WARN  AbstractOperator:525 - Reconciliation #1532(watch) KafkaMirrorMaker2(kafka-mirror/mirrormaker2-kafka-mirrormaker-mirrormaker2): Failed to reconcile
io.strimzi.operator.cluster.operator.assembly.ConnectRestException: GET /connectors/mcs-kafka->yc-common-dev.MirrorHeartbeatConnector/config returned 409 (Conflict): Cannot complete request momentarily due to stale configuration (typically caused by a concurrent config change)
    at io.strimzi.operator.cluster.operator.assembly.KafkaConnectApiImpl.lambda$doGet$6(KafkaConnectApiImpl.java:138) ~[io.strimzi.cluster-operator-0.33.0.jar:0.33.0]
    at io.vertx.core.impl.future.FutureImpl$1.onSuccess(FutureImpl.java:91) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:54) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.impl.EventLoopContext.execute(EventLoopContext.java:86) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.impl.ContextBase.execute(ContextBase.java:225) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:51) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:211) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.impl.future.PromiseImpl.tryComplete(PromiseImpl.java:23) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.http.impl.HttpEventHandler.handleEnd(HttpEventHandler.java:79) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.http.impl.HttpClientResponseImpl.handleEnd(HttpClientResponseImpl.java:250) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.lambda$new$0(Http1xClientConnection.java:398) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.streams.impl.InboundBuffer.handleEvent(InboundBuffer.java:239) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.streams.impl.InboundBuffer.write(InboundBuffer.java:129) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.handleEnd(Http1xClientConnection.java:645) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.impl.EventLoopContext.execute(EventLoopContext.java:76) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.impl.ContextBase.execute(ContextBase.java:232) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.http.impl.Http1xClientConnection.handleResponseEnd(Http1xClientConnection.java:879) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.http.impl.Http1xClientConnection.handleHttpMessage(Http1xClientConnection.java:751) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.http.impl.Http1xClientConnection.handleMessage(Http1xClientConnection.java:715) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.net.impl.ConnectionBase.read(ConnectionBase.java:157) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.vertx.core.net.impl.VertxHandler.channelRead(VertxHandler.java:153) ~[io.vertx.vertx-core-4.3.6.jar:4.3.6]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[io.netty.netty-codec-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[io.netty.netty-codec-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[io.netty.netty-handler-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[io.netty.netty-transport-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.86.Final.jar:4.1.86.Final]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]

That might suggest that you have a lot of consumer rebalances. But the Mirror Maker 2 log is too short to cover this, so not sure. It is unrelated to the stuck reconciliation - but might be something you want to look into.

SagdeevRR commented 1 year ago

ds-file-d-dev-cluster-apps-kafka-mirror-2023.03.14-000099.zip I attached more logs, mb will be helpful.

We can enable auto restart reconcile in strimzi operator? Restarting strimzi leader pod usually helps

scholzj commented 1 year ago

Triaged on 23.3.2023: Looks like a bug, although not completely clear where. We should try to go through the code to see if there is any place where we might not handle the error properly.

scholzj commented 1 year ago

I tried to look into this. But TBH I'm a bit confused:


I found a few places where this could have gotten stuck. Without complete logs at DEBUG level, I have no 100% certainty that this mirrors the original problem reported in this issue. But I opened a PR for it and we can get back to this if it happens again.