redpanda-data / kminion

KMinion is a feature-rich Prometheus exporter for Apache Kafka written in Go. It is lightweight and highly configurable so that it will meet your requirements.

Error on message lag on consumer offsets #167

Closed: HighWatersDev closed this issue 6 months ago

HighWatersDev commented 1 year ago

Hi,

I'm running kminion v2.2.0 and it started off just fine. However, after some time, I'm getting these errors:

{"level":"info","ts":"2022-09-19T16:08:11.957Z","logger":"main.storage","msg":"Tried to fetch consumer group offsets, but haven't consumed the whole topic yet"}
{"level":"info","ts":"2022-09-19T16:08:12.031Z","logger":"main.minion_service","msg":"catching up the message lag on consumer offsets","lagging_partitions_count":1,"lagging_partitions":[{"Name":"__consumer_offsets","Id":6,"Lag":328}],"total_lag":328}

values.yaml

deployment:
  volumes:
    secrets:
      - secretName: kafka-tls
        mountPath: /secret/tls
kminion:
  config:
    kafka:
      brokers:
        - kafka-cluster-sc-kafka-0.kafka-cluster-sc-kafka-brokers.kafka.svc.cluster.local:9094
        - kafka-cluster-sc-kafka-1.kafka-cluster-sc-kafka-brokers.kafka.svc.cluster.local:9094
        - kafka-cluster-sc-kafka-2.kafka-cluster-sc-kafka-brokers.kafka.svc.cluster.local:9094
      clientId: "kminion"
      tls:
        enabled: true
        caFilepath: "/secret/tls/ca.crt"
        certFilepath: "/secret/tls/tls.crt"
        keyFilepath: "/secret/tls/tls.key"
    minion:
      consumerGroups:
        enabled: true
        scrapeMode: offsetsTopic # Valid values: adminApi, offsetsTopic
        granularity: partition
        allowedGroups: [ ".*" ]
        ignoredGroups: [ ]
      topics:
        granularity: partition
        allowedTopics: [ ".*" ]
        ignoredTopics: [ ]
        infoMetric:
          configKeys: [ "cleanup.policy" ]
      logDirs:
        enabled: true

      endToEnd:
        enabled: true
        probeInterval: 100ms
        topicManagement:
          enabled: true
          name: kminion-end-to-end
          reconciliationInterval: 10m
          replicationFactor: 1
          partitionsPerBroker: 1

        producer:
          ackSla: 5s
          requiredAcks: all

        consumer:
          groupIdPrefix: kminion-end-to-end
          deleteStaleConsumerGroups: false
          roundtripSla: 20s
          commitSla: 10s
serviceMonitor:
  create: true
  additionalLabels:
    release: prom-stack
reidmeyer commented 1 year ago

Any progress on this issue? I'm having a similar problem.

weeco commented 1 year ago

Hello, this is an informational log message as far as I can tell. Is there any impact due to this?

This message indicates that it's not able to consume this specific partition. I'm not sure what the reason for this may be in your cluster, but you could also change the scrape mode to use the Kafka API rather than consuming the consumer offsets topic.
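
For example, switching to the admin API would be a one-line change in the consumerGroups block of the values.yaml you posted (a minimal sketch; everything else can stay as it is):

kminion:
  config:
    minion:
      consumerGroups:
        enabled: true
        scrapeMode: adminApi # instead of offsetsTopic; valid values: adminApi, offsetsTopic
        granularity: partition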

reidmeyer commented 1 year ago

Hi weeco,

My issue indeed occurs when I'm using offsetsTopic as the scrape mode.

My application consumes many partitions of the __consumer_offsets topic and then gets stuck on 2 partitions, as shown below:

{"level":"info","ts":"2023-06-13T13:18:54.490Z","logger":"main.minion_service","msg":"catching up the message lag on consumer offsets","lagging_partitions_count":2,"lagging_partitions":[{"Name":"__consumer_offsets","Id":35,"Lag":11450323},{"Name":"__consumer_offsets","Id":3,"Lag":11790495}],"total_lag":23240818}

Although the log message is informational, it means the pod never enters the ready state: /ready returns a 503, so the pod cannot be reached from outside.

Changing the scrape mode is a valid solution for me. Are you aware of any downsides to using the adminApi mode, other than missing the kminion_kafka_consumer_group_offset_commits_total metric?

michaeljwood commented 1 year ago

I ran into this, and I was able to recover by changing the leader of the partition that was reporting the lag. I'm not sure yet if there is a problem with that particular broker, as this was the only problem I was having in a fairly active cluster. To be fair though, this also means I lost my dashboards/alerting for a bit, so it's possible there were some other issues I just didn't catch.

I had first tried just restarting the leader broker, but that didn't seem to help at all.
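
For completeness, here is a rough sketch of how a leader change like this can be triggered with Kafka's stock tooling (the broker address, the client.properties file holding the TLS settings, and the partition number are placeholders; partition 6 is taken from the log line in the original report, not from my cluster):

# Trigger a preferred leader election for the single lagging partition
bin/kafka-leader-election.sh \
  --bootstrap-server kafka-cluster-sc-kafka-0.kafka-cluster-sc-kafka-brokers.kafka.svc.cluster.local:9094 \
  --admin.config client.properties \
  --election-type PREFERRED \
  --topic __consumer_offsets \
  --partition 6

Note that a preferred election only helps if leadership currently sits off the preferred replica; if the preferred replica is already the leader, you would need a partition reassignment to actually move it.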

weeco commented 1 year ago

> Changing the scrape mode is a valid solution for me. Are you aware of any downsides to using the adminApi mode, other than missing the kminion_kafka_consumer_group_offset_commits_total metric?

No side effects besides less accessible information (the number of commits); in fact, most Kafka exporters just use the Kafka API because it's much easier to implement.

reidmeyer commented 1 year ago

Thanks for sharing the info @michaeljwood and @weeco. I might try restarting my brokers one by one to trigger the assignment of a new leader, or find another way to assign a new leader.

weeco commented 6 months ago

@reidmeyer This seems very rare and specific to very few Kafka environments. It's unclear why it happens, and I have very little information about what I could possibly look at. The code looks fine and works well in many other clusters. My recommendation is to use the default scrapeMode (adminApi) instead if the offsetsTopic scrape mode is causing issues.