redpanda-data / kminion

KMinion is a feature-rich Prometheus exporter for Apache Kafka written in Go. It is lightweight and highly configurable so that it will meet your requirements.

kafka-minion stuck on "Offsets topic has not yet been consumed until the end" for days #15

Closed ChrisBlom closed 3 years ago

ChrisBlom commented 5 years ago

Hi, after successfully testing kafka-minion locally and in a test environment, I deployed kafka-minion in an existing production environment with a long-running Kafka cluster.

The test environment was created just for the purpose of evaluating kafka-minion, so there was not much data in the __consumer_offsets topic, and it started pretty much instantly.

In the production environment the Kafka cluster has been running for months, so I expected that it would take some time for kafka-minion to start (in other issues people mentioned startup times in the 10 minute range), but in my case it is still not ready after 3 days of running.

The logs over the last 3 days only show this:

kafka-minion[30248]: {"level":"info","msg":"Offets topic has not yet been consumed until the end","time":"2019-08-29T08:40:56Z"}

My question is what could cause kafka-minion to take so long to initialize, and what can I do to fix this?

weeco commented 5 years ago

Hey, running Kafka Minion against a cluster which has been running for several months or even years shouldn't be a problem. The topic is compacted by default, so you usually only have a few million consumer offset messages. If you assign 2 CPU cores to Kafka Minion, it should be able to process roughly 100k messages/second.

In your case I'd take a look at the exposed Prometheus metrics. Kafka Minion has a few internal metrics, such as the number of consumed offset messages, which would be interesting in your case. You should already see a graph for that in the default Grafana dashboard. What does it show after Kafka Minion has been started and has been running for a few minutes?

ChrisBlom commented 5 years ago

Thanks for the quick reply, this is what I see after it starts:

[screenshot: kafka-minion-progress]

There is one error message in the logs; could that indicate a problem, or is it a transient error?

Aug 29 12:03:55 kafka-zookeeper01 systemd[1]: Started kafka-minion.
Aug 29 12:03:55 kafka-zookeeper01 kafka-minion[17955]: {"level":"info","msg":"Starting kafka minion version1.0.0","time":"2019-08-29T12:03:55Z"}
Aug 29 12:03:55 kafka-zookeeper01 kafka-minion[17955]: {"address":"kafka.internal:9092","level":"info","module":"cluster","msg":"connecting to kafka cluster","time":"2019-08-29T12:03:55Z"}
Aug 29 12:03:55 kafka-zookeeper01 kafka-minion[17955]: {"address":"kafka.internal:9092","level":"info","module":"cluster","msg":"successfully connected to kafka cluster","time":"2019-08-29T12:03:55Z"}
Aug 29 12:03:55 kafka-zookeeper01 kafka-minion[17955]: {"address":"kafka.internal:9092","level":"info","module":"offset_consumer","msg":"Connecting to kafka cluster","time":"2019-08-29T12:03:55Z"}
Aug 29 12:03:55 kafka-zookeeper01 kafka-minion[17955]: {"error":"no brokers available","level":"warning","module":"cluster","msg":"failed to get active broker to refresh topic config","time":"2019-08-29T12:03:55Z"}
Aug 29 12:03:56 kafka-zookeeper01 kafka-minion[17955]: {"address":"kafka.internal:9092","level":"info","module":"offset_consumer","msg":"Successfully connected to kafka cluster","time":"2019-08-29T12:03:56Z"}
Aug 29 12:03:56 kafka-zookeeper01 kafka-minion[17955]: {"count":50,"level":"info","msg":"Starting '50' partition consumers","time":"2019-08-29T12:03:56Z","topic":"__consumer_offsets"}
Aug 29 12:03:56 kafka-zookeeper01 kafka-minion[17955]: {"count":50,"level":"info","msg":"Spawned all consumers","time":"2019-08-29T12:03:56Z","topic":"__consumer_offsets"}
Aug 29 12:03:56 kafka-zookeeper01 kafka-minion[17955]: {"level":"info","msg":"Listening on: '0.0.0.0:7070","time":"2019-08-29T12:03:56Z"}
Aug 29 12:03:56 kafka-zookeeper01 kafka-minion[17955]: {"level":"info","module":"storage","msg":"Registered 50 __consumer_offsets partitions which have to be consumed before metrics can be exposed","time":"2019-08-29T12:03:56Z"}
Aug 29 12:03:56 kafka-zookeeper01 kafka-minion[17955]: {"level":"info","msg":"Offets topic has not yet been consumed until the end","time":"2019-08-29T12:03:56Z"}
Aug 29 12:04:11 kafka-zookeeper01 kafka-minion[17955]: {"level":"info","msg":"Offets topic has not yet been consumed until the end","time":"2019-08-29T12:04:11Z"}
Aug 29 12:04:26 kafka-zookeeper01 kafka-minion[17955]: {"level":"info","msg":"Offets topic has not yet been consumed until the end","time":"2019-08-29T12:04:26Z"}
Aug 29 12:04:41 kafka-zookeeper01 kafka-minion[17955]: {"level":"info","msg":"Offets topic has not yet been consumed until the end","time":"2019-08-29T12:04:41Z"}
> curl localhost:7070/readycheck
Offsets topic has not been consumed yet
weeco commented 5 years ago

Hey @ChrisBlom , I am sorry for the late response, I just came back from holidays.

The logs and Prometheus metrics look fine. Upon invocation of the /metrics endpoint, Kafka Minion checks whether it has already consumed the offsets topic to the end. This way we ensure that we do not report wrong metrics (like Burrow does during the startup phase). If you enable debug logs, we should get more information about why one or more partitions are not being marked as ready: https://github.com/cloudworkz/kafka-minion/blob/5a0eb7b9210ddf7dd9e7e48321ae58baeeb14918/kafka/offset_consumer.go#L149-L160

Be warned: the debug logs could be quite verbose though.
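
For reference, the readiness check that produces the "Offsets topic has not yet been consumed until the end" message boils down to a per-partition lag comparison. Below is a minimal sketch of that condition with illustrative names and example values; it is not KMinion's actual code:

```go
// Minimal sketch of the per-partition readiness condition, assuming the high
// water mark is captured when the partition consumers start and the consumed
// offset is updated on every message. Names are illustrative.
package main

import "fmt"

type partitionProgress struct {
	highWaterMark  int64 // next offset the broker would assign, captured at startup
	consumedOffset int64 // next offset this consumer expects to read
}

// remainingLag corresponds to the "remaining_lag" value in the debug logs.
func (p partitionProgress) remainingLag() int64 {
	return p.highWaterMark - p.consumedOffset
}

// ready: the partition counts as consumed "until the end" once the lag is gone.
// If the records just below the high water mark no longer exist (compaction,
// tombstone deletion), no message ever arrives to close the gap and this
// stays false forever - the symptom discussed in this issue.
func (p partitionProgress) ready() bool {
	return p.remainingLag() <= 0
}

func main() {
	stuck := partitionProgress{highWaterMark: 968, consumedOffset: 967}
	fmt.Println(stuck.remainingLag(), stuck.ready()) // prints: 1 false
}
```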

ChrisBlom commented 5 years ago

Thanks, I've enabled debug logging; it seems there is a problem with consuming one specific partition:

> journalctl -u kafka-minion | grep caught
Sep 02 07:50:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:50:25Z"}
Sep 02 07:50:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:50:55Z"}
Sep 02 07:51:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:51:25Z"}
Sep 02 07:51:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:51:55Z"}
Sep 02 07:52:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:52:25Z"}
Sep 02 07:52:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:52:55Z"}
Sep 02 07:53:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:53:25Z"}
Sep 02 07:53:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:53:55Z"}
Sep 02 07:54:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:54:25Z"}
Sep 02 07:54:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:54:55Z"}
Sep 02 07:55:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:55:25Z"}
Sep 02 07:55:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:55:55Z"}
Sep 02 07:56:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:56:25Z"}
Sep 02 07:56:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:56:55Z"}
Sep 02 07:57:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:57:25Z"}
Sep 02 07:57:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:57:55Z"}
Sep 02 07:58:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:58:25Z"}
Sep 02 07:58:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:58:55Z"}
Sep 02 07:59:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:59:25Z"}
Sep 02 07:59:30 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:59:30Z"}
Sep 02 07:59:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T07:59:55Z"}
Sep 02 08:00:00 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T08:00:00Z"}
Sep 02 08:00:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T08:00:25Z"}
Sep 02 08:00:30 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T08:00:30Z"}
Sep 02 08:00:55 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T08:00:55Z"}
Sep 02 08:01:00 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T08:01:00Z"}
Sep 02 08:01:25 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T08:01:25Z"}
Sep 02 08:01:30 kafka-minion[23769]: {"consumed_offset":967,"high_water_mark":968,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":1,"time":"2019-09-02T08:01:30Z"}

Do you have any ideas on what could cause this?

ChrisBlom commented 5 years ago

OK, strange, but the offsets topic has now been consumed. I'll investigate a bit more and let you know if I have problems with it again.

UARedko commented 5 years ago

It seems I'm facing the same issue. Although it works fine in my dev environment, it doesn't work in prod. It's been running for two days already, but it can't finish reading.

I see this error in the logs:

{"broker_id":8,"error":"kafka: broker not connected","level":"error","module":"cluster","msg":"failed to fetch low watermarks from broker","time":"2019-10-04T14:52:55Z"}

I also ran it with debug log level:

{"consumed_offset":0,"high_water_mark":5019255,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":22,"remaining_lag":5019255,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":80855807,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":10,"remaining_lag":80855807,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":17194318,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":34,"remaining_lag":17194318,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":18691333,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":1,"remaining_lag":18691333,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":24174436,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":37,"remaining_lag":24174436,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":153134,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":19,"remaining_lag":153134,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":28461577,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":43,"remaining_lag":28461577,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":39447915,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":7,"remaining_lag":39447915,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":9020359,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":25,"remaining_lag":9020359,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":5431960,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":2,"remaining_lag":5431960,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":32726148,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":13,"remaining_lag":32726148,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":7724502,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":31,"remaining_lag":7724502,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":23191398,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":14,"remaining_lag":23191398,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":14121843,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":39,"remaining_lag":14121843,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":6455637,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":45,"remaining_lag":6455637,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":24351947,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":44,"remaining_lag":24351947,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":25792564,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":41,"remaining_lag":25792564,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":39523711,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":38,"remaining_lag":39523711,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":411311,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":29,"remaining_lag":411311,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":1863925,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":0,"remaining_lag":1863925,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":563072,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":23,"remaining_lag":563072,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":62469941,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":11,"remaining_lag":62469941,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":384110285,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":5,"remaining_lag":384110285,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":25208,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":12,"remaining_lag":25208,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":27564317,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":17,"remaining_lag":27564317,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":1129204,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":33,"remaining_lag":1129204,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":2097282,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":3,"remaining_lag":2097282,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":13572203,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":48,"remaining_lag":13572203,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":27059199,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":24,"remaining_lag":27059199,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":42089815,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":6,"remaining_lag":42089815,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":34853311,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":42,"remaining_lag":34853311,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":32786075,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":18,"remaining_lag":32786075,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":4118450,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":9,"remaining_lag":4118450,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":1091294,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":27,"remaining_lag":1091294,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":35625671,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":15,"remaining_lag":35625671,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":7778183,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":21,"remaining_lag":7778183,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":244108,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":30,"remaining_lag":244108,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":28826000,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":35,"remaining_lag":28826000,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":5606581,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":26,"remaining_lag":5606581,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":42256,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":49,"remaining_lag":42256,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":14122642,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":47,"remaining_lag":14122642,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":7836467,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":32,"remaining_lag":7836467,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":66618221,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":28,"remaining_lag":66618221,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":71230,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":8,"remaining_lag":71230,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":34736684,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":20,"remaining_lag":34736684,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":4035596,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":16,"remaining_lag":4035596,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":14877333,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":36,"remaining_lag":14877333,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":312240,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":40,"remaining_lag":312240,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":20262200,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":4,"remaining_lag":20262200,"time":"2019-10-03T17:05:04Z"}
{"consumed_offset":0,"high_water_mark":15816940,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":46,"remaining_lag":15816940,"time":"2019-10-03T17:05:04Z"}

UARedko commented 5 years ago

Thanks for the comment, weeco. It would be good to know whether this is related to the huge number of messages and whether it might finish reading the offsets after a few days. In any case, it seems I have to find another solution for now.

kongzii commented 5 years ago

Same problem here. Minion is stuck at 2 partitions with lag 2 forever. If I try to consume __consumer_offsets with kafkacat, it's quick and without any problems. Did you find any solution?

{"level":"debug","module":"cluster","msg":"collected topic offsets","time":"2019-10-11T13:06:18Z"}
{"consumed_offset":0,"high_water_mark":2,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":26,"remaining_lag":2,"time":"2019-10-11T13:06:18Z"}
{"consumed_offset":0,"high_water_mark":2,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":3,"remaining_lag":2,"time":"2019-10-11T13:06:19Z"}
weeco commented 5 years ago

@kongzii Is there any chance you could look into partitions 26 and 3 of the __consumer_offsets topic?

I think I have an idea of what could cause the issues here. These partitions are probably empty (because the messages have been compacted) and don't receive any further offsets.
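
To check that hypothesis, comparing the low and high water marks of the affected partitions should be enough: if they are equal but non-zero, the partition is empty even though its high water mark has advanced, so a consumer starting from the oldest offset never receives a message. A small diagnostic sketch, assuming the sarama client and a placeholder broker address:

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	// Placeholder broker address; partitions 3 and 26 are the ones from the
	// debug logs above.
	client, err := sarama.NewClient([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	for _, p := range []int32{3, 26} {
		oldest, err := client.GetOffset("__consumer_offsets", p, sarama.OffsetOldest)
		if err != nil {
			log.Fatal(err)
		}
		newest, err := client.GetOffset("__consumer_offsets", p, sarama.OffsetNewest)
		if err != nil {
			log.Fatal(err)
		}
		// oldest == newest means there is nothing left to consume on this
		// partition, even if newest (the high water mark) is greater than zero.
		fmt.Printf("partition %d: oldest=%d newest=%d empty=%t\n", p, oldest, newest, oldest == newest)
	}
}
```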

kongzii commented 5 years ago

@weeco This is a fresh installation of Kafka, containing only some hello-world messages. Most of the partitions are empty, but only those two had a problem.

Partition - Latest offset (checked via Yahoo Kafka Manager):
2 - 2614
3 - 3
8 - 3
26 - 3
33 - 3

The others have 0. I tried to restart Minion and now three partitions have the issue:

{"consumed_offset":0,"high_water_mark":2,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":33,"remaining_lag":2,"time":"2019-10-15T06:23:36Z"}
{"consumed_offset":0,"high_water_mark":2,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":26,"remaining_lag":2,"time":"2019-10-15T06:23:37Z"}
{"consumed_offset":0,"high_water_mark":2,"level":"debug","msg":"partition consumer has not caught up the lag yet","partition":3,"remaining_lag":2,"time":"2019-10-15T06:23:37Z"}

But indeed, when I try to consume messages with kafkacat from partitions 3, 26 or 33 only, I receive nothing:

root@kafka-kafkacat-65788d774b-h4lc7:/# kafkacat -b kafka-bootstrap:9092 -t __consumer_offsets -o beginning -c 100 -p 33 -e
% Auto-selecting Consumer mode (use -P or -C to override)
% Reached end of topic __consumer_offsets [33] at offset 3: exiting

And from some other partitions:

root@kafka-kafkacat-65788d774b-h4lc7:/# kafkacat -b kafka-bootstrap:9092 -t __consumer_offsets -o beginning -c 100 -p 8 -e
% Auto-selecting Consumer mode (use -P or -C to override)
consumerrange/consumer-1-ac8989cb-5042-49da-a04f-32e9234a9a25m???/consumer-1-ac8989cb-5042-49da-a04f-32e9234a9a25??
consumer-1
          /10.244.1.73??'__consumer_offsets?__consumer_offsets2

!"#$%&'()*+,-./01
consumer????m??5?

% Reached end of topic __consumer_offsets [8] at offset 3: exiting
root@kafka-kafkacat-65788d774b-h4lc7:/# kafkacat -b kafka-bootstrap:9092 -t __consumer_offsets -o beginning -c 100 -p 0 -e
% Auto-selecting Consumer mode (use -P or -C to override)
% Reached end of topic __consumer_offsets [0] at offset 0: exiting
vasili439 commented 4 years ago

Same issue:

{"level":"info","msg":"Offets topic has not yet been consumed until the end","time":"2019-11-26T11:40:25Z"}
weeco commented 4 years ago

During the holidays between Christmas and New Year's Eve I'll have some time to investigate and potentially rewrite the logic for that. I am eager to fix your reported problems. Any additional information might be helpful, as I cannot reproduce this in any of my own clusters.

ut0mt8 commented 4 years ago

I really love kafka minion, but I think that consuming the __consumer_offsets topic is not the best option to get the lag. As we also struggled with this kind of issue, I finally wrote a small exporter that exposes the lag in a more naive but reliable way => https://github.com/ut0mt8/yakle. I've added something costly but very convenient: the lag in terms of duration.

If you have any comments, feel free to share them.

weeco commented 4 years ago

@ut0mt8 That defeats the whole purpose of Kafka Minion. There are already several (reliable) lag exporters which do not consume the __consumer_offsets topic but use Kafka's admin protocol instead. However, I described here why I decided against that approach: https://github.com/cloudworkz/kafka-minion#why-did-you-create-yet-another-kafka-lag-exporter

If you don't need these features https://github.com/danielqsj/kafka_exporter would probably work for you as well.

ut0mt8 commented 4 years ago

Yes, kafka-exporter works with a forked version. But still, what is the advantage of consuming __consumer_offsets over using discrete queries?

weeco commented 4 years ago

To name some of the advantages (quoted and answered in the reply below):

ut0mt8 commented 4 years ago

> More realtime lags

Querying Kafka periodically is less precise, OK.

> More information (e.g. group member & offset metadata, commit rate per group, last commit activity - which can also be used to calculate expected offset expiries)

I think we can have this information with either method? At least group members and commit rate per consumer group.

> More reliable in the sense that you'll still get lags reported during leader elections

Well, OK, but elections are rare, and it's not really a problem to miss lag reporting during a short period.

> Access to previous commits and historical events (this is not relevant for kafka minion, but Burrow uses these for a more sophisticated approach, determining the health of each consumer by looking at the last x commits)

I understand the design, but the big problem for me is that you need to consume all the partitions at startup. That can take really long, and it also makes your process completely stateful. So if you get lost in the consumption, you need to restart the process. It happened to us several times. This is why I prefer the kafka-exporter approach; my attempt is just for fun anyway (but it works).

Anyway, good job, and let's see if we can use good ideas from one project in the other.
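
For context, the "discrete queries" approach contrasted here amounts to two lookups per consumer group: fetch the group's committed offsets via the admin protocol, fetch the partitions' current high water marks, and subtract. A rough sketch with the sarama client; the broker address and group name are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0

	// Placeholder broker address and consumer group name.
	client, err := sarama.NewClient([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	admin, err := sarama.NewClusterAdminFromClient(client)
	if err != nil {
		log.Fatal(err)
	}

	// Passing nil asks for every topic/partition the group has committed offsets for.
	offsets, err := admin.ListConsumerGroupOffsets("example-group", nil)
	if err != nil {
		log.Fatal(err)
	}

	for topic, partitions := range offsets.Blocks {
		for partition, block := range partitions {
			if block.Offset < 0 {
				continue // no committed offset for this partition
			}
			hwm, err := client.GetOffset(topic, partition, sarama.OffsetNewest)
			if err != nil {
				log.Fatal(err)
			}
			fmt.Printf("%s/%d lag=%d\n", topic, partition, hwm-block.Offset)
		}
	}
}
```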

mortenko commented 4 years ago

Hi guys, do you have any updates? I have the same problem.

ut0mt8 commented 4 years ago

Well, I finished writing my own exporter... https://github.com/ut0mt8/yakle if you want to test it. For the moment it's working well for me, plus you get the lag in seconds. You can give it a try ;)

josepplaltg commented 4 years ago

Maybe it's not your case, but I've experienced the same issue and it was just a matter of the ACLs needed for the kafka-minion consumer (if you're using ACLs, of course). I'm running a big cluster and the client starts pushing metrics in a few seconds... after the ACL fix :P

ZVilusinsky commented 4 years ago

Well, this started happening on our dev environment at the end of April. :-/ Minion has been stuck on "level":"info","msg":"Offets topic has not yet been consumed until the end" for several days and is not showing any signs of finishing. I am kinda worried that after a certain cluster age, Minion can't read the whole offsets topic in a reasonable amount of time. Even on a cluster that's been up for just a month, the startup time of a fresh Minion was 8 minutes.

weeco commented 4 years ago

@ZVilusinsky I don't think this is an issue with the number of messages. The consumer offsets topic is a compacted topic, and a single core for Kafka Minion can process 50-100k messages/second. Even on huge clusters this should only take a few minutes. As this issue indicates, there seems to be a problem with single partitions not delivering the "last message".

Please enable debug log messages and invoke the /metrics endpoint after 10 minutes or so. The logs should print how far each partition has been consumed. If you look at the Grafana dashboard for Kafka Minion, there's also a group of "internal metrics" which shows the processing speed / consumed offsets per second.

ZVilusinsky commented 4 years ago

Well, in our case it runs at about 10-20% of your estimated messages/second: [screenshot]. Minion runs dockerized with no resource limits on a virtual machine with 6 CPUs and 68 GB RAM.

weeco commented 4 years ago

@ZVilusinsky That's interesting. Can you check whether Kafka Minion consumes all the CPU? If not (which I expect), I assume Kafka throughput is the bottleneck here (could it be the number of partitions?).

ZVilusinsky commented 4 years ago

After 16 hours: [screenshot]. It seems I may have isolated the culprit: I found out there is a relatively new topic that is, according to the Kafka team, "a biggie". It worked okay until the server was restarted for patching and had to read it again. I guess it could result in a loop where new messages keep arriving before it manages to reach the end of __consumer_offsets, so it never starts providing metrics? Still, it managed to start in the new environment... I've pointed out the situation to the team and we will see.

shyimo commented 4 years ago

We have the same issue as well. Our Kafka cluster has been up and running for about 6 months, and we have been stuck for almost 2 days with this log (running the latest version of kafka-minion with Kafka 2.2.0): {"level":"info","msg":"Offets topic has not yet been consumed until the end"}

Kumartg commented 4 years ago

Hi, we are also facing this issue and getting the {"level":"info","msg":"Offets topic has not yet been consumed until the end"} message.

If we enable DEBUG mode, we get the following message:

kafka_minion: {"broker_id":1,"level":"debug","module":"cluster","msg":"received partition high water mark","offset":0,"partition":1,"time":"2019-10-15T06:23:36Z","timestamp":1593837283723,"topic":"test01"}

Minion build: 1.0.0
Confluent Kafka version: 5.5

$ curl -s http://:/readycheck
Offsets topic has not been consumed yet

Can you please help on fixing this issue?

Kumartg commented 4 years ago

@weeco - Can you please help here?

weeco commented 4 years ago

@Kumartg As described above, I checked the code multiple times and the logic is correct. I can only imagine that this is a bug in sarama, or that it can't consume the last message of a partition for some other reason. Did you try running v1.0.1 (recently released), which contains a newer version of sarama?

I am happy to debug this if I can reproduce the issue or if I get remote access to a Kafka cluster which is affected by it, though.

ut0mt8 commented 4 years ago

With so many reports of people telling you that there are issues consuming the __consumer_offsets topic, can you still argue that this design choice is good? At least you could provide both methods and add a flag.

ZVilusinsky commented 4 years ago

Postmortem from my side. We never managed to solve the issue despite trying many different approaches on the Kafka side, so in the end our team is writing an in-house tool to monitor topics. One final note: the limited throughput I was experiencing was the result of the DEBUG log level. See the included graph: the right side is without debug logging, the left side was started with debug logging. [screenshot]

weeco commented 4 years ago

@ZVilusinsky Thanks for sharing. It looks a bit weird that the messages/second throughput increases over time.

Before giving up / writing your own exporter, you could give https://github.com/cloudhut/kowl a try and see whether we can figure out something about your affected partition IDs which cannot be consumed. Do you mind giving this a try? There we can see the low & high watermarks and also consume the messages on the consumer offsets partitions.

Also, you could give https://github.com/linkedin/Burrow a spin, which works similarly. You might run into the same issues though; if you do, this is likely an issue with sarama or your Kafka cluster.

shyimo commented 4 years ago

@weeco Interesting fact we just found in our investigation of Kafka Minion: it seems like this problem only occurs when the __consumer_offsets topic is extremely unbalanced. In one of our clusters we have 108,958,210 messages in this topic (which has 50 partitions), and 108,482,731 of them are on the same partition. We are still investigating why, but it looks like this is the problem in our case.

ZVilusinsky commented 4 years ago

@weeco They have already written a base :) But I'll see if I can find something with Kowl after I return from vacation. As for Burrow, we dropped it in favor of Minion because it didn't provide lag per topic.

BTW: I noticed the new release and tried it; no luck.

ut0mt8 commented 4 years ago

@ZVilusinsky There are a lot of other exporters that provide lag per partition. You can try mine or others which are not based on consuming __consumer_offsets. The most famous is kafka_exporter. My attempt (which works well at work) is yakle.

Just a side note related to kafka-minion: we still run the exporter in parallel, and we see the same kind of pattern on some topics, but not all topics.

ZVilusinsky commented 4 years ago

@weeco I forwarded @shyimo's observation to our Kafka team and they think that it's our problem too. To quote: "most of the traffic is on one topic and thus fills mostly one partition of the [__consumer_offset] topic".

godwhoa commented 4 years ago

For what it's worth, __consumer_offsets's partition assignment is done by hash(group.id) % total_partitions_of_offsets_topic, so a consumer group's offsets will always end up in a single partition even if it is committing for many topics. If you have consumer groups with very different commit frequency you will end up with unbalanced __consumer_offsets partitions.

https://github.com/apache/kafka/blob/87bef1d1b0fe1502be2335524990aad94bfd4bae/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L189
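
A small sketch of that rule (the group name below is just an example; the hash mirrors Java's String.hashCode, which matches the Go loop below for ASCII group names):

```go
package main

import "fmt"

// javaStringHashCode mirrors java.lang.String#hashCode (h = 31*h + c).
// Iterating runes matches Java's UTF-16 chars for ASCII group names.
func javaStringHashCode(s string) int32 {
	var h int32
	for _, c := range s {
		h = 31*h + int32(c)
	}
	return h
}

// offsetsPartitionFor reproduces the assignment described above:
// abs(groupId.hashCode) % numPartitions, where Kafka's Utils.abs(n)
// is n & 0x7fffffff.
func offsetsPartitionFor(groupID string, numPartitions int32) int32 {
	return (javaStringHashCode(groupID) & 0x7fffffff) % numPartitions
}

func main() {
	// 50 is the default offsets.topic.num.partitions.
	fmt.Println(offsetsPartitionFor("example-group", 50))
}
```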

mtrienis commented 4 years ago

I ran into this issue and discovered that you need to double-check that log.cleaner.enable is NOT set to false. If it is, the __consumer_offsets topic will NOT delete old keys, because it is never compacted. This is likely why it takes Kafka Minion such a long time to initialize on a long-running Kafka cluster.

You'll probably want to double-check that this configuration option is enabled on ALL Kafka brokers. Note that the option is only applied after you restart the broker.
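
One way to verify the setting is to describe the broker config and check the reported value. A sketch using the sarama admin client; the broker address and broker ID are placeholders, and the request has to be repeated for every broker:

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0

	// Placeholder broker address.
	admin, err := sarama.NewClusterAdmin([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	entries, err := admin.DescribeConfig(sarama.ConfigResource{
		Type:        sarama.BrokerResource,
		Name:        "1", // broker ID as a string; repeat for every broker
		ConfigNames: []string{"log.cleaner.enable"},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		// Must report "true" on every broker, or __consumer_offsets is never compacted.
		fmt.Printf("%s = %s\n", e.Name, e.Value)
	}
}
```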

weeco commented 3 years ago

I just pushed all the code for v2.0.0. This version uses a different Kafka client and a slightly different approach for making sure the consumer offsets topic is completely consumed. If this still doesn't work for you (which I'd like to know), I've added an option to use the adminApi (which is the new default by the way) to scrape consumer group offsets.

Reference config: https://github.com/cloudhut/kminion/blob/master/docs/reference-config.yaml#L53

Feedback is appreciated. I'll close the issue, though, because I suspect it is fixed.