Closed: dyerrington closed this issue 3 years ago
If anyone else is searching for this from the web, I found this particularly helpful: https://stackoverflow.com/questions/61207633/how-to-process-a-kafka-message-for-a-long-time-4-60-mins-without-auto-commit
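The approach from that answer, pause the partitions and keep polling while a long job runs so the consumer stays in the group, can be sketched roughly like this with kafka-python. This is an untested sketch: the `handle_slowly` helper name and the synchronous `process_fn` callable are my own illustration, not from this thread.

```python
import threading
import time

def handle_slowly(consumer, record, process_fn, poll_every=5.0):
    """Process one record on a worker thread while the paused consumer
    keeps calling poll(), so it is not evicted from the group
    (pause/resume pattern from the linked answer; names illustrative)."""
    parts = consumer.assignment()
    consumer.pause(*parts)            # poll() now returns no records
    worker = threading.Thread(target=process_fn, args=(record,))
    worker.start()
    while worker.is_alive():
        consumer.poll(timeout_ms=0)   # keep-alive poll; empty while paused
        worker.join(timeout=poll_every)
    consumer.resume(*parts)
    consumer.commit()                 # commit only after the work finished
```

With a real `kafka.KafkaConsumer` you would call this once per record inside your poll loop; `poll_every` just needs to be well under `max_poll_interval_ms`.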
The average amount of time it takes for a message to be processed (document processing) is about 15 seconds. I can process about 5 messages before it throws this error. Even after increasing max_poll_interval_ms to 10 minutes, I still have this problem. I'll post any updates, but perhaps this is more to do with the consumer than the worker?
Hi @dyerrington,
kq is a thin wrapper on top of kafka-python. You might have better luck bringing this up with them. Thanks.
Hi @dyerrington,
Just checking if this is still an issue. Any luck?
Thank you for pinging for follow-up @joowani. After a weeks-long investigation, I've determined that the Dockerized version of Kafka I'm using differs in some way from the version I installed manually in my development environment. Eventually I learned that setting max_poll_interval_ms to 2x my maximum document processing time, on the consumer side of things, was the right fix.
Originally I was looking at this setting from the server side rather than the worker/consumer side of the problem. Since making this adjustment in my production systems yesterday, the problem hasn't turned up again.
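For anyone searching later, the sizing rule described above can be sketched as a kafka-python consumer configuration. This is a minimal sketch: the broker address, topic name, and `consumer_settings` helper are my own placeholders, not from this thread.

```python
def consumer_settings(max_processing_seconds):
    """Size max_poll_interval_ms to 2x the longest expected
    per-message processing time, as described above."""
    return {
        "bootstrap_servers": "localhost:9092",  # placeholder broker
        "enable_auto_commit": False,            # commit manually after processing
        "max_poll_interval_ms": 2 * max_processing_seconds * 1000,
        "max_poll_records": 1,                  # hand over one document at a time
    }

# ~15 s per document -> allow up to 30 s between polls:
settings = consumer_settings(15)
# consumer = kafka.KafkaConsumer("documents", **settings)  # assumed topic name
```

These are all standard kafka-python consumer parameters; the key point is that max_poll_interval_ms is a consumer-side setting, not a broker-side one.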
For the record, and if it helps anyone else in the future, I'm using @wurstmeister's compose setup in production and it works great: https://github.com/wurstmeister/kafka-docker
Awesome. Glad you got it resolved! Closing.
I have a vanilla Kafka setup in Docker, and a really basic worker connected to it. The problem I find is that after 6-8 messages are processed, this message pops up in the logs:
I've played with max_poll_interval_ms and max_poll_records, but no matter how I set them, the problem always comes back. Any help appreciated!