teragrep / cfe_39

HDFS Data Ingestion for PTH_06 use
GNU Affero General Public License v3.0
0 stars 3 forks source link

Implement kafka ConsumerRebalanceListener interface in a way that it allows tracking of partition offsets inside a batch. #42

Open Tiihott opened 2 weeks ago

Tiihott commented 2 weeks ago

Description By registering a ConsumerRebalanceListener to a consumer in the consumer group, the listener can be used for tracking the record offsets inside a batch. The listener must be initialized after consumer has been initialized but before the consumer has subscribed to a topic. The listener object must also be passed to BatchDistributionImpl object for when it starts processing the record batch, allowing the listener to track the offsets that have been stored to HDFS. Once a kafka rebalance happens, the listener is used to clean up any remaining records from previous batches and store them to HDFS. Offset commits are also updated during this process, so the new consumer that has been assigned to the topic after rebalance knows where to start reading the topic.

Tiihott commented 2 weeks ago

In this use-case another more practical way to achieve the proper tracking of consumed offsets is to leverage the idempotent consumer implementation, which stores the consumed offset data to the HDFS filenames. Thus it would be possible to implement the listener in this way solving the issue:

  1. During rebalance the listener's onPartitionsRevoked(Collection partitions) method will first flush the consumer's cache of intermediate results to HDFS and clean up any local files in preparation of new partition assignment.
  2. Afterwards onPartitionsAssigned(Collection partitions) method will source the latest offsets from HDFS where the records were stored using the same method that idempotent consumer implementation is using.
Tiihott commented 6 days ago

Solved in beta branch PR #41