Open Tiihott opened 4 months ago
Reproduced PriorityParseException and NullPointerException in local tests and traced their underlying causes. They appear to be separate issues from the RuntimeException. Fixes for PriorityParseException and NullPointerException are in PR #35.
Moved NullPointerException handling to record processing, so that Kafka consumer offset control is not affected by null record handling. In other words, null records are consumed and their offsets are committed, but they are not processed or stored to HDFS.
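The null record handling described above can be sketched roughly as follows. This is only an illustration, not the code from PR #35: `Rec`, `BatchResult` and `process` are hypothetical names, and a plain in-memory list stands in for the batch returned by the Kafka consumer. The point is that the offset advances for every consumed record, while only non-null payloads are passed on for storage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullRecordSketch {
    // Hypothetical stand-in for a consumed Kafka record: an offset and a payload that may be null.
    static final class Rec {
        final long offset;
        final byte[] value;

        Rec(long offset, byte[] value) {
            this.offset = offset;
            this.value = value;
        }
    }

    // Hypothetical result of handling one batch.
    static final class BatchResult {
        final List<Rec> toStore = new ArrayList<>(); // records that would be written to HDFS
        long nextOffsetToCommit = -1L;               // offset to commit after the batch, -1 if the batch was empty
    }

    static BatchResult process(List<Rec> batch) {
        BatchResult result = new BatchResult();
        for (Rec rec : batch) {
            if (rec.value != null) {
                // Only records with a payload are processed and stored.
                result.toStore.add(rec);
            }
            // The offset advances for every consumed record, null or not,
            // so a null record is never re-read after a restart or rebalance.
            result.nextOffsetToCommit = rec.offset + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        BatchResult result = process(Arrays.asList(
                new Rec(0, new byte[] {1}),
                new Rec(1, null),
                new Rec(2, new byte[] {2})));
        System.out.println("stored=" + result.toStore.size()
                + " nextOffsetToCommit=" + result.nextOffsetToCommit);
    }
}
```

Keeping the null check inside record processing, rather than in the consumer loop, means the offset bookkeeping stays the same whether or not a record is skipped.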
The RuntimeException was traced to an incomplete user mapping for Kerberos, which allowed the 18.717496 file to be created but caused writing of the file contents to fail. Because the write failed with an exception, the offsets of the Kafka records being processed were not committed. When the first consumer failed, the consumer group rebalanced and the second consumer thread tried to consume and process the same records again, but this time it failed earlier, at the HDFS file creation stage, because an empty file with the same name already existed in HDFS.
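The failure sequence above can be replayed in miniature. The sketch below is an assumption-laden illustration, not the project's code: a local temporary directory stands in for HDFS, the `writeAllowed` flag simulates the incomplete Kerberos user mapping, and `storeBatch`, `attempt` and `replay` are hypothetical names. The first attempt creates the file and then fails at the write, so nothing is committed; the retry fails at file creation because the empty file is already there.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DuplicateFileSketch {
    // Creates the file, then writes to it. Reaching the end means offsets would be committed.
    static void storeBatch(Path dir, String fileName, boolean writeAllowed) throws IOException {
        Path target = dir.resolve(fileName);
        // Throws FileAlreadyExistsException if a file with this name already exists.
        Files.createFile(target);
        if (!writeAllowed) {
            // Simulates the write failing after creation succeeded (assumed stand-in for the Kerberos issue).
            throw new IOException("simulated write failure");
        }
        Files.write(target, new byte[] {1, 2, 3});
    }

    // Reports how a single consumer's attempt ends.
    static String attempt(Path dir, String fileName, boolean writeAllowed) {
        try {
            storeBatch(dir, fileName, writeAllowed);
            return "committed";
        } catch (FileAlreadyExistsException e) {
            return "file already exists";
        } catch (IOException e) {
            return "write failed";
        }
    }

    // First consumer fails at the write; second consumer retries the same records.
    static String[] replay() {
        try {
            Path dir = Files.createTempDirectory("hdfs-standin");
            String first = attempt(dir, "18.717496", false);
            String second = attempt(dir, "18.717496", true);
            return new String[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] outcomes = replay();
        System.out.println("first consumer: " + outcomes[0]);
        System.out.println("second consumer: " + outcomes[1]);
    }
}
```

Note that the second attempt fails even though its write would have succeeded, which matches the observation that the retry never gets past file creation.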
Describe the bug
When ingesting records from Kafka (not the mock Kafka consumer), the same set of records is consumed twice, which causes an exception when the same file is stored to HDFS a second time.
Expected behavior
Records should be consumed only once from Kafka, so that the same AVRO file and set of records are stored to HDFS only once.
How to reproduce
Use the default configuration with kerberized HDFS access and the mock Kafka consumer disabled.
Software version
beta 0.2.0
Additional context