ntent / kafka4net

C# client for Kafka
Apache License 2.0

Producer stops working after manual partition reassignment #27

Closed vikmv closed 8 years ago

vikmv commented 8 years ago

I noticed that the Kafka producer stops working after running "kafka-reassign-partitions".

Here is an excerpt from the debug log:

Sending ProduceRequest to Connection: <broker1>, Request: Broker: <broker1> Id:1 Acks: 1 Timeout: 1000 Topics: [Topic: <topicname> Parts: [Part: 1 Messages: 1]] | 
Got ProduceResponse: Topics: [<topicname> [Part: 1, Err: NoError Offset: 16122593985]] | 

After the reassignment ran, the partition leader changed, but the producer did not refresh its metadata and kept trying to send messages to the old broker, which was no longer valid for this partition:

Sending ProduceRequest to Connection: <broker1>, Request: Broker: <broker1> Id:1 Acks: 1 Timeout: 1000 Topics: [Topic: <topicname> Parts: [Part: 1 Messages: 1]] | 
Got ProduceResponse: Topics: [<topicname> [Part: 1, Err: UnknownTopicOrPartition Offset: -1]] |
(the same request and response then repeat)

I'll try to reproduce it with a unit test and think about a fix. It seems the fix would be to change the IsPermanentFailure method so that it treats "UnknownTopicOrPartition" as a recoverable error, but I'm not 100% sure yet. What was the reason for treating UnknownTopicOrPartition as a permanent error (while "NotLeaderForPartition" is not permanent)?

I believe UnknownTopicOrPartition and NotLeaderForPartition should be assigned to the same category, because after partition reassignment the following situations can occur:

vchekan commented 8 years ago

Hmm, UnknownTopicOrPartition can be caused by a Kafka configuration where automatic topic creation is disabled. In that case, interpreting it as a recoverable error is undesirable. I do expect NotLeaderForPartition after a partition reassignment, but getting UnknownTopicOrPartition is strange.

There is a test, ConsumerFollowsRebalancingPartitions, but it seems the producer is not tested under this condition...
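
One way to reconcile the two concerns above (recover after a reassignment, but still fail when the topic really does not exist) is to retry with a metadata refresh a bounded number of times. The sketch below is illustrative only and is not kafka4net code: it is written in Python for brevity, and `send_with_retry`, `make_send_stub`, `PermanentFailure`, and the `max_attempts` limit are all hypothetical names and values chosen for the example.

```python
# Hypothetical sketch: bounded retry with a leader refresh between attempts.
# Error names mirror the ones in the log excerpt; nothing here is kafka4net API.

RECOVERABLE = {"NotLeaderForPartition", "UnknownTopicOrPartition"}


class PermanentFailure(Exception):
    """Raised when the error is not recoverable or retries are exhausted."""


def send_with_retry(send, refresh_leader, leader, max_attempts=3):
    """Send to `leader`; on a recoverable error, refresh the leader and retry.

    send(leader) returns an error name; refresh_leader() returns the broker
    that current metadata reports as the partition leader.
    """
    error = None
    for attempt in range(1, max_attempts + 1):
        error = send(leader)
        if error == "NoError":
            return leader, attempt
        if error not in RECOVERABLE:
            raise PermanentFailure(error)
        leader = refresh_leader()  # the stale leader may be the cause
    raise PermanentFailure(f"{error} after {max_attempts} attempts")


def make_send_stub(actual_leader):
    """Stand-in for a cluster: only `actual_leader` accepts the message.

    Passing None simulates a topic that does not exist on any broker.
    """
    def send(leader):
        if actual_leader is not None and leader == actual_leader:
            return "NoError"
        return "UnknownTopicOrPartition"
    return send


if __name__ == "__main__":
    # Partition moved from broker1 to broker2: second attempt succeeds.
    print(send_with_retry(make_send_stub("broker2"), lambda: "broker2", "broker1"))

    # Topic missing everywhere: retries are exhausted and the send fails.
    try:
        send_with_retry(make_send_stub(None), lambda: "broker1", "broker1")
    except PermanentFailure as exc:
        print("gave up:", exc)
```

With a limit like this, a missing topic still surfaces as a failure, only a few round trips later than it would if the error were treated as permanent on first sight.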

vikmv commented 8 years ago

I'm writing a unit test for this case right now. I'll share it soon.

vikmv commented 8 years ago

I successfully reproduced it with a test and created a pull request (with the test and a fix). According to the Kafka protocol guide, the error UnknownTopicOrPartition (3) is retriable, so it shouldn't be classified as permanent.
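
To make the classification concrete, here is a minimal sketch of a permanent-versus-retriable check keyed on protocol error codes. It is not the code from the pull request and not kafka4net's actual IsPermanentFailure; it is written in Python for brevity, and `ErrorCode`, `RETRIABLE`, and `is_permanent_failure` are hypothetical names. The numeric codes and their retriable flags follow the error table in the Kafka protocol guide; only a handful are listed.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    """A small subset of Kafka protocol error codes (hypothetical enum name)."""
    NO_ERROR = 0
    UNKNOWN_TOPIC_OR_PARTITION = 3
    LEADER_NOT_AVAILABLE = 5
    NOT_LEADER_FOR_PARTITION = 6
    MESSAGE_TOO_LARGE = 10


# Codes the protocol guide marks as retriable (subset, for illustration).
RETRIABLE = frozenset({
    ErrorCode.UNKNOWN_TOPIC_OR_PARTITION,
    ErrorCode.LEADER_NOT_AVAILABLE,
    ErrorCode.NOT_LEADER_FOR_PARTITION,
})


def is_permanent_failure(code: ErrorCode) -> bool:
    """True if retrying (after a metadata refresh) cannot be expected to help."""
    return code != ErrorCode.NO_ERROR and code not in RETRIABLE


if __name__ == "__main__":
    for code in ErrorCode:
        print(f"{code.name} ({code.value}): permanent={is_permanent_failure(code)}")
```

Under this grouping, UnknownTopicOrPartition and NotLeaderForPartition fall into the same recoverable category, while an error such as an oversized message stays permanent.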

vchekan commented 8 years ago

Thanks Viktor, it looks good at first glance. I'll take a closer look and will commit it in a couple of days.

vikmv commented 8 years ago

Fixed in #28