uber / uReplicator

Improvement of Apache Kafka Mirrormaker
Apache License 2.0

Restore IdealState and Rebalance #153

Closed jbvmio closed 6 years ago

jbvmio commented 6 years ago

How can we restore the idealState in the event of a worker failure? We had a worker go down a few days ago, and since then the cluster seems to have been experiencing issues, even after the worker came back online. The External View differs greatly from the Ideal State, and partitions that had been assigned to that worker are missing.

Here is the validation view:

ExternalView:
  kaf01-wrkr1: 392
  kaf02-wrkr1: 392
  kaf03-wrkr1: 392
  kaf04-wrkr1: 39
  kaf05-wrkr1: 392
IdealState:
  kaf01-wrkr1: 392
  kaf02-wrkr1: 392
  kaf03-wrkr1: 392
  kaf04-wrkr1: 392
  kaf05-wrkr1: 392
numErrorTopicPartitions: 0
numErrorTopics: 138
numOfflineTopicPartitions: 39
numOnlineTopicPartitions: 1568
numTopicPartitions: 1960
numTopics: 196

Here are the idealState and externalView (you can see here that partition 12 is missing from the externalView):

  partition | idealState
  0              | ["kaf05-wrkr1"]
  1              | ["kaf01-wrkr1"]
  10             | ["kaf02-wrkr1"]
  11             | ["kaf01-wrkr1"]
  12             | ["kaf04-wrkr1"]
  13             | ["kaf03-wrkr1"]
  14             | ["kaf03-wrkr1"]
  2              | ["kaf05-wrkr1"]
  3              | ["kaf03-wrkr1"]
  4              | ["kaf03-wrkr1"]
  5              | ["kaf01-wrkr1"]
  6              | ["kaf05-wrkr1"]
  7              | ["kaf05-wrkr1"]
  8              | ["kaf03-wrkr1"]
  9              | ["kaf02-wrkr1"]

  partition | externalView
  0                | ["kaf05-wrkr1"]
  1                | ["kaf01-wrkr1"]
  10               | ["kaf02-wrkr1"]
  11               | ["kaf01-wrkr1"]
  13               | ["kaf03-wrkr1"]
  14               | ["kaf03-wrkr1"]
  2                | ["kaf05-wrkr1"]
  3                | ["kaf03-wrkr1"]
  4                | ["kaf03-wrkr1"]
  5                | ["kaf01-wrkr1"]
  6                | ["kaf05-wrkr1"]
  7                | ["kaf05-wrkr1"]
  8                | ["kaf03-wrkr1"]
  9                | ["kaf02-wrkr1"]
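
For reference, a quick way to spot which partitions are missing from the externalView is to diff the partition columns of the two tables; the file names here are just placeholders for copies of the output above:

  # first column of each table is the partition id
  awk '{print $1}' idealState.txt   | sort > ideal.partitions
  awk '{print $1}' externalView.txt | sort > external.partitions

  # partitions present in the idealState but absent from the externalView (prints 12 here)
  comm -23 ideal.partitions external.partitions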

From ZK, if you look at the topics listed under zk ls /ureplicator/cluster/INSTANCES/kaf04-wrkr1/CURRENTSTATES/3003522eeb78c1a, only 14 of the 196 topics are listed.
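
For anyone checking the same thing, this is roughly how the two counts can be compared from the ZooKeeper CLI; zkCli.sh is just one way to issue the same ls, the ZK address is a placeholder, and the IDEALSTATES path assumes the standard Helix layout alongside the INSTANCES path above:

  # count how many topics the worker is actually serving (its Helix current state)
  zkCli.sh -server <zk-address:2181> ls /ureplicator/cluster/INSTANCES/kaf04-wrkr1/CURRENTSTATES/3003522eeb78c1a

  # compare against the full topic list in the ideal state (one znode per topic)
  zkCli.sh -server <zk-address:2181> ls /ureplicator/cluster/IDEALSTATES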

Perhaps there is something I'm overlooking. If so, my apologies in advance, but I was under the assumption that uReplicator would auto-rebalance or attempt to get things back into the idealState.

Is there any way to force or kick off the process to restore the idealState?

Thx.

xhl1988 commented 6 years ago

Please try restarting the controller if you have not already.

jbvmio commented 6 years ago

@xhl1988 - The controller has been restarted, stopped and failed over multiple times.

xhl1988 commented 6 years ago

Have you also tried restarting all of the workers?

Meanwhile:

1. Do you see any warn/error logs on the controller other than the "doesn't match" log?
2. When kaf04-wrkr1 came back, were there any abnormal logs?

jbvmio commented 6 years ago

@xhl1988 - Yes, all worker and controller services were shut down and restarted during troubleshooting. I am not seeing any abnormalities in the logs; however, the logs have since rolled over, so I am not able to comb through them any further.

I was ultimately able to resolve the issue by stopping the worker on kaf04, deleting the worker instance in Helix, and then restarting the worker.
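
In case someone hits the same thing, the procedure above looks roughly like this with the stock Helix admin script (--dropNode, if I recall the flag correctly); the ZK address and cluster name are placeholders, and the instance name is the one from the ZK path earlier:

  # 1. stop the worker process on kaf04 (deployment-specific)

  # 2. drop the stale instance from Helix so its old registration is discarded
  helix-admin.sh --zkSvr <zk-address:2181> --dropNode <cluster-name> kaf04-wrkr1

  # 3. start the worker again; it re-registers with Helix and picks up fresh assignments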

After the procedure above, validation looks good:

ExternalView:
  kaf01-wrkr1: 396
  kaf02-wrkr1: 396
  kaf03-wrkr1: 396
  kaf04-wrkr1: 396
  kaf05-wrkr1: 396
IdealState:
  kaf01-wrkr1: 396
  kaf02-wrkr1: 396
  kaf03-wrkr1: 396
  kaf04-wrkr1: 396
  kaf05-wrkr1: 396
numErrorTopicPartitions: 0
numErrorTopics: 0
numOfflineTopicPartitions: 0
numOnlineTopicPartitions: 1980
numTopicPartitions: 1980
numTopics: 198

This seems pretty involved, though; I still think uReplicator should recover on its own.

TIA -

jbvmio commented 6 years ago

Closing issue for now.