Closed jbvmio closed 6 years ago
Please try restarting the controller if you have not already.
@xhl1988 - The controller has been restarted, stopped and failed over multiple times.
Have you also tried to restart all workers?
Meanwhile: 1. Do you see any warn/error logs on the controller other than the "doesn't match" log?
2. When kaf04-wrkr1 is back, do you see any abnormal logs?
@xhl1988 - Yes, all worker and controller services were shut down and restarted during troubleshooting. I am not seeing any abnormalities in the logs. However, the logs have since rolled over, so I am not able to dig any deeper.
I was ultimately able to resolve the issue by stopping the worker on kaf04, deleting the worker instances in Helix, and then restarting the worker.
After the procedure above, validation looks good:
ExternalView:
kaf01-wrkr1: 396
kaf02-wrkr1: 396
kaf03-wrkr1: 396
kaf04-wrkr1: 396
kaf05-wrkr1: 396
IdealState:
kaf01-wrkr1: 396
kaf02-wrkr1: 396
kaf03-wrkr1: 396
kaf04-wrkr1: 396
kaf05-wrkr1: 396
numErrorTopicPartitions: 0
numErrorTopics: 0
numOfflineTopicPartitions: 0
numOnlineTopicPartitions: 1980
numTopicPartitions: 1980
numTopics: 198
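For anyone who wants to script this check, here is a minimal sketch of the comparison the validation output above implies: per-worker counts should match between IdealState and ExternalView, and the error/offline counters should be zero. The function name `check_balance` is made up for illustration, and the numbers are copied by hand from the output above rather than fetched from the controller. Treating the ExternalView total as equal to `numOnlineTopicPartitions` is an assumption drawn only from this output (5 x 396 = 1980).

```python
def check_balance(ideal_state, external_view, summary):
    """Return a list of problems; an empty list means the two views agree."""
    problems = []
    # Compare per-worker partition counts across both views.
    for worker in sorted(set(ideal_state) | set(external_view)):
        ideal = ideal_state.get(worker, 0)
        actual = external_view.get(worker, 0)
        if ideal != actual:
            problems.append(f"{worker}: IdealState={ideal}, ExternalView={actual}")
    # Assumption: the ExternalView total should equal the online partition count.
    if sum(external_view.values()) != summary["numOnlineTopicPartitions"]:
        problems.append("ExternalView total does not match numOnlineTopicPartitions")
    # Error and offline counters should be zero in a healthy cluster.
    for key in ("numErrorTopicPartitions", "numOfflineTopicPartitions"):
        if summary[key] != 0:
            problems.append(f"{key} is {summary[key]}, expected 0")
    return problems


# Values copied from the validation output in this comment.
workers = [f"kaf0{i}-wrkr1" for i in range(1, 6)]
ideal_state = {w: 396 for w in workers}
external_view = {w: 396 for w in workers}
summary = {
    "numErrorTopicPartitions": 0,
    "numOfflineTopicPartitions": 0,
    "numOnlineTopicPartitions": 1980,
    "numTopicPartitions": 1980,
}

print(check_balance(ideal_state, external_view, summary) or "balanced")
```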
The recovery procedure seems pretty involved, and I still think uReplicator should self-recover.
TIA -
Closing issue for now.
How can we restore the idealState in the event of a worker failure? We had a worker go down a few days ago, and since then the cluster has been experiencing issues, even after the worker came back online. The ExternalView differs greatly from the IdealState, and partitions that were assigned to that worker are missing.
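To make the symptom concrete, here is a rough sketch of how the gap between the two views could be computed for one topic. It assumes each view has been reduced to a dict of partition -> {instance: state}, which is my understanding of how Helix lays out the map fields of a resource. The function name and the sample data are hypothetical; in particular, the placement of partition 12 on kaf04-wrkr1 is only an illustration.

```python
def missing_partitions(ideal_map, external_map):
    """Map each partition to the instances that IdealState expects
    but that are not ONLINE for it in ExternalView."""
    missing = {}
    for partition, ideal_assignment in ideal_map.items():
        # Instances that ExternalView reports as ONLINE for this partition.
        live = {
            instance
            for instance, state in external_map.get(partition, {}).items()
            if state == "ONLINE"
        }
        lost = sorted(set(ideal_assignment) - live)
        if lost:
            missing[partition] = lost
    return missing


# Hypothetical sample data: partition 12 is absent from ExternalView.
ideal = {
    "11": {"kaf03-wrkr1": "ONLINE"},
    "12": {"kaf04-wrkr1": "ONLINE"},
}
external = {
    "11": {"kaf03-wrkr1": "ONLINE"},
}

print(missing_partitions(ideal, external))  # {'12': ['kaf04-wrkr1']}
```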
Here is validation view:
Here is idealState and externalView (You can see here that partition12 is missing from the ExternalView):
From ZK, if you look at the topics listed under: zk ls /ureplicator/cluster/INSTANCES/kaf04-wrkr1/CURRENTSTATES/3003522eeb78c1a
only 14 of the 196 topics are listed.
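The same ZK check can be scripted. Below is a minimal sketch that lists the children of the CURRENTSTATES node and reports which expected topics are absent. It assumes a client object with a `get_children(path)` method, which is what kazoo's `KazooClient` provides. The helper names, the in-memory stand-in client, and the topic names are all hypothetical.

```python
def list_current_state_topics(zk, instance, session_id, root="/ureplicator/cluster"):
    """List the topics registered under an instance's CURRENTSTATES session node."""
    path = f"{root}/INSTANCES/{instance}/CURRENTSTATES/{session_id}"
    return sorted(zk.get_children(path))


def missing_topics(expected_topics, listed_topics):
    """Topics that are expected but not present in the listing."""
    return sorted(set(expected_topics) - set(listed_topics))


class InMemoryZk:
    """Stand-in for a real client so the sketch runs without a ZooKeeper server."""

    def __init__(self, tree):
        self.tree = tree

    def get_children(self, path):
        return list(self.tree[path])


# With a real cluster you would build the client with kazoo instead, e.g.:
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="zk-host:2181")  # hypothetical host
#   zk.start()
zk = InMemoryZk({
    "/ureplicator/cluster/INSTANCES/kaf04-wrkr1/CURRENTSTATES/3003522eeb78c1a": [
        "topicB", "topicA",  # hypothetical topic names
    ],
})

listed = list_current_state_topics(zk, "kaf04-wrkr1", "3003522eeb78c1a")
print(len(listed), "topics listed")
print("missing:", missing_topics(["topicA", "topicB", "topicC"], listed))
```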
Perhaps there is something I'm overlooking. If so, my apologies in advance, but I was under the assumption that uReplicator would auto-rebalance or attempt to get things back into the idealState.
Is there any way to force or kick off the process to restore the idealState?
Thx.