Closed deric closed 7 years ago
Is there still an executor running on the slave (s8 in this case) for that task? Otherwise Mesos shouldn't have it in STAGING anymore (it should have transitioned to TASK_LOST or TASK_FAILED).
Also, what versions of kafka-mesos and Mesos are you running? It looks like an older version of the framework. 0.10.0 might handle this, but at a quick glance it might not handle TASK_STAGING for an unknown broker either (relevant code here).
Yeah, you're right, Mesos should remove the task. The Mesos agent on that node was restarted and the old task was not removed. We're running on:
mesos 1.1.0-2.0.107.debian81
kafka 0.9.0.1
kafka-mesos 0.9.5.1
We're using the latest stable version. The relevant code doesn't seem to handle such a case.
After restarting the Mesos master, the STAGING task was removed (we should probably leave this to Mesos). I've restarted the kafka scheduler, but it's still in quite a strange state. Any attempt to rebalance fails:
Error: java.io.IOException: 400 - rebalance is already running
Is it possible to fix this without shutting down the whole kafka cluster?
You'll need to figure out why your rebalance is stuck (check out the kafka controller's logs).
If all else fails, you can delete the rebalance znode in zookeeper and restart whichever node is the current controller to force it to stop.
Ok, thanks. I ended up deleting the znode:
delete /kafka/admin/reassign_partitions
and now rebalance is working again. I'm closing this issue because the case should be handled by the Mesos master.
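For anyone who would rather look at the pending reassignment before deleting it, here is a minimal Python sketch of the same cleanup. It assumes a ZooKeeper client object with kazoo-style `exists`, `get` and `delete` methods; the function name `clear_stuck_reassignment` and the `dry_run` flag are made up for illustration, and the znode path is the one from this thread (it depends on your chroot).

```python
import json

# Path used in this thread; it depends on the ZooKeeper chroot of your cluster.
REASSIGN_PATH = "/kafka/admin/reassign_partitions"


def clear_stuck_reassignment(zk, path=REASSIGN_PATH, dry_run=True):
    """Return the pending reassignment as parsed JSON, or None if there is none.

    `zk` is assumed to expose kazoo-style methods:
      exists(path) -> stat or None
      get(path)    -> (data_bytes, stat)
      delete(path)
    The znode is only deleted when dry_run is False.
    """
    if zk.exists(path) is None:
        return None
    data, _stat = zk.get(path)
    pending = json.loads(data.decode("utf-8")) if data else {}
    if not dry_run:
        zk.delete(path)
    return pending


# Hypothetical usage with kazoo (not imported here, host name is a placeholder):
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="zk-host:2181"); zk.start()
#   print(clear_stuck_reassignment(zk))            # inspect only
#   clear_stuck_reassignment(zk, dry_run=False)    # actually delete
#   zk.stop()
```

As noted above, the current controller may still need a restart after the znode is removed.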
After one broker crashed I had to manually remove the broker and add it to the cluster again. However, mesos-kafka still thinks that the task is STAGING:
After restarting the scheduler there's still a zombie kafka broker:
Is it possible to remove (or try restarting) such a stalled task? Perhaps after some timeout.
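To make the timeout idea concrete, here is a rough Python sketch of how stalled tasks could be picked out. This is not how kafka-mesos works; `TaskInfo`, `find_stalled_tasks` and the 300 second default are all hypothetical, and what to do with the result (reconcile with Mesos, kill, or restart the broker) is left open.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    task_id: str
    state: str          # e.g. "TASK_STAGING", "TASK_RUNNING"
    last_update: float  # time of the last status update, seconds since epoch


def find_stalled_tasks(tasks, now, timeout_s=300.0):
    """Return IDs of tasks that have been in TASK_STAGING for longer than timeout_s."""
    return [
        t.task_id
        for t in tasks
        if t.state == "TASK_STAGING" and now - t.last_update > timeout_s
    ]
```

A scheduler could run a check like this periodically and act only on the returned IDs, so tasks that are staging normally are left alone.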