Open gaurav7830 opened 1 month ago
@Zhangxunmt I know you have some suggestions for enhancing this part. Please take a look.
2976
This PR removes remote model auto-redeploy on cluster changes. That does not mean this issue is caused by model auto-redeploy. In fact, the root cause of the model getting stuck in the deploying status is still unknown, since it is very difficult to reproduce. The real solution is to support undeploying a model while its status is deploying, which will be implemented very soon. Users can then undeploy the model and redeploy it to mitigate the pain.
The root cause is that when deploying the model, the manager node sends the deploy request to all eligible nodes in the cluster, but a node can crash at any moment. If a node crashes right after the getEligibleNodes method has run, it never sends a deploy response to the manager node. The worker node count then never counts down to 0, so the model status is never updated and stays at deploying.
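The countdown behaviour described above can be illustrated with a small standalone sketch. This is not the ml-commons code: the class `DeployCountdownSketch`, the `deploy` method, the `ModelState` enum, and the node names are all hypothetical, and the network round trip is replaced by a simple loop. It only shows why one missing response leaves the status at deploying.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation of the manager-side response countdown.
public class DeployCountdownSketch {
    enum ModelState { DEPLOYING, DEPLOYED }

    // eligibleNodes: the node list captured before the requests are sent.
    // crashedNodes: nodes that went down after that list was captured.
    static ModelState deploy(List<String> eligibleNodes, Set<String> crashedNodes) {
        // The manager expects exactly one response per eligible node.
        AtomicInteger pending = new AtomicInteger(eligibleNodes.size());
        ModelState state = ModelState.DEPLOYING;
        for (String node : eligibleNodes) {
            if (crashedNodes.contains(node)) {
                continue; // a crashed node never sends its deploy response
            }
            // The status only changes once every expected response has arrived.
            if (pending.decrementAndGet() == 0) {
                state = ModelState.DEPLOYED;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("data-1", "data-2");
        System.out.println(deploy(nodes, Set.of()));          // prints DEPLOYED
        System.out.println(deploy(nodes, Set.of("data-1")));  // prints DEPLOYING
    }
}
```

In the second call the counter stops at 1 because "data-1" never answers, so nothing ever moves the status off deploying.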
To reproduce this issue, you need a small cluster with at least 3 nodes: one manager node and the rest data nodes. Start the manager node and one data node first, then create a model and deploy it. Next, start another data node and add a debug breakpoint in the deploy transport action on the manager node (after it gets all eligible nodes). When the breakpoint is hit, shut down the first data node and resume execution. You'll then see the model stay in the deploying status.
@rbhavna Can you update the solution details that will be used to fix this?
What is the bug? The model gets stuck in the deploying state while it is being registered on the cluster. We have seen cases where the model is not found on a few nodes.
Scenario
What is the expected behavior? Model should be undeployed.