opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means or linear regression, to help developers build ML-related features within OpenSearch.
Apache License 2.0

[BUG] Model is getting stuck in deploying state #2970

Open gaurav7830 opened 1 month ago

gaurav7830 commented 1 month ago

What is the bug? A model gets stuck in the DEPLOYING state while it is being registered on the cluster. We have seen cases where the model is not found on some of the nodes.

Scenario

  1. The model is stuck in the DEPLOYING state.
  2. Calling the model undeploy API on the cluster returns the following response:
    {
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "undeployed"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "undeployed"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "undeployed"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    }
    }
  3. Calling the GetModel API still returns the model state as DEPLOYING.
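
For a quick read of a response like the one in step 2, the per-node statuses can be tallied. The following is only a minimal sketch: the class name `UndeployTally` and the `tally` helper are hypothetical and not part of ml-commons, and the status list is copied by hand from the response above rather than parsed from the API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UndeployTally {
    // Hypothetical helper: counts how many nodes reported each status string.
    static Map<String, Integer> tally(List<String> perNodeStatuses) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String status : perNodeStatuses) {
            counts.merge(status, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Statuses transcribed in order from the undeploy response in step 2.
        List<String> statuses = List.of(
            "not_found", "not_found", "undeployed", "not_found", "undeployed",
            "not_found", "not_found", "undeployed", "not_found", "not_found");
        System.out.println(tally(statuses)); // {not_found=7, undeployed=3}
    }
}
```

In the reported response, seven of the ten nodes answer `not_found` and three answer `undeployed`.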

What is the expected behavior? Model should be undeployed.

ylwu-amzn commented 1 month ago

@Zhangxunmt I know you have some suggestions to enhance this part. Please help take a look.

mingshl commented 1 month ago

https://github.com/opensearch-project/ml-commons/pull/2976

zane-neo commented 1 month ago

#2976

This PR removes the remote model auto-redeploy during cluster changes. That does not mean this issue is caused by model auto-redeploy; in fact, the root cause of the model getting stuck in DEPLOYING status is still unknown, since it is very difficult to reproduce. The real solution is to support undeploying a model while its status is DEPLOYING, which will be implemented very soon. Users can then undeploy the model and redeploy it to mitigate the pain.

zane-neo commented 1 month ago

The root cause is as follows. When deploying a model, the manager node sends the deploy request to all eligible nodes in the cluster, but a node can crash at any moment. If a node crashes right after the getEligibleNodes method has run, it never sends a deploy response to the manager node. The pending worker node count therefore never counts down to 0, so the model status is never updated and stays at DEPLOYING.
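
The failure mode described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the actual ml-commons code: the class `DeployCountdownSketch`, the `simulateDeploy` method, the `ModelState` enum, and the node names are all hypothetical. It assumes the manager tracks one pending response per eligible node and flips the state only when that counter reaches 0.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class DeployCountdownSketch {
    enum ModelState { DEPLOYING, DEPLOYED }

    // Hypothetical model of the manager node: it expects one response per
    // eligible node and marks the model DEPLOYED only when the pending
    // counter reaches 0.
    static ModelState simulateDeploy(List<String> eligibleNodes, Set<String> crashedNodes) {
        AtomicInteger pending = new AtomicInteger(eligibleNodes.size());
        ModelState state = ModelState.DEPLOYING;
        for (String node : eligibleNodes) {
            if (crashedNodes.contains(node)) {
                continue; // a crashed node never sends its deploy response
            }
            if (pending.decrementAndGet() == 0) {
                state = ModelState.DEPLOYED;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("data-1", "data-2");
        // All nodes respond: the counter reaches 0.
        System.out.println(simulateDeploy(nodes, Set.of()));         // DEPLOYED
        // One node crashes after the eligible list was computed: the counter stays at 1.
        System.out.println(simulateDeploy(nodes, Set.of("data-1"))); // DEPLOYING
    }
}
```

In this model, a single missing response is enough to leave the state at DEPLOYING indefinitely, because nothing else ever decrements the counter.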

To reproduce this issue, you need a small cluster with at least 3 nodes: one manager node and the rest data nodes. Start the manager node and one data node first, then create a model and deploy it. Next, start another data node and add a debug breakpoint to the deploy transport action on the manager node (after it has retrieved all eligible nodes). When the breakpoint triggers, shut down the first data node and resume execution. You will then see the model stay in DEPLOYING status.

zane-neo commented 1 month ago

@rbhavna Can you update the solution details that will be used to fix this?