opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.
Apache License 2.0

[FEATURE] Support update connector without undeploying the model #2496

Open zane-neo opened 1 month ago

zane-neo commented 1 month ago

Is your feature request related to a problem? Currently, the update connector API checks all usages of the connector and only allows the update when no model is using it. This doesn't seem reasonable, especially for remote models:

  1. When a remote model is deployed, an object is created and put into a map: https://github.com/opensearch-project/ml-commons/blob/main/plugin/src/main/java/org/opensearch/ml/model/MLModelManager.java#L1144
  2. When the connector information changes, both the connector information and the cached model info can be retrieved, so updating the model in the cache should be able to redeploy the model with the new connector info.

What solution would you like? Adding a new URL parameter such as redeploy_model=true would reduce the manual effort of undeploying and redeploying the model.
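A minimal sketch of what the proposed flag could look like. All names here (`ConnectorUpdateSketch`, `updateConnector`, the in-memory maps) are hypothetical and only illustrate the idea; they are not the actual ml-commons API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed redeploy_model flag; names are
// illustrative, not the real ml-commons classes.
public class ConnectorUpdateSketch {

    record Connector(String id, int timeoutMs) {}
    record DeployedModel(String modelId, Connector connector) {}

    // Stands in for the ml-connector index.
    static final Map<String, Connector> connectorIndex = new ConcurrentHashMap<>();
    // Stands in for the deployed-model cache kept by MLModelManager.
    static final Map<String, DeployedModel> modelCache = new ConcurrentHashMap<>();

    // Persist the new connector; when redeployModel is true, refresh every
    // cached model that references it instead of rejecting the update.
    static void updateConnector(Connector updated, boolean redeployModel) {
        boolean inUse = modelCache.values().stream()
            .anyMatch(m -> m.connector().id().equals(updated.id()));
        if (inUse && !redeployModel) {
            throw new IllegalStateException("connector in use; pass redeploy_model=true");
        }
        connectorIndex.put(updated.id(), updated);
        if (redeployModel) {
            modelCache.replaceAll((modelId, m) ->
                m.connector().id().equals(updated.id())
                    ? new DeployedModel(modelId, updated)  // redeploy with new info
                    : m);
        }
    }

    public static void main(String[] args) {
        Connector c1 = new Connector("c1", 1000);
        connectorIndex.put("c1", c1);
        modelCache.put("m1", new DeployedModel("m1", c1));

        updateConnector(new Connector("c1", 5000), true);
        System.out.println(modelCache.get("m1").connector().timeoutMs()); // 5000
    }
}
```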

What alternatives have you considered? Change the default behavior to automatically redeploy the model after the connector is updated.


ylwu-amzn commented 1 month ago

@b4sjoo has built an update connector API that doesn't require redeploying the model for internal connectors. Sicheng, can you help extend this to support standalone connectors too?

Zhangxunmt commented 1 month ago

With auto-deploy for remote models, this can be easily done as follows:

  1. Update the connector and save the new connector metadata into the ml-connector index. (already in the current API)
  2. Undeploy the models that are associated with the connector. (a single-line change)

After UpdateConnector finishes, the next time the connector is used by any model, that model will be auto-deployed with the updated connector. Users may ask: why was my model un-deployed once the connector was updated? Because a critical piece of metadata changed, the model itself has effectively changed, so we un-deploy it. However, this doesn't introduce any downtime or disturb how you use your model. From the user's point of view, availability and usability remain the same.

Related: https://github.com/opensearch-project/ml-commons/issues/1148, https://github.com/opensearch-project/ml-commons/issues/2376
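The two-step flow above can be simulated in a few lines. This is a hypothetical sketch only; the class and method names are illustrative stand-ins, not the real ml-commons internals.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical simulation of: (1) persist the new connector,
// (2) undeploy associated models, (3) auto-deploy on the next predict.
public class AutoDeploySketch {

    record Connector(String id, int version) {}

    static final Map<String, Connector> connectorIndex = new ConcurrentHashMap<>();
    // modelId -> the connector the model was deployed with
    static final Map<String, Connector> deployedModels = new ConcurrentHashMap<>();
    // Static association between models and connectors, for the sketch.
    static final Map<String, String> modelToConnector = Map.of("m1", "c1");

    // Steps 1 + 2: save the new connector metadata, then undeploy the
    // models associated with it.
    static void updateConnector(Connector updated) {
        connectorIndex.put(updated.id(), updated);
        deployedModels.entrySet().removeIf(
            e -> modelToConnector.get(e.getKey()).equals(updated.id()));
    }

    // Predict auto-deploys the model when it is not in the cache,
    // picking up whatever connector metadata is current.
    static Connector predict(String modelId) {
        return deployedModels.computeIfAbsent(
            modelId, id -> connectorIndex.get(modelToConnector.get(id)));
    }

    public static void main(String[] args) {
        connectorIndex.put("c1", new Connector("c1", 1));
        predict("m1");                             // auto-deploys with version 1
        updateConnector(new Connector("c1", 2));   // undeploys m1
        System.out.println(predict("m1").version()); // prints 2: redeployed
    }
}
```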

zane-neo commented 1 month ago

> With auto-deploy for remote models, this can be easily done as follows:
>
>   1. Update the connector and save the new connector metadata into the ml-connector index. (already in the current API)
>   2. Undeploy the models that are associated with the connector. (a single-line change)
>
> After UpdateConnector finishes, the next time the connector is used by any model, that model will be auto-deployed with the updated connector. Users may ask: why was my model un-deployed once the connector was updated? Because a critical piece of metadata changed, the model itself has effectively changed, so we un-deploy it. However, this doesn't introduce any downtime or disturb how you use your model. From the user's point of view, availability and usability remain the same.
>
> Related: #1148, #2376

Is there any possibility that the model's un-deploy and auto-deploy could happen at the same time, causing an unexpected status?

Zhangxunmt commented 1 month ago

The "unexpected status" is too general, so it's hard to enumerate every edge case or race condition. We need to state clearly that it's not recommended to predict with a model while you are in the middle of updating its connector. Before the un-deploy finishes, auto-deploy will not happen because the old model is still in memory, so predictions made during a connector update are still served by the old model.

zane-neo commented 1 month ago

That doesn't seem like a good user experience. If a user is updating HTTP-client-related parameters, e.g. the connection timeout, the ideal experience would be: predictions keep happening; instances that have received the update honor the new connection timeout for subsequent predictions, while instances that haven't yet received it honor the old one. The problem is that updating a connector can currently cause data loss in production because it requires un-deploying the model, so our goal should be to avoid this and give users a seamless experience without data loss.
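The seamless behavior described here could be sketched as an in-place swap of the connector settings on an already-deployed model, so in-flight predictions never stop. This is a hypothetical illustration, assuming such an in-place update path existed; the names are not the real ml-commons API.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: swap connector settings in place on a deployed model
// instead of un-deploying it. Names are illustrative, not ml-commons classes.
public class InPlaceConnectorSwap {

    record ConnectorSettings(int connectionTimeoutMs) {}

    static class DeployedModel {
        // Each prediction reads the current settings; the swap is atomic, so
        // every request sees either the old timeout or the new one, and the
        // model never leaves the cache.
        private final AtomicReference<ConnectorSettings> settings;

        DeployedModel(ConnectorSettings initial) {
            this.settings = new AtomicReference<>(initial);
        }

        int predictTimeout() {             // stands in for a real prediction call
            return settings.get().connectionTimeoutMs();
        }

        void applyConnectorUpdate(ConnectorSettings updated) {
            settings.set(updated);         // no undeploy, no downtime
        }
    }

    public static void main(String[] args) {
        DeployedModel m = new DeployedModel(new ConnectorSettings(1000));
        int before = m.predictTimeout();
        m.applyConnectorUpdate(new ConnectorSettings(5000));
        int after = m.predictTimeout();
        System.out.println(before + " " + after); // 1000 5000
    }
}
```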

Zhangxunmt commented 1 month ago

It will not cause data loss. Auto-deploy will refresh the connector for you with the updated params. It may only introduce data inconsistency in a short time window.