[BUG] Orphaned ML Models

dtaivpp commented 1 year ago

What is the bug? When a model upload fails there is an issue where the model still exists but it cannot be removed/unloaded.

How can one reproduce the bug? Steps to reproduce the behavior:

Upload a model on (what I guess is an unsupported platform?) M1 macbook.

POST /_plugins/_ml/models/_upload 
{
"name": "TEST_MODEL",
"version": "1.0.0",
"description": "MiniLM",
"model_format": "TORCH_SCRIPT",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "sentence_transformers"
},
"url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}

This creates a Task ID that can be tracked. Viewing the task id it has clearly failed.

The above outputs indicates the TaskID we had been given is actually the ModelID. Attempting to _unload the model with what I believe is really the TaskID yields the following:

We cannot load a new model with that name however as it is still in the system:

What is the expected behavior? If a model creation fails I expect to either be able to delete the failed upload or to upload a new model with the same name without issues.

What is your host/environment?

OS: MacOS (M1)
Version 2.9

dhrubo-os commented 1 year ago

There are few confusions. I'm going to clarify all those one by one.

Model content changed issue: The post request object you used is not complete. To maintain security, we introduced model_content_hash_value which is needed to provide in the post request. During model registration, internally we calculate the hash value of the model file and then compare with the given hash value to make sure there wasn't any kind of security breach happened during the uploading. For the right post request, please check this example

Also if you want to upload any pre-trained model, you can check this page

About We cannot load a new model with that name however as it is still in the system: :

In 2.8, we introduced model group.

Ideal scenario is: We create a model group and then provide the model group id during the model registration.

To make it flexible for customer, we kept the model group id optional during the model registration. So in model upload/registration request, if there's no model group id is provided, then internally we create a model group with the same name of the model and put that model under that group. We expect model name to be unique.

So in this case what happened is, first time when you uploaded a model it created a model group with the same model name and then when you tried to upload the same model again, it tried to create another model group with the same name and threw that error because model group name is unique.

Model group related documentation:

I hope this clarifies the confusion.

saratvemulapalli commented 1 year ago

@dhrubo-os it makes sense. But from the error root cause, it really doesn't explain the problem. We should update our error message to say "conflict with model group ID, please retry with a different model group". This would be helpful.

Couple of questions:

If we implicitly create a model group, is there anyway we could notify the user that we created a model group with name xyz (assuming the API is async). I believe it would be a better customer experience.
When the model _upload failed, shouldn't we clean up the model group we created ? Ideally that would be the cleanest experience. For the user it feels like there is lingering context which they dont know :/.

dhrubo-os commented 1 year ago

@saratvemulapalli Yeah I completely agree with this. We tried to improve the error message in this PR

When the model _upload failed, shouldn't we clean up the model group we created ? Ideally that would be the cleanest experience. For the user it feels like there is lingering context which they dont know :/.

Yes, I agree we should clean up the model group is the model is not registered successfully. @rbhavna we don't do this now, right?

dtaivpp commented 1 year ago

Okay, afk but I will give this thread a thorough read to see. In addition if model groups are getting implicitly created we should update the documentation on this page:

https://opensearch.org/docs/latest/ml-commons-plugin/ml-framework/

I had been following that thinking it was a complete set of documentation which is a bit misleading. Perhaps we should rethink how this documentation is broken up. Happy to put a proposal out there since I am working through this at the moment.

ylwu-amzn commented 1 year ago

@dtaivpp I think it can help a lot if you can help improving the document. You can cut issue here https://github.com/opensearch-project/documentation-website/issues

dtaivpp commented 3 months ago

Closing this as a majority of the issues were tied to the error message.

opensearch-project / ml-commons

[BUG] Orphaned ML Models #1179