Closed dtaivpp closed 3 months ago
There are a few points of confusion here. I'm going to clarify them one by one.
Model content changed issue:
The POST request object you used is incomplete. For security, we introduced model_content_hash_value, which must be provided in the POST request. During model registration, we internally calculate the hash value of the model file and compare it with the given hash value to make sure the file was not tampered with during upload. For the correct POST request, please check this example. Also, if you want to upload a pre-trained model, you can check this page.
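For anyone computing the hash locally, here is a minimal sketch. It assumes the expected value is the hex SHA-256 digest of the model file; the comment above does not name the algorithm, so please verify that against the linked example. The function name and chunk size are illustrative only.

```python
import hashlib
import sys


def model_content_hash(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`.

    Assumption: the value expected in model_content_hash_value is a
    SHA-256 hex digest of the model file. Check the docs to confirm.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large model archives do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python hash_model.py path/to/model.zip
    print(model_content_hash(sys.argv[1]))
```

The printed value is what would go into model_content_hash_value in the request body.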
We cannot load a new model with that name however as it is still in the system:
In 2.8, we introduced model groups.
The ideal scenario is to create a model group first and then provide the model group ID during model registration.
To keep this flexible for customers, we made the model group ID optional during model registration. If no model group ID is provided in the model upload/registration request, we internally create a model group with the same name as the model and put the model under that group. We expect the model name to be unique.
So what happened in this case is that the first time you uploaded the model, it created a model group with the model's name. When you tried to upload the same model again, it tried to create another model group with the same name and threw that error, because model group names must be unique.
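To make the sequence above concrete, here is a toy in-memory sketch of the behavior as described. It is not the plugin's code: `ToyRegistry`, `ConflictError`, and the ID formats are hypothetical names made up for illustration.

```python
import itertools


class ConflictError(Exception):
    """Raised when a model group name is already taken (hypothetical)."""


class ToyRegistry:
    """In-memory stand-in for the behavior described above; not plugin code."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.groups = {}  # group id -> group name
        self.models = {}  # model id -> (model name, group id)

    def create_group(self, name):
        # Model group names must be unique.
        if name in self.groups.values():
            raise ConflictError(f"model group name already exists: {name}")
        group_id = f"group-{next(self._ids)}"
        self.groups[group_id] = name
        return group_id

    def register_model(self, name, model_group_id=None):
        if model_group_id is None:
            # No group ID given: implicitly create a group named after the model.
            model_group_id = self.create_group(name)
        elif model_group_id not in self.groups:
            raise KeyError(model_group_id)
        model_id = f"model-{next(self._ids)}"
        self.models[model_id] = (name, model_group_id)
        return model_id


if __name__ == "__main__":
    registry = ToyRegistry()
    # Ideal scenario: create the group first, then pass its ID.
    group_id = registry.create_group("my-group")
    registry.register_model("model-a", model_group_id=group_id)
    # Implicit scenario: the first upload creates a group named "model-b".
    registry.register_model("model-b")
    try:
        # The second upload tries to create group "model-b" again.
        registry.register_model("model-b")
    except ConflictError as exc:
        print(exc)
```

The second `register_model("model-b")` call fails at group creation, before any model is stored, which matches the conflict described above.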
Model group related documentation:
I hope this clarifies the confusion.
@dhrubo-os that makes sense. But the root cause reported in the error doesn't really explain the problem. We should update our error message to say "conflict with model group ID, please retry with a different model group". This would be helpful.
Couple of questions:
xyz (assuming the API is async). I believe it would be a better customer experience.
When the model _upload failed, shouldn't we clean up the model group we created? Ideally that would be the cleanest experience. For the user it feels like there is lingering context which they don't know about :/.
@saratvemulapalli Yeah, I completely agree with this. We tried to improve the error message in this PR
When the model _upload failed, shouldn't we clean up the model group we created? Ideally that would be the cleanest experience. For the user it feels like there is lingering context which they don't know about :/.
Yes, I agree we should clean up the model group if the model is not registered successfully. @rbhavna we don't do this now, right?
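As a rough illustration of the cleanup being suggested here, the sketch below rolls back an implicitly created group when the upload step fails. Everything in it is a hypothetical stand-in (`register_with_cleanup`, `UploadError`, a plain set of group names); it only shows the shape of the idea, not how the plugin would implement it.

```python
class UploadError(Exception):
    """Hypothetical failure raised while uploading model content."""


def register_with_cleanup(groups, name, upload):
    """Implicitly create a group for `name`, and remove it if `upload` fails.

    `groups` is a set of taken group names and `upload` is any callable
    that raises on failure. Both are illustrative stand-ins.
    """
    if name in groups:
        raise ValueError(f"model group name already exists: {name}")
    groups.add(name)  # implicit group creation
    try:
        return upload(name)
    except Exception:
        # Roll back the group so the same name can be used on retry.
        groups.discard(name)
        raise
```

With this shape, a failed upload leaves no lingering group behind, so a retry with the same model name would not hit the name conflict.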
Okay, afk but I will give this thread a thorough read to see. In addition if model groups are getting implicitly created we should update the documentation on this page:
https://opensearch.org/docs/latest/ml-commons-plugin/ml-framework/
I had been following that thinking it was a complete set of documentation which is a bit misleading. Perhaps we should rethink how this documentation is broken up. Happy to put a proposal out there since I am working through this at the moment.
@dtaivpp I think it would help a lot if you could help improve the documentation. You can open an issue here: https://github.com/opensearch-project/documentation-website/issues
Closing this as a majority of the issues were tied to the error message.
What is the bug? When a model upload fails there is an issue where the model still exists but it cannot be removed/unloaded.
How can one reproduce the bug? Steps to reproduce the behavior:
This creates a task ID that can be tracked. Viewing the task ID shows that it has clearly failed.
The above output indicates that the TaskID we were given is actually the ModelID. Attempting to _unload the model with what I believe is really the TaskID yields the following:
We cannot load a new model with that name however as it is still in the system:
What is the expected behavior? If a model creation fails I expect to either be able to delete the failed upload or to upload a new model with the same name without issues.
What is your host/environment?