Open · model-collapse opened 7 months ago
Currently we release models in the release team's S3 bucket using opensearch-py-ml's model release workflow.
Could you please add another point describing why we can't extend our existing workflow to support all our needs?
> Currently the OpenSearch team releases our model artifacts on a single page of the documentation site, and those artifacts cover only neural search and neural sparse. We need a better place to publish the model artifacts as well as other materials about the models.
This is not completely correct. The OpenSearch release team is pushing us to create/upgrade our automated model publishing workflow so that we don't need to depend on the release team. So in theory we can easily enhance this workflow in an automated fashion and release models based on customer needs.
> Could you please add another point describing why we can't extend our existing workflow to support all our needs?
From my understanding, the current workflow can only publish model weights. But sometimes we also need to publish code for training and inference, as a Python script or Jupyter notebook, and the current workflow cannot do that. However, I notice we already have some Jupyter examples here: https://github.com/opensearch-project/opensearch-py-ml/tree/main/docs/source/examples. Can we use that repo directly for option 1?
Background:
With the growing impact of OpenSearch ml-commons, we observe that some people are proposing to contribute model artifacts to the community. Meanwhile, some features should ship with a default model artifact, such as neural search and neural sparse. With the release of the agent framework and more features on search relevance, log analytics, and beyond, it is now necessary to discuss the SOP (Standard Operating Procedure) for releasing model artifacts.
Parameter storage
Today, most deep learning models have a huge number of parameters, so it is impossible to store model parameter files inside the GitHub repos. We encourage contributors to upload their models to Hugging Face or another cloud-hosted OSS and include the file URL in the repo.
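As a minimal sketch of how externally hosted weights would be consumed, ml-commons supports registering a model from a URL via its REST API (`POST /_plugins/_ml/models/_register`). The model name and URL below are hypothetical placeholders, not real artifacts:

```python
import json

def build_register_request(name: str, version: str, url: str) -> dict:
    """Build the JSON body for POST /_plugins/_ml/models/_register,
    pointing OpenSearch at weights hosted outside the git repo."""
    return {
        "name": name,
        "version": version,
        "model_format": "TORCH_SCRIPT",  # format of the hosted artifact
        "url": url,                      # weights live on Hugging Face / OSS
    }

# Hypothetical model name and hosting URL, for illustration only.
body = build_register_request(
    "example/tool-selection",
    "1.0.0",
    "https://example.com/tool-selection.pt",
)
print(json.dumps(body, indent=2))
```

With this pattern the repo only needs to track the URL and metadata; the heavyweight parameter file never enters version control.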
Discussion 1: Where shall we release the models?
Currently the OpenSearch team releases our model artifacts on a single page of the documentation site, and those artifacts cover only neural search and neural sparse. We need a better place to publish the model artifacts as well as other materials about the models.
Option 1:
Adding one folder in the ml-commons repo, one in opensearch-skills, and one in neural. Each model will have a dedicated subdirectory. For example:

```
ml-commons.git
┣ client
┣ ml-algorithms
┣ ....
┗ models
  ┣ tool-selection
  ┃ ┣ inference.py
  ┃ ┣ tool-selection.ipynb
  ┃ ┗ ....
  ┗ index-selection
    ┣ inference.py
    ┗ tool-selection.ipynb
```
Pros:
Cons:
Option 2:
Creating a new repository to hold those model artifacts. All OpenSearch-related ML models would be gathered in this repo. Each model is hosted in a subdirectory under the repo's root. Each subdirectory should have a README.md describing the model's usage, especially which OpenSearch module it works with.
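A per-model README.md under this layout might look like the following skeleton. The section names and the `tool-selection` model are hypothetical suggestions, not a finalized template:

```
# tool-selection (hypothetical example model)

## Works with
OpenSearch module(s) this model supports, e.g. the agent framework.

## Artifacts
- inference.py — inference script
- tool-selection.ipynb — training / evaluation notebook
- Weights: URL to the Hugging Face or cloud-hosted OSS file

## Benchmark
Results compared against baseline models.
```

Keeping the "Works with" section mandatory makes it easy to answer the most common reader question: which OpenSearch feature does this model serve?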
Pros
Cons
The SOP:
1. The contributor should open an issue in the related repo, describing what the model artifact will look like and which feature it will support.
2. If there is no major concern from the community, PRs can be pushed to the main branch. The PR description should provide:
   a. A list of files, with a description of each file. The reviewers should carefully review the model execution scripts, especially for security.
   b. Benchmark results showing the comparison to other baseline models. The maintainer should invite a science reviewer to check the benchmark results as another review of the PR.
3. After the PR is merged, the OpenSearch team will start the process of updating the model release page of the documentation website if needed.
Suggested & Minimum Requirements:
For each model, we hold a minimum bar on the release artifacts and also give our suggested artifacts.
Online Deployed Models
Offline Deployed Models
Discussion 2: When publishing a feature RFC, should the issue of its model release be opened at the same time?
Option 1: Yes, and the model issue should be linked with the feature issue.
Pros: If the audience of the feature RFC is curious about the model, they can track the model release as well.
Cons: More workload to write duplicate context for both issues. Meanwhile, describing one thing in two places will lead to information loss.
Option 2: No, just include the model information in the feature RFC.
Pros: Less workload; readers only need to read one issue and won't miss anything.
Cons: No place to track all the model releases together. When reading the issue, engineers may get lost if there are too many scientific descriptions of the model.