Open · model-collapse opened 7 months ago
Currently we release models in the release team's S3 bucket using opensearch-py-ml's model release workflow.
Could you please add another point describing why we can't extend our existing workflow to support all our needs?
> Currently the OpenSearch team releases our model artifacts on a single page of the documentation site, and those artifacts cover only neural search and neural sparse. We need a better place to publish the model artifacts as well as other materials about the models.
This is not completely correct. The OpenSearch release team is pushing us to create/upgrade our automated model publishing workflow so that we don't need to depend on the release team. So in theory we can easily enhance this workflow in an automated fashion and release models based on customer needs.
> Could you please add another point describing why we can't extend our existing workflow to support all our needs?
From my understanding, the current workflow can only publish model weights. But sometimes we also need to publish code for training and inference, as a Python script or Jupyter notebook, and the current workflow cannot do that. However, I notice we already have some Jupyter examples here: https://github.com/opensearch-project/opensearch-py-ml/tree/main/docs/source/examples. Can we use that repo directly for option 1?
Background:
With the growing impact of OpenSearch ml-commons, we observe that some people are proposing to contribute model artifacts to the community. Meanwhile, some features should ship with a default model artifact, such as neural search and neural sparse. With the release of the agent framework and more features on search relevance, log analytics, and beyond, it is now necessary to discuss the SOP (Standard Operating Procedure) for releasing model artifacts.
Parameter storage
Today, most deep learning models have a huge number of parameters, so it is impossible to store model parameter files inside the GitHub repos. We encourage contributors to upload their models to Hugging Face or another cloud-hosted OSS and include the file URL in the repo.
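As a minimal sketch of how externally hosted weights would be consumed, ml-commons supports registering a model from a URL via its REST API (`POST /_plugins/_ml/models/_register`). The model name and URL below are hypothetical placeholders, not real artifacts:

```python
import json

def build_register_request(name: str, version: str, url: str) -> dict:
    """Build the JSON body for POST /_plugins/_ml/models/_register,
    pointing OpenSearch at weights hosted outside the git repo."""
    return {
        "name": name,
        "version": version,
        "model_format": "TORCH_SCRIPT",  # format of the hosted artifact
        "url": url,                      # weights live on Hugging Face / OSS
    }

# Hypothetical model name and hosting URL, for illustration only.
body = build_register_request(
    "example/tool-selection",
    "1.0.0",
    "https://example.com/tool-selection.pt",
)
print(json.dumps(body, indent=2))
```

With this pattern the repo only needs to track the URL and metadata; the heavyweight parameter file never enters version control.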
Discussion 1: Where shall we release the models?
Currently the OpenSearch team releases our model artifacts on a single page of the documentation site, and those artifacts cover only neural search and neural sparse. We need a better place to publish the model artifacts as well as other materials about the models.
Option 1:
Adding one folder in the ml-commons repo, one in opensearch-skills, and one in neural. Each model will have a dedicated subdirectory. For example:

```
ml-commons.git
┣ client
┣ ml-algorithms
┣ ....
┗ models
  ┣ tool-selection
  ┃ ┣ inference.py
  ┃ ┣ tool-selection.ipynb
  ┃ ┗ ....
  ┗ index-selection
    ┣ inference.py
    ┗ tool-selection.ipynb
```
Pros:
Cons:
Option 2:
Creating a new repository to hold those model artifacts. All OpenSearch-related ML models would be gathered in this repo. Each model is hosted in a subdirectory under the repo's root. Each subdirectory should have a README.md describing the model's usage, especially which OpenSearch module it works with.
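A per-model README.md under this layout might look like the following skeleton. The section names and the `tool-selection` model are hypothetical suggestions, not a finalized template:

```
# tool-selection (hypothetical example model)

## Works with
OpenSearch module(s) this model supports, e.g. the agent framework.

## Artifacts
- inference.py — inference script
- tool-selection.ipynb — training / evaluation notebook
- Weights: URL to the Hugging Face or cloud-hosted OSS file

## Benchmark
Results compared against baseline models.
```

Keeping the "Works with" section mandatory makes it easy to answer the most common reader question: which OpenSearch feature does this model serve?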
Pros
Cons
The SOP:
1. The contributor should open an issue in the related repo, describing what the model artifact will look like and which feature it will support.
2. If there is no major concern from the community, PRs can be pushed to the main branch. The PR description should provide:
   a. A list of files, with a description of each file. The reviewers should carefully review the model execution scripts, especially for security.
   b. Benchmark results showing the comparison to other baseline models. The maintainer should invite a science reviewer to check the benchmark results as another review of the PR.
3. After the PR is merged, the OpenSearch team will start the process of updating the model release page of the documentation website if needed.
Suggested & Minimum Requirements:
For each model, we hold a minimum bar on the release artifacts and also give our suggested artifacts.
Online Deployed Models
Offline Deployed Models
Discussion 2: When publishing a feature RFC, should the issue of its model release be opened at the same time?
Option 1: Yes, and the model issue should be linked with the feature issue.
Pros: If the audience of the feature RFC is curious about the model, they can track the model release as well.
Cons: More workload to write duplicate context for both issues. Meanwhile, describing one thing in two places will lead to information loss.
Option 2: No, just include the model information in the feature RFC.
Pros: Less workload; readers only need to read one issue and won't miss anything.
Cons: No place to track all the model releases together. When reading the issue, engineers may get lost if there are too many scientific descriptions of the model.