Hey all! My team at Aryn noticed some recent development on the “feature/remote-inference” branch of ml-commons for a ChatConnector and query executor that go beyond the RFC published. We are also working to enable conversational applications with OpenSearch in a similar way. Our approach was to create and open source plug-ins and search pipelines, and it seems like now would be the right time to converge and work together on an approach and discuss the primitives. We couldn’t find a RFC for this work, and we would love to collaborate on the next steps and share our approach. We’ve started work on our own RFC for this functionality, and can share our thoughts in advance of publishing it. Does anyone know the developers working on this? I’d also like to kick off a quick sync to chat more - who else would be interested in joining? LMK.
@jonfritz It's great to hear that you are building some cool plugins for search pipelines and conversational apps. They are actually all on our roadmap this year. We are planning to build a new plugin based on this remote inference feature that is dedicated to handling conversational requests for customers using generative AI. I think it's quite possible that our approaches are mergeable to some degree. Can you please include our product manager @dylan-tong-aws and Sr SDE @ylwu-amzn in the sync-up meeting?
Thanks Xun! A set of us chatted yesterday (with representatives from AWS), and reposting the next steps here that I added in the Slack channel: "Thanks folks for getting together yesterday to discuss approaches for conversational search. The next step from the call is that Ben or Austin will submit the RFC for using plug-ins and search pipelines to enable conversational search (using pluggable generative AI models) and “conversational memory” (a way to create, store, and add interactions to a conversation). This will be submitted in the next few days, and then let’s give some good feedback. Looking forward to the collaboration in building this functionality for OpenSearch customers!"
It will be great to get feedback from you, @dylan-tong-aws, and @ylwu-amzn on the RFC once it's posted, so we can have the community align on the approach to take for OpenSearch's conversational interface.
Hi @jonfritz, yes sure. Please share the RFC once it's published. I will organize our team to take a look and provide feedback!
@Zhangxunmt here you go - https://github.com/opensearch-project/ml-commons/issues/1150.
Also, @Zhangxunmt, with regards to "quite possible that our approaches are mergeable to some degree" - let's use the RFC to align on the way the OpenSearch community wants to architect this functionality, and iterate on it together using that mechanism. Let's make sure it meets the use cases you had in mind as well, and take one approach for the project.
While working on #1150 (PR #1195), one thing I considered is to consume an HttpConnector to invoke OpenAI APIs directly without going through a remote model. Have you guys considered this approach? One benefit to this approach is that you don't have to rely on ML nodes just to be able to make calls to remote inference endpoints. What do you guys think about this approach?
@ylwu-amzn

> While working on #1150 (PR #1195), one thing I considered is to consume an HttpConnector to invoke OpenAI APIs directly without going through a remote model. Have you guys considered this approach? One benefit to this approach is that you don't have to rely on ML nodes just to be able to make calls to remote inference endpoints. What do you guys think about this approach?
Yes, we considered this option. We also weighed several other factors, such as security and downstream impact, and decided to use a remote model by leveraging the current model management framework.
> you don't have to rely on ML nodes just to be able to make calls to remote inference endpoints
This concern has been addressed in https://github.com/opensearch-project/ml-commons/pull/1197
I do like that plugins.ml_commons.task_dispatcher.eligible_node_role.remote_model and plugins.ml_commons.task_dispatcher.eligible_node_role.local_model have reasonable/sensible defaults. But I worry that you are introducing way too many knobs. I don't think that's justified just to force remote models to fit into the mold of local models.
There are performance and scale considerations a cluster admin needs to make when hosting (large) models locally. Sure, let's give them all the knobs they need to ensure these models don't bring down the cluster and impact non-ML workloads. But why do they have to be burdened with these superficial knobs for remote models that do not have any resource contention implications?
The description of these controls in #1197 comes across as if we are patching holes as we go. It would be nice to see these decisions being backed by customer feedback and use cases.
> I don't think that's justified just to force remote models to fit into the mold of local models.

I don't quite get you; you can see that we are adding these settings precisely to avoid forcing remote models into the mold of local models.
> But why do they have to be burdened with these superficial knobs for remote models that do not have any resource contention implications?

Remote models are not free; they do consume resources. These settings will help if users want to run remote models only on ML nodes. Users can tune them themselves.
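For example, if a cluster admin did want remote models dispatched only to ML nodes, the tuning might look roughly like this (the setting names come from the discussion above; the value format shown is an assumption for illustration):

```json
PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.task_dispatcher.eligible_node_role.remote_model": ["ml"]
  }
}
```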
> It would be nice to see these decisions being backed by customer feedback and use cases.

I don't think we should always wait for customer feedback/use cases to build features. cc @dylan-tong-aws, do you have any customer feedback or use cases?
> I don't think we should always wait for customer feedback/use cases to build features.
My 2 cents: In most cases, it's good to work backwards from customers and use cases for new features. Otherwise, we'll be at risk of adding more knobs or complexity with minimal benefit. Interested to hear the customer-led insights driving these additions.
Let's have a meeting to discuss. There aren't supposed to be many knobs exposed to the users who provision connectors (admin or MLOps/infra engineers). There are [two personas](https://github.com/opensearch-project/ml-commons/issues/881) for this extensibility framework, and we need to work on making that distinction clear. One persona is the integrator. This is an SDE who represents some technology provider. They need enough flexibility to describe an integration (blueprint) between OpenSearch and an external service via RESTful APIs. The blueprint should be designed in a way that the admin or MLOps engineer who provisions the connector is only exposed to a few configurations. CloudFormation is a good analogy: think about the CloudFormation template developer versus an ops engineer who uses the template. Right now, our APIs expose the blueprint details, but I am advocating for these APIs to be refactored or overloaded so that they don't expose all the knobs intended for integrators.
We had internal discussions about an API to publish blueprints. In the future, we will have certified connector blueprints, which will be pre-installed.
An admin should be able to provision a connector like this:
```
POST /_plugins/_ml/connectors/_create
{
  "connector_blueprint_id": "sagemaker_connector",
  "region": "us-west-2",
  "end_point": "lmi-model-2023-06-24-01-35-32-275",
  "iam_role_or_access_keys": "xxxxxxxxxx"
}
```
An admin only needs to be exposed to user inputs required at provision or invocation time. Credentials are something that can be set when a connector is provisioned or updated. There are use cases where we might want to provide the ability for a parameter to be overridden at invocation time. For instance, for users that are using Amazon SageMaker multi-model endpoints, they should be able to provision one connector to back multiple OpenSearch managed (external) models. Amazon SageMaker needs a model identifier/name to route a request to the appropriate model being served on one endpoint. Being able to specify a parameter at the model level that can be passed to a shared connector at invocation time makes it easy to support this use case.
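As a rough sketch of that multi-model use case (the API path and field names such as target_model are assumptions for illustration, not a finalized design), a model-level parameter could be attached to a model that shares a connector and then forwarded at invocation time:

```json
POST /_plugins/_ml/models/_register
{
  "name": "reviews-summarizer",
  "function_name": "remote",
  "connector_id": "<shared_sagemaker_connector_id>",
  "parameters": {
    "target_model": "summarizer-v2.tar.gz"
  }
}
```

The shared SageMaker connector would read target_model at predict time to route the request to the right model on the multi-model endpoint.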
We're actively discussing the next phase of enhancements, and these are among them.
@ylwu-amzn, @austintlee, I've requested to set up a meeting over Slack. The specific configurations that @austintlee called out were technical decisions. There were no explicit customer/business requirements for these knobs. Let's meet to discuss the concerns and the technical decision to expose these configurations.
With that said, we do have user/business requirements to ensure this framework is cost optimized. This feature should not, for instance, have dependencies on ML nodes. Currently, one could use an ML node as a proxy, but that should not be required. In fact, I advocate that we disable this because there are no user requirements or known use cases that require it. There are also cluster settings like "plugins.ml_commons.only_run_on_ml_node: true" that need to be decoupled from external models. We're actively working on this. A user should not have to set this to false to use the connectors and external models.
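For reference, this is the setting in question; today a user has to relax it cluster-wide before connectors and external models will work, which is exactly the coupling we want to remove:

```json
PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.only_run_on_ml_node": false
  }
}
```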
Hi @austintlee @Zhangxunmt
I tried to load and deploy the open-source Hugging Face GPT-2 model, and it deployed successfully. I am following this documentation: Documentation Link
However, when I try to create a search pipeline for the connector model, I am not able to do so and get an error in the response.
I have also raised this issue on the OpenSearch forum but didn't get any response. The link to my issue is: Open search issue Link
I have also enabled plugins.ml_commons.rag_pipeline_feature_enabled, but the issue still persists.
Any suggestions on this issue would be really appreciated.
Problem Statement
ML Commons for OpenSearch eases the development of machine learning features by providing a set of common machine learning (ML) algorithms through transport and REST API calls. Currently, ML Commons supports several built-in models, such as KNN and Linear Regression, as well as custom models uploaded by users.
As a complement to the current ML model serving framework, we want to allow customers to use their ML technology of choice, such as OpenAI, Amazon SageMaker Hosting, Kubeflow KServe, TensorFlow Serving, and NVIDIA's Triton Inference Server, and empower ML technology providers to integrate their technology with OpenSearch via a low-to-no-code experience and join an open ecosystem that empowers builders to create AI-powered apps faster.
We are trying to resolve the following problems.
Innovation velocity: there are so many mature and rapidly evolving model serving technologies and groundbreaking ML capabilities that are democratized exclusively through ML APIs and services. We want to let users select the best technology available to them and benefit from features that might not be natively available on OpenSearch.
Ease of adoption: many users have already adopted or built their own ML platform. We want to let those users leverage their existing investments and approved technologies.
Facilitating an open ecosystem: we need an easier way for partners and community contributors to integrate ML technologies with OpenSearch. As an open and community-driven platform, it’s important for us to empower contributors to co-innovate and drive joint-GTM motions. We want to provide integrators with a solution that ensures their engineering investments have a low cost of failure and high ROI potential.
What is the developer experience going to be?
The developer within the context of this framework is someone who is building an integration on behalf of a model serving technology or API. The integrator creates a connector blueprint for a service like the OpenAI ChatGPT API or Amazon SageMaker Hosting Services by defining a blueprint (e.g., a JSON document) that describes a protocol that OpenSearch can use to communicate with an external ML model service.
More details on the blueprint spec and APIs are provided in the Connector Blueprint Section.
Sample Use Cases and workflow
There are three user types: admin, integrator (developer), and end user. Integrators or developers are the active community contributors who train and deploy models with an external model server and provision connectors within OpenSearch to enable an integration with the remote model. Integrators can also publish validated connectors as a JSON document to 1) an OpenSearch repository that end users can later download from the OpenSearch website, and 2) the local ml-connector index included in the OpenSearch distributions so that any end user can use it directly, e.g., a certified SageMaker connector to run an NLP model. End users are the people and systems that run queries requiring remote model inference. The admin is the owner of the OpenSearch domain who defines permissions and grants the proper permissions to developers and end users.
If the target model hasn’t already been deployed and published, the integrator will deploy the model on the model server technology the connector was designed to support.
The integrator can choose to publish the work as a community connector or a certified connector as described in the feature brief.
The admin/integrator can deploy connectors into OpenSearch ml-commons from multiple sources.
Once the remote model connector is created and deployed, end users can create virtual models inside ml-commons to run remote inference by calling the remote server. The virtual model shares the current ML-Model structure and is stored in the existing ML-Model index. It does not contain physical model content but hosts all the model metadata. Multiple virtual models can be created and associated with the same connector. ml-commons will provide a new "Create Model" API to create virtual models (a sketch follows below).
End users use the existing "predict model" APIs to run remote inference and the CRUD APIs to manage virtual models, including searching for models. Deleting or updating a virtual model does not necessarily mean deleting the associated connector.
ml-commons will provide a new set of APIs to manage model connectors. OpenSearch admins can run these APIs to view active connectors, check connector status, and update/search/delete connectors. The details of these new APIs are listed in the REST APIs section of this design doc.
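To make this workflow concrete, a minimal sketch of the virtual-model step and connector management calls might look like the following (the exact paths and field names are assumptions for illustration, not the final API):

```json
POST /_plugins/_ml/models/_register
{
  "name": "openai-summarization-virtual-model",
  "function_name": "remote",
  "connector_id": "<connector_id returned by the Create Connector API>"
}

GET /_plugins/_ml/connectors/<connector_id>
DELETE /_plugins/_ml/connectors/<connector_id>
```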
Proposed Solution
We allow customers to define a connector blueprint to connect to any model serving framework. Once the blueprint is created, the user can use it to provision a connector that enables secure communication between an OpenSearch cluster and the external service/API. There will be a CRUD API for connectors in ML-Commons, and a new system index will be created to store and manage the connectors. The blueprint is parametric and generalized enough for ML-Commons to parse, so customers can create new connectors to their favored AI model by simply configuring a blueprint, achieving a low-to-no-code user experience.
To run a remote inference, the user needs to define a model that references the connector ID they want to use for remote inference. We name these models "virtual models" in ml-commons, and they share the current model management APIs (i.e., upload, train, delete, inference) with the other built-in physical models. Invoking the "Predict" API against a virtual model runs a remote inference against the remote server through the associated connector.
Connector Blueprint/Template Definition
To create a new remote server connector in the ml-commons plugin, users need to provide a connector blueprint in the RESTful "Create Connector" API. The feature brief has provided a high-level idea of the connector blueprint, which should be general and parametric enough to support all model serving frameworks. To be more specific here, we will use a nested JSON template to define a connector. In a connector blueprint, there are two types of placeholders:
The following is the proposed blueprint spec. This is what the partner is responsible for defining to create the integration.
Remote Inference Example
Using OpenAI as the example, creating an OpenAI connector will look like the following.
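A rough illustration of what such a request might look like (the blueprint fields and ${...} placeholder syntax shown here are assumptions for illustration, not the authoritative spec from this proposal):

```json
POST /_plugins/_ml/connectors/_create
{
  "name": "openai-chat-connector",
  "description": "Connector to the OpenAI chat completions API",
  "version": 1,
  "protocol": "http",
  "parameters": {
    "endpoint": "api.openai.com",
    "model": "gpt-3.5-turbo"
  },
  "credential": {
    "openAI_key": "<your OpenAI API key>"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://${parameters.endpoint}/v1/chat/completions",
      "headers": {
        "Authorization": "Bearer ${credential.openAI_key}"
      },
      "request_body": "{ \"model\": \"${parameters.model}\", \"messages\": ${parameters.messages} }"
    }
  ]
}
```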
Once the connector is active, users can create “remote models” through our existing model management APIs and perform inference as follows.
Two options are provided to invoke the “Predict” API for remote inference.
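As a sketch of what an invocation might look like (the path, model ID, and request body shape are assumptions for illustration, and this shows only one of the two options):

```json
POST /_plugins/_ml/models/<remote_model_id>/_predict
{
  "parameters": {
    "messages": [
      { "role": "user", "content": "Summarize the reviews for product 42." }
    ]
  }
}
```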
Requested Feedback
We appreciate any and all feedback the community has.
Specifically, we are interested in information on the following topics: