ml-commons provides a set of common machine learning algorithms, e.g., k-means and linear regression, to help developers build ML-related features within OpenSearch.
Apache License 2.0 · 99 stars · 139 forks
[FEATURE] Enhance the AI connector framework to support 1) Async Prediction and 2) Prediction with Streaming Response #2484
Is your feature request related to a problem?
This feature proposes two enhancements to improve the ml-commons connector framework: asynchronous prediction and prediction with streaming responses.
Currently the connector framework supports only one way to invoke remote models: realtime prediction through synchronous API calls. This realtime invocation cannot handle the batch inference proposed in https://github.com/opensearch-project/ml-commons/issues/1840; an important prerequisite for batch inference is the ability to invoke endpoints asynchronously (offline). In addition to Bedrock, which offers async model prediction, SageMaker also provides an API to invoke endpoints in async mode (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html). Unlike an async HTTP client connection, this kind of async prediction usually requires the request payloads and responses to be stored in a storage service such as S3.
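As an illustration (not an existing ml-commons integration), a minimal sketch of such an async invocation with the AWS SDK for Java v2 might look like the following; the endpoint name and S3 locations are placeholders, and how this would be wired into the connector framework is not shown here.

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sagemakerruntime.SageMakerRuntimeClient;
import software.amazon.awssdk.services.sagemakerruntime.model.InvokeEndpointAsyncRequest;
import software.amazon.awssdk.services.sagemakerruntime.model.InvokeEndpointAsyncResponse;

public class AsyncInvokeSketch {
    public static void main(String[] args) {
        try (SageMakerRuntimeClient client = SageMakerRuntimeClient.builder()
                .region(Region.US_EAST_1)
                .build()) {

            // The request payload is not sent inline; it must already be uploaded to S3.
            InvokeEndpointAsyncRequest request = InvokeEndpointAsyncRequest.builder()
                    .endpointName("my-async-endpoint")                   // placeholder endpoint name
                    .contentType("application/json")
                    .inputLocation("s3://my-bucket/input/payload.json")  // placeholder input payload in S3
                    .build();

            // The call returns immediately with an inference id and the S3 location
            // where the result will be written once the endpoint finishes processing.
            InvokeEndpointAsyncResponse response = client.invokeEndpointAsync(request);
            System.out.println("Inference id:    " + response.inferenceId());
            System.out.println("Output location: " + response.outputLocation());
        }
    }
}
```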
The benefits of this async model invocation are multifold:
Non-blocking Execution: ml-commons can continue performing other tasks after issuing the async prediction call, without waiting for the response.
Improved Responsiveness: Especially useful in web applications, or in any application where responsiveness is critical and written records of requests are needed.
Concurrency: Async endpoint invocation APIs are typically implemented with an internal queueing mechanism (https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html), so they avoid the common realtime throttling problems discussed in https://github.com/opensearch-project/ml-commons/issues/2249.
The benefits of model invocation with streaming responses can be summarized as follows:
Reduced Latency: Begin processing data as soon as it starts arriving. This greatly reduces the latency observed in features such as the query assistant.
Improved Efficiency and Reduced Memory Usage: Handle large volumes of data without loading all of it into memory at once, which lowers memory usage and improves service stability.
Improved User Experience: Provide real-time feedback or updates based on the incoming data.
What solution would you like?
Integrate the Async Invoke Model APIs from SageMaker and Bedrock.
Integrate the streaming payload response APIs from SageMaker and OpenAI, and make the integration general enough to support other platforms (see the sketch below).
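As a hypothetical sketch only, consuming an OpenAI-style streaming response over Server-Sent Events with the JDK HTTP client could look like this; the endpoint, model name, and API key handling are placeholders, and the actual connector integration is out of scope here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StreamingChatSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder request body in the OpenAI chat completions format with
        // "stream": true so partial tokens arrive as Server-Sent Events.
        String body = """
                {"model":"gpt-4o-mini",
                 "stream":true,
                 "messages":[{"role":"user","content":"Hello"}]}""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpClient client = HttpClient.newHttpClient();

        // Consume the response line by line as it arrives instead of buffering the
        // whole payload; each "data: {...}" line carries one partial response chunk.
        client.send(request, HttpResponse.BodyHandlers.ofLines())
              .body()
              .filter(line -> line.startsWith("data: ") && !line.contains("[DONE]"))
              .forEach(line -> System.out.println(line.substring("data: ".length())));
    }
}
```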
What alternatives have you considered?
The implementation should be done in a general way that can easily be extended to new model serving platforms; one possible abstraction is sketched below.
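The interface and method names below are purely illustrative and are not part of the existing ml-commons code; they only show one shape such an extension point could take.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/**
 * Illustrative abstraction (not existing ml-commons code): each remote model
 * platform (SageMaker, Bedrock, OpenAI, ...) would supply its own implementation,
 * so adding a new platform would not require changes to the core connector framework.
 */
public interface RemoteInferenceExecutor {

    /** Fire-and-forget invocation; the future completes with a handle (e.g. an
     *  inference id and output location) rather than the prediction itself. */
    CompletableFuture<Map<String, String>> invokeAsync(String payload);

    /** Streaming invocation; partial response chunks are pushed to the consumer
     *  as they arrive from the remote endpoint. */
    void invokeStreaming(String payload, Consumer<String> onChunk, Runnable onComplete);
}
```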
Do you have any additional context?