muthup opened this issue 1 year ago
Thank you for this request, @muthup. Do you think Data Prepper would use ML models from the OpenSearch cluster itself? That is, would you want it to retrieve the model by model ID from OpenSearch? Or are you looking to get the model from elsewhere?
AWS Bedrock also supports a few different models for embeddings that can be invoked via an API. Users could supply the modelId as part of their YAML.
Example API invocation using the AWS CLI:
% aws bedrock-runtime invoke-model --model-id "amazon.titan-embed-text-v1" --body '{ "inputText": "my text" }' --cli-binary-format raw-in-base64-out embedding-output.json
https://docs.aws.amazon.com/bedrock/latest/userguide/embeddings.html
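A rough sketch of how that might look in a Data Prepper pipeline YAML, assuming a hypothetical `text_embedding` processor that takes a Bedrock model ID (the processor name and its options are illustrative only, not an existing Data Prepper feature):

```yaml
embedding-pipeline:
  source:
    http:
      ssl: false                            # plain HTTP for local testing
  processor:
    - text_embedding:                       # hypothetical processor, not implemented today
        model_provider: "bedrock"           # illustrative option names
        model_id: "amazon.titan-embed-text-v1"
        source_key: "text"                  # field whose value is embedded
        target_key: "text_embedding"        # field where the vector is written
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "my-vector-index"
```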
@dlvenable @travisbenedict I guess we have to discuss whether we want to do this as a bump-in-the-wire (i.e., call an external AI/ML service and wait for the response) or some other way.
I think ideally we can have the models in Data Prepper. Right now, OpenSearch stores models and the OpenSearch ingest pipeline will use the models it has.
Also, it does not appear that OpenSearch's model APIs support external requests to create vectors. We could add this if we didn't want to store the models in Data Prepper. But we'd need to be able to do this in batches to avoid a multitude of requests.
Thinking more broadly, there are other cases where you need to enrich data in the pipe with a callout to other system(s). You could be bringing together data from multiple systems, pulling up-to-the-second information, or generating embeddings. Maybe the best thing is a more generic "call-out" processor that you configure to connect with these systems and supply a form of query.
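To make the idea concrete, a hypothetical configuration for such a call-out processor might look something like this (every option name here is invented for illustration; nothing like this exists in Data Prepper today):

```yaml
processor:
  - call_out:                                             # hypothetical generic enrichment processor
      endpoint: "https://enrichment.example.com/v1/lookup"
      method: "POST"
      request_template: '{ "query": "${/customer_id}" }'  # build the payload from event fields
      response_key: "enrichment"                          # merge the response under this key
      batch_size: 50                                       # batch events to avoid one call per record
      request_timeout: "2s"
```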
As far as embeddings go, DP should call through OpenSearch. Calling through OpenSearch ensures that the model used is the same, the configuration is in one place, and DP does not then have to run ml-commons somewhere. The configuration in OpenSearch is challenging and interactive (you have to run tasks, collect IDs, etc.). We should not duplicate that effort, and I don't think it can be declarative in the config. To me, multiple configurations are a deal-breaker: the config should be done one time, and in one place.
I understand calling OpenSearch will require a new API (to generate an embedding), and that's maybe not an API we want to support. We'll have to figure out how to accomplish that.
@travisbenedict - I've seen in our connector blueprints where we maybe need some pre- and post-processing for working with each of the model hosting services. Will we be able to cover what we need with a simple API?
We need to think about auth and access control as well for 3P hosting. Within AWS we can accomplish A&AC with IAM but I don't know how that works for other 3P hosts (uname/pw? key?) and other cloud providers. That's another reason to roll all of this into OpenSearch itself so the security config is in one place with the connector config, the model, the model group, etc. etc.
Having a text embedding processor in Data Prepper will help reduce this CPU need on OpenSearch and can help with the emerging use cases for OpenSearch.
@muthup actually the model inference runs on a remote server like SageMaker or OpenAI, so having a text embedding processor in Data Prepper won't help much in reducing the CPU need for the OS cluster.
I've seen in our connector blueprints where we maybe need some pre- and post-processing for working with each of the model hosting services. Will we be able to cover what we need with a simple API?
@Jon-AtAWS you are right, it's not just a simple inference API. We need the blueprint support, and different pre/post-processing for the various ML inference services, not to mention the logic we put into the ingestion pipeline to support combined embedding/text ingestion. I strongly suggest not replicating that work in Data Prepper. Instead we should leverage what we've built for OS, e.g., let Data Prepper call the OS ingestion pipeline to do the job.
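For example, assuming the Data Prepper `opensearch` sink's `pipeline` option can reference an OpenSearch ingest pipeline that already runs the `text_embedding` processor configured through ml-commons (sketch only; the pipeline name below is illustrative), the embedding work and its configuration stay entirely inside OpenSearch:

```yaml
sink:
  - opensearch:
      hosts: ["https://localhost:9200"]
      index: "my-vector-index"
      pipeline: "nlp-ingest-pipeline"   # existing OpenSearch ingest pipeline with text_embedding
```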
It looks like these are the main requirements we should design for:
1. An auto-scalable cluster that can connect to the desired external service for any needed enrichment (say vectors, etc.) in an asynchronous fashion, and can do all of this in a streaming way without overloading the JVMs of data nodes.
2. Joining data after the response from external services and merging the enrichments into the documents hosted in OpenSearch.
The first focus can be to build this with external services like Bedrock or SageMaker.
@sharraj - I think for vectors, we also want to build API support in OpenSearch for the OSI plugin to call OpenSearch to get the embeddings.
@Jon-AtAWS If we build an API to call OpenSearch, the model code would be running on the Data Prepper nodes, right? Don't we need compute nodes with GPUs for some/all models to run efficiently?
I don't want to have to set up connectors, models, security, etc. etc. 2 times. So, I don't know that the model code could be running in DP. Even if there's a way for DP to pull the config from OpenSearch, I'm not sure that it's right to run ml-commons on DP workers.
Occam's razor says have one config and one impl. That means being able to call OpenSearch.
If the model is hosted remotely, then you don't have to run GPUs in the cluster (or in DP).
@Jon-AtAWS completely agree. The ideal way is for OpenSearch to provide an API that works like an RPC call (with compute running on GPU nodes that are part of the OpenSearch cluster). But the original description of the issue (by Muthu) seems to indicate that we want to offload "compute" from the OpenSearch cluster.
Purely from a Data Prepper perspective, I prefer RPC calls to OpenSearch/SageMaker/Bedrock etc., where compute (and any storage for checkpointing, etc.) is handled by the remote node (in this case OpenSearch/SageMaker/Bedrock).
Vector search is a feature of OpenSearch that is gaining prominence with the recent emergence of generative AI and other ML use cases. Currently, users of OpenSearch either use external embedding methods or use OpenSearch ingest pipelines to generate text embeddings. Other processors currently present in the ingest pipelines are Append, Bytes, Convert, CSV, Date, IP2Geo, Lowercase, and Text embedding. However, using these processors, especially text embedding, causes additional CPU usage on OpenSearch. Data Prepper helps with preparing and ingesting data into OpenSearch. Having a text embedding processor in Data Prepper will help reduce this CPU need on OpenSearch and can help with the emerging use cases for OpenSearch.