opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0

[FEATURE] OpenSearch Feature Brief - Multimodal Search Support for Neural Search #473

Open · dylan-tong-aws opened this issue 11 months ago

dylan-tong-aws commented 11 months ago

What are you proposing?

We’re adding multimodal (text and image) search support to our Neural Search experience. This will enable users to add multimodal search to OpenSearch-powered applications without having to build and manage custom middleware to integrate multimodal models into OpenSearch workflows.

Text and image multimodal search enables users to search image and text pairs, like product catalog items (product image and description), based on visual and semantic similarity. This enables new search experiences that can deliver more relevant results. For instance, users can search for “white blouse” to retrieve product images; the machine learning (ML) model that powers this experience is able to associate semantics and visual characteristics. Unlike traditional methods, there is no requirement to manually manage and index metadata to enable comparable search capabilities. Furthermore, users can also search by image to retrieve visually similar products. Lastly, users can search using both text and image, such as finding the products most similar to a particular product catalog item based on semantic and visual similarity.

We want to enable this capability via the Neural Search experience so that OpenSearch users can infuse multimodal search capabilities—like they can for semantic search—into applications with less effort to accelerate their rate of innovation.

Which users have asked for this feature?

This feature was driven by AWS customer demand.

What problems are you trying to solve?

Text and image multimodal search will help our customers improve image search relevance. Traditional image search is text-based search in disguise: it requires manual labor to create metadata describing each image, a process that is hard to scale due to the speed and cost of that labor. As a result, the relevance and freshness of traditional image search are limited by economics and by the ability to maintain high-quality metadata.

Multimodal search leverages multimodal embedding models that are trained to understand semantic and visual similarity, enabling the search experiences described above without having to produce and maintain image metadata. Furthermore, users can perform visual similarity search. It’s not always easy to describe an image in words; this feature gives users the option to match images by visual similarity, helping them discover relevant images when visual characteristics are hard to describe.

What is the developer experience going to be?

The developer experience will be the same as the neural search experience for semantic search, except that we’re adding enhancements to allow users to provide image data via the query, index, and ingest processor APIs (see the sketch below). Initially, the feature will be powered by an AI connector to Amazon Bedrock’s multimodal API. New connectors can be added based on user demand and community contributions.
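To make this concrete, here is a rough sketch of what the ingest-side experience could look like, modeled on the existing text_embedding processor used for semantic search. The processor name (text_image_embedding), the pipeline and index names, the field names, and the model ID are illustrative placeholders, not a committed API:

```
# Hypothetical ingest pipeline with a multimodal embedding processor
PUT /_ingest/pipeline/multimodal-ingest-pipeline
{
  "description": "Generate a joint embedding from a text field and an image field",
  "processors": [
    {
      "text_image_embedding": {
        "model_id": "<multimodal-model-id>",
        "embedding": "vector_embedding",
        "field_map": {
          "text": "product_description",
          "image": "product_image"
        }
      }
    }
  ]
}

# Index that stores the generated vector alongside the source fields
PUT /products
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "multimodal-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "vector_embedding": { "type": "knn_vector", "dimension": 1024 },
      "product_description": { "type": "text" },
      "product_image": { "type": "binary" }
    }
  }
}
```

Documents indexed into this index would then carry both the original text/image fields and the embedding produced by the connected multimodal model.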

Are there any security considerations?

We’re building this feature on the existing security controls created for semantic search. We’ll support the same granular security controls.

Are there any breaking changes to the API?

No.

What is the user experience going to be?

The end user experience will be the same as what we’ve provided for semantic search via the neural search experience. Multimodal search is powered by our vector search engine (k-NN), but users won’t have to write vector search queries themselves. Instead, they can query with text, an image (binary type), or a text and image pair, as sketched below.
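As an illustration of the proposed query experience, the neural query clause could accept an image alongside (or instead of) text. The query_image parameter, index and field names, and model ID below are assumptions for the sake of the sketch:

```
GET /products/_search
{
  "query": {
    "neural": {
      "vector_embedding": {
        "query_text": "white blouse",
        "query_image": "<base64-encoded image>",
        "model_id": "<multimodal-model-id>",
        "k": 10
      }
    }
  }
}
```

Omitting query_image would give text-only search, and omitting query_text would give image-only (visual similarity) search.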

Are there breaking changes to the User Experience?

No.

Why should it be built? Any reason not to?

Refer to the first response to the what/why question.

Any reason why we shouldn’t build this? Some developers will want full flexibility, and they might choose to build their multimodal (vector) search application directly on our core vector search engine (k-NN). We’ll continue to support users with this option while working on improving our framework so that we provide users with a simpler solution with minimal constraints.

What will it take to execute?

We’ll be enhancing the neural search plugin APIs and creating a new AI connector for Amazon Bedrock multimodal APIs.
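For context, ml-commons already provides a connector creation API, and a Bedrock multimodal connector could follow a blueprint along these lines. The model name, endpoint URL, and request body shown here are illustrative assumptions rather than a finalized blueprint:

```
POST /_plugins/_ml/connectors/_create
{
  "name": "Amazon Bedrock multimodal embedding connector (illustrative)",
  "description": "Calls a Bedrock multimodal embedding model to embed text and/or image inputs",
  "version": 1,
  "protocol": "aws_sigv4",
  "parameters": {
    "region": "us-east-1",
    "service_name": "bedrock",
    "model": "amazon.titan-embed-image-v1"
  },
  "credential": {
    "access_key": "<aws-access-key>",
    "secret_key": "<aws-secret-key>"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://bedrock-runtime.${parameters.region}.amazonaws.com/model/${parameters.model}/invoke",
      "headers": { "content-type": "application/json" },
      "request_body": "{ \"inputText\": \"${parameters.inputText}\", \"inputImage\": \"${parameters.inputImage}\" }"
    }
  ]
}
```

The neural search plugin’s ingest processor and query clause would then resolve text/image inputs to this connector-backed model through the existing ml-commons model APIs.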

Any remaining open questions?

Community feedback is welcome.

noCharger commented 11 months ago

@dylan-tong-aws do you think this could go into a specific plugin like ml-commons?

msfroh commented 10 months ago

@opensearch-project/admin -- Can we transfer this to the neural-search repo?

navneet1v commented 10 months ago

The developer experience will be the same as the neural search experience for semantic search except we’re adding enhancements to allow users to provide image data via the query, index and ingest processor APIs

@dylan-tong-aws can you elaborate on this more? I am not sure what we want here.