[FEATURE] Allow more content types for neural query with multimodal

martin-gaievski commented 8 months ago

Is your feature request related to a problem?

Currently, neural-search supports only text and image fields for generating embeddings, in both ingestion and search. Other content types, such as audio or video, are not supported today; for example, search accepts only the query_text and query_image fields.
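
For context, here is a rough sketch of how a multimodal neural query body is assembled today, to show that only the text and image inputs are available. The helper name `build_neural_query`, the field name `vector_embedding`, and the model ID are placeholders for illustration, not part of the plugin.

```python
import base64
import json


def build_neural_query(field, model_id, query_text=None, image_bytes=None, k=10):
    """Build a neural query body using the two inputs supported today.

    `field` and `model_id` are placeholders; the caller must supply values
    that match their own index mapping and deployed model.
    """
    if query_text is None and image_bytes is None:
        raise ValueError("Provide query_text, image_bytes, or both")

    clause = {"model_id": model_id, "k": k}
    if query_text is not None:
        clause["query_text"] = query_text
    if image_bytes is not None:
        # The image is passed as a base64-encoded string.
        clause["query_image"] = base64.b64encode(image_bytes).decode("ascii")

    # There is no equivalent slot for audio or video content today.
    return {"query": {"neural": {field: clause}}}


if __name__ == "__main__":
    body = build_neural_query(
        "vector_embedding",      # hypothetical knn_vector field
        "my-multimodal-model",   # hypothetical model ID
        query_text="wild animals",
        image_bytes=b"\x89PNG...",  # stand-in bytes, not a real image
    )
    print(json.dumps(body, indent=2))
```

A request for audio or video would need additional inputs alongside `query_text` and `query_image`, which is what this issue asks for.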

What solution would you like?

Ability to pass content like audio or video for data ingestion and search.

What alternatives have you considered?

We could use other solutions to generate embeddings for audio or video content, and then post-process the results from OpenSearch and the other systems.

Do you have any additional context?

It's a good extension for https://github.com/opensearch-project/neural-search/issues/318

Sanjana679 commented 7 months ago

For videos, does it make sense to extract all the frames in a video and then generate embeddings for each frame? Likewise, for audio, would it make sense to make a transcription of the audio and then generate embeddings on the transcript?

I imagine there are issues with these approaches, but these were my first thoughts and I was wondering if anyone had suggestions for something better.

heemin32 commented 7 months ago

For videos, per-frame embeddings make sense. For audio, transcription will lose some information, such as the intonation or volume of the audio.
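
As a rough illustration of the per-frame idea discussed above, the sketch below embeds every Nth frame of an already-decoded video. Everything here is hypothetical: `sample_frame_indices`, `embed_video_frames`, and the `embed_frame` callable stand in for whatever decoder and image-embedding model would actually be used, and sampling with a stride is an assumption added to avoid embedding every single frame.

```python
from typing import Callable, Dict, List, Sequence


def sample_frame_indices(total_frames: int, stride: int) -> List[int]:
    """Return the indices of every `stride`-th frame, starting at 0."""
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(range(0, total_frames, stride))


def embed_video_frames(
    frames: Sequence[bytes],
    embed_frame: Callable[[bytes], List[float]],
    stride: int = 1,
) -> List[Dict[str, object]]:
    """Embed sampled frames and keep each frame's index with its vector.

    `embed_frame` is a placeholder for a real image-embedding call.
    """
    return [
        {"frame_index": i, "embedding": embed_frame(frames[i])}
        for i in sample_frame_indices(len(frames), stride)
    ]


if __name__ == "__main__":
    # Fake frames and a fake embedder, only to show the data flow.
    fake_frames = [bytes([n]) * (n + 1) for n in range(10)]
    fake_embedder = lambda frame: [float(len(frame))]
    for doc in embed_video_frames(fake_frames, fake_embedder, stride=4):
        print(doc)
```

Keeping the frame index next to each vector lets a search hit be mapped back to a position in the video. Whether each frame becomes its own document or a nested vector in one document is left open here.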
