jina-ai / serve

☁️ Build multimodal AI applications with cloud-native stack
https://jina.ai/serve
Apache License 2.0

Is multimodal search in the roadmap? #987

Closed fsal closed 4 years ago

fsal commented 4 years ago

I think Jina is a very promising search tool, and I'd like to thank all of you for this great work!

Describe the feature

I'd like to know whether implementing multimodal search is on your roadmap, i.e. making it possible to define a query that uses both text and images. My use case requires encoding queries composed of both images and text.

From what I've seen (correct me if I'm wrong), at the moment it is only possible to query with a single mime_type per request, and there are no drivers that can parse a message with more than one mime_type.

I've seen that @JoanFM already committed some tests and opened some PRs (#712, #704) to address this use case back in July, but I haven't found any progress in the last few months, so I wonder whether adding such a feature is still on your roadmap.

JoanFM commented 4 years ago

Hello @fsal,

It is indeed on our roadmap. It would also be good for us to learn about different use cases.

Does your use case require embedding documents with different modalities into a single vector indexer (somehow mixing the semantic information of the different modalities), or do you expect to index them in separate indexes and target either one depending on the query part?

fsal commented 4 years ago

In my use case, the text and the image of each query are first embedded separately (by a CNN and a Word2Vec vocabulary), then these two embeddings are fed to an encoder that produces a final embedding. This final embedding is used at query time to search the index. So it is a single index in which each document embedding depends on both modalities.
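
For illustration, here is a minimal sketch (in PyTorch, with hypothetical dimensions) of the kind of fusion step described above: each modality is embedded separately upstream, and a small encoder fuses the two embeddings into the single vector that is indexed and queried:

```python
# Hypothetical sketch only: the modality embeddings (e.g. CNN image features,
# Word2Vec text features) are produced upstream; this module just fuses them.
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, out_dim=512):
        super().__init__()
        # Concatenate the two modality embeddings and project into a joint space.
        self.proj = nn.Sequential(
            nn.Linear(img_dim + txt_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, img_emb, txt_emb):
        return self.proj(torch.cat([img_emb, txt_emb], dim=-1))

# One joint vector per (image, text) pair, used both for indexing and querying.
encoder = FusionEncoder()
joint = encoder(torch.randn(1, 2048), torch.randn(1, 300))  # shape: (1, 512)
```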

bwanglzu commented 4 years ago

@fsal you're correct: the input could be of any modality (multimedia with multiple modalities), and we should train an encoder to map two (or even more) representations into a common feature space, then perform the matching there.

@JoanFM any thoughts on that? We could do w2v + cnn -> representation, or even X + Y + ... -> representation, where X, Y, Z could be text, image, audio, sensor signals, ...

JoanFM commented 4 years ago

Hey @bwanglzu, I agree, we should enable taking more than one input to the encoder. The inputs could also be of the same type, e.g. (text1, text2, ...) or (front image, back image).

fsal commented 4 years ago

These improvements would be great and would make Jina useful for many more use cases and companies. In the end, the advantage of neural search over classical textual or visual search is that you can index and query many different kinds of unstructured data with a single tool.

bwanglzu commented 4 years ago

@fsal thanks for your input, this is on our roadmap and we're working on it! The design starts this sprint and the implementation follows, please keep an eye on it :)

jina-bot commented 4 years ago

This issue is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 4 days

JoanFM commented 4 years ago

Hello @fsal ,

Since the merge of #1113 we have a driver that allows multimodal encoders to be used in Jina. We plan to build an example to showcase the usage, but you can already take a look at it and see how the tests work if you want. We also plan to add better documentation about it.
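
As a rough illustration of what such a multimodal encoder looks like conceptually (a hypothetical sketch, not the actual driver or encoder API introduced in #1113): it receives one array per modality and returns a single joint embedding per document:

```python
import numpy as np

class MultiModalEncoder:
    """Hypothetical interface: takes one batch of features per modality,
    returns one joint embedding per document."""

    def encode(self, *data: np.ndarray) -> np.ndarray:
        # Naive fusion for illustration: L2-normalize each modality, then concatenate.
        normed = [d / (np.linalg.norm(d, axis=-1, keepdims=True) + 1e-9) for d in data]
        return np.concatenate(normed, axis=-1)

# Each row is one document; text and image embeddings are fused into one vector.
text_emb = np.random.rand(4, 300)
image_emb = np.random.rand(4, 2048)
joint = MultiModalEncoder().encode(text_emb, image_emb)  # shape: (4, 2348)
```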

bwanglzu commented 3 years ago

Hi @fsal, we created an example for multimodal search here; it is an implementation of Composing Text and Image for Image Retrieval. If you're interested, try it out and give us some feedback.