Closed ggoggam closed 2 years ago
I think there are some image-text retrieval capabilities already, such as ViltForImageAndTextRetrieval. But these can only work for a toy example set of queries and keys due to their interaction-based nature. A true retrieval pipeline (cross-modal or not) would probably need datasets and faiss. I think that may be too complicated for a pipeline?
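For reference, the bi-encoder alternative this alludes to precomputes one embedding per key and one per query, so scoring costs one matrix product instead of one interaction-model forward pass per (query, key) pair. A minimal sketch, using exact inner-product search in numpy as a stand-in for a faiss index (all names here are illustrative, not an existing transformers API):

```python
import numpy as np

def build_index(key_embeddings: np.ndarray) -> np.ndarray:
    # Normalize once so inner product equals cosine similarity.
    # At scale, this is where a faiss index (e.g. an exact inner-product
    # index, or an approximate one) would be built instead.
    norms = np.linalg.norm(key_embeddings, axis=1, keepdims=True)
    return key_embeddings / norms

def search(index: np.ndarray, query_embeddings: np.ndarray, k: int):
    # Normalize queries, score all keys at once, return top-k key ids.
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    scores = q @ index.T                      # shape: (num_queries, num_keys)
    top = np.argsort(-scores, axis=1)[:, :k]  # ranked key ids per query
    return top, np.take_along_axis(scores, top, axis=1)

# Toy example: 4 keys and 2 queries in a 3-d embedding space. In the
# cross-modal case these vectors would come from e.g. CLIP's image and
# text encoders respectively.
keys = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 0.0, 1.0]])
queries = np.array([[1.0, 0.05, 0.0],
                    [0.0, 0.0, 0.9]])
ids, sims = search(build_index(keys), queries, k=2)
```

The point of the precomputation is that adding a new query costs O(num_keys) dot products, whereas an interaction-based model like ViLT needs a full forward pass for every (query, key) pair.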
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Feature request
Given queries and keys, the proposed pipeline returns a ranked list of keys that are most similar to each respective query. This pipeline should support uni-modal and cross-modal retrieval, i.e. retrieval within a single modality (text-to-text, image-to-image) as well as across modalities (text-to-image, image-to-text).
Prominent use cases would be:
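As a sketch of the requested behavior (the function name, signature, and toy string-overlap similarity below are illustrative placeholders, not an existing transformers API), the pipeline would return, for each query, the keys sorted by descending similarity:

```python
def rank_keys(similarity, queries, keys, top_k=3):
    """For each query, return (key, score) pairs sorted by descending
    similarity. `similarity` is any callable scoring a (query, key) pair;
    in the cross-modal case it would compare e.g. a text query against
    image keys via a shared embedding space."""
    results = []
    for q in queries:
        scored = sorted(((similarity(q, k), k) for k in keys), reverse=True)
        results.append([(k, s) for s, k in scored[:top_k]])
    return results

# Uni-modal toy example: Jaccard overlap of characters as the similarity.
sim = lambda q, k: len(set(q) & set(k)) / len(set(q) | set(k))
out = rank_keys(sim, ["cat"], ["bat", "dog", "coat"], top_k=2)
# out[0] ranks "coat" first, then "bat"
```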
Motivation
I was looking to use CLIP for cross-modal retrieval, but the current CLIP pipeline does not seem to support it. I believe there is demand for this pipeline.
Your contribution