Closed ggoggam closed 2 years ago
I think there are some image-text retrieval capabilities already, such as ViltForImageAndTextRetrieval. But these can only work for a toy example set of queries and keys due to their interaction-based nature. A true retrieval pipeline (cross-modal or not) would probably need datasets and faiss. I think that may be too complicated for a pipeline?
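For reference, the bi-encoder alternative this alludes to precomputes one embedding per key and one per query, so scoring costs one matrix product instead of one interaction-model forward pass per (query, key) pair. A minimal sketch, using exact inner-product search in numpy as a stand-in for a faiss index (all names here are illustrative, not an existing transformers API):

```python
import numpy as np

def build_index(key_embeddings: np.ndarray) -> np.ndarray:
    # Normalize once so inner product equals cosine similarity.
    # At scale, this is where a faiss index (e.g. an exact inner-product
    # index, or an approximate one) would be built instead.
    norms = np.linalg.norm(key_embeddings, axis=1, keepdims=True)
    return key_embeddings / norms

def search(index: np.ndarray, query_embeddings: np.ndarray, k: int):
    # Normalize queries, score all keys at once, return top-k key ids.
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    scores = q @ index.T                      # shape: (num_queries, num_keys)
    top = np.argsort(-scores, axis=1)[:, :k]  # ranked key ids per query
    return top, np.take_along_axis(scores, top, axis=1)

# Toy example: 4 keys and 2 queries in a 3-d embedding space. In the
# cross-modal case these vectors would come from e.g. CLIP's image and
# text encoders respectively.
keys = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 0.0, 1.0]])
queries = np.array([[1.0, 0.05, 0.0],
                    [0.0, 0.0, 0.9]])
ids, sims = search(build_index(keys), queries, k=2)
```

The point of the precomputation is that adding a new query costs O(num_keys) dot products, whereas an interaction-based model like ViLT needs a full forward pass for every (query, key) pair.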
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Feature request
Given queries and keys, the proposed pipeline returns a ranked list of keys that are most similar to each respective query. This pipeline should support uni-modal and cross-modal retrieval, i.e. retrieval within a single modality (text-to-text, image-to-image) as well as across modalities (text-to-image, image-to-text).
Prominent use cases would be:
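As a sketch of the requested behavior (the function name, signature, and toy string-overlap similarity below are illustrative placeholders, not an existing transformers API), the pipeline would return, for each query, the keys sorted by descending similarity:

```python
def rank_keys(similarity, queries, keys, top_k=3):
    """For each query, return (key, score) pairs sorted by descending
    similarity. `similarity` is any callable scoring a (query, key) pair;
    in the cross-modal case it would compare e.g. a text query against
    image keys via a shared embedding space."""
    results = []
    for q in queries:
        scored = sorted(((similarity(q, k), k) for k in keys), reverse=True)
        results.append([(k, s) for s, k in scored[:top_k]])
    return results

# Uni-modal toy example: Jaccard overlap of characters as the similarity.
sim = lambda q, k: len(set(q) & set(k)) / len(set(q) | set(k))
out = rank_keys(sim, ["cat"], ["bat", "dog", "coat"], top_k=2)
# out[0] ranks "coat" first, then "bat"
```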
Motivation
I was looking to use CLIP for cross-modal retrieval, but the current CLIP pipeline does not seem to support it. I believe there is demand for this pipeline.
Your contribution