I'm curious about how to do cross-modal retrieval with the YouTube-8M dataset. I have videos with image and audio data, and would like to learn two encoders that embed both audio and RGB data into the same space, such that nearest neighbor lookups could be performed with audio embeddings to find related images, and vice versa.
Is there an easy way to extend the loss functions required by `SimilarityModel` to support two input heads?
Dataset signature:
`(features, labels) = ({'rgb': ..., 'audio': ...}, {'video_id': ...})`
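In case it helps to make the setup concrete, here is a rough sketch of what I'm picturing if I bypass `SimilarityModel` entirely and just use a plain Keras custom `train_step` with a CLIP-style in-batch contrastive loss, treating the RGB/audio pair from the same video as the positive. The encoder sizes, feature dimensions, and names below are placeholders, not anything from the library:

```python
import tensorflow as tf

EMBED_DIM = 128  # placeholder shared embedding size

def make_encoder(input_dim, name):
    """Small MLP projecting one modality into the shared embedding space."""
    inputs = tf.keras.Input(shape=(input_dim,), name=name)
    x = tf.keras.layers.Dense(512, activation="relu")(inputs)
    x = tf.keras.layers.Dense(EMBED_DIM)(x)
    # L2-normalize so cosine similarity reduces to a dot product for NN lookups.
    outputs = tf.keras.layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(x)
    return tf.keras.Model(inputs, outputs, name=f"{name}_encoder")

# Assumed feature dims (1024-d RGB, 128-d audio); adjust to the actual dataset.
rgb_encoder = make_encoder(1024, "rgb")
audio_encoder = make_encoder(128, "audio")

class CrossModalModel(tf.keras.Model):
    """Two-tower model with a symmetric in-batch contrastive (InfoNCE) loss."""

    def __init__(self, rgb_encoder, audio_encoder, temperature=0.07):
        super().__init__()
        self.rgb_encoder = rgb_encoder
        self.audio_encoder = audio_encoder
        self.temperature = temperature

    def train_step(self, data):
        features, _ = data  # labels ({'video_id': ...}) unused; pairing is implicit
        with tf.GradientTape() as tape:
            rgb_emb = self.rgb_encoder(features["rgb"], training=True)
            audio_emb = self.audio_encoder(features["audio"], training=True)
            # Pairwise similarity matrix; the diagonal holds the true pairs.
            logits = tf.matmul(rgb_emb, audio_emb, transpose_b=True) / self.temperature
            targets = tf.range(tf.shape(logits)[0])
            loss_rgb2audio = tf.keras.losses.sparse_categorical_crossentropy(
                targets, logits, from_logits=True)
            loss_audio2rgb = tf.keras.losses.sparse_categorical_crossentropy(
                targets, tf.transpose(logits), from_logits=True)
            loss = (tf.reduce_mean(loss_rgb2audio) + tf.reduce_mean(loss_audio2rgb)) / 2.0
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

model = CrossModalModel(rgb_encoder, audio_encoder)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4))
# model.fit(dataset, epochs=10)  # dataset yields ({'rgb': ..., 'audio': ...}, {'video_id': ...})
```

This works, but it drops all of the `SimilarityModel` conveniences (indexing, `single_lookup`, calibration), which is why I'd prefer a supported way to feed two input heads into the existing loss functions if one exists.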