I'm curious about how to do cross-modal retrieval with the YouTube-8M dataset. I have videos with image and audio data, and would like to learn two encoders that embed both audio and RGB data into the same space, such that nearest neighbor lookups could be performed with audio embeddings to find related images, and vice versa.
Is there an easy way to extend the loss functions required by `SimilarityModel` to support two input heads?
Dataset signature:
`(features, labels) = ({'rgb': ..., 'audio': ...}, {'video_id': ...})`
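In case it helps to make the setup concrete, here is a rough sketch of what I'm picturing if I bypass `SimilarityModel` entirely and just use a plain Keras custom `train_step` with a CLIP-style in-batch contrastive loss, treating the RGB/audio pair from the same video as the positive. The encoder sizes, feature dimensions, and names below are placeholders, not anything from the library:

```python
import tensorflow as tf

EMBED_DIM = 128  # placeholder shared embedding size

def make_encoder(input_dim, name):
    """Small MLP projecting one modality into the shared embedding space."""
    inputs = tf.keras.Input(shape=(input_dim,), name=name)
    x = tf.keras.layers.Dense(512, activation="relu")(inputs)
    x = tf.keras.layers.Dense(EMBED_DIM)(x)
    # L2-normalize so cosine similarity reduces to a dot product for NN lookups.
    outputs = tf.keras.layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(x)
    return tf.keras.Model(inputs, outputs, name=f"{name}_encoder")

# Assumed feature dims (1024-d RGB, 128-d audio); adjust to the actual dataset.
rgb_encoder = make_encoder(1024, "rgb")
audio_encoder = make_encoder(128, "audio")

class CrossModalModel(tf.keras.Model):
    """Two-tower model with a symmetric in-batch contrastive (InfoNCE) loss."""

    def __init__(self, rgb_encoder, audio_encoder, temperature=0.07):
        super().__init__()
        self.rgb_encoder = rgb_encoder
        self.audio_encoder = audio_encoder
        self.temperature = temperature

    def train_step(self, data):
        features, _ = data  # labels ({'video_id': ...}) unused; pairing is implicit
        with tf.GradientTape() as tape:
            rgb_emb = self.rgb_encoder(features["rgb"], training=True)
            audio_emb = self.audio_encoder(features["audio"], training=True)
            # Pairwise similarity matrix; the diagonal holds the true pairs.
            logits = tf.matmul(rgb_emb, audio_emb, transpose_b=True) / self.temperature
            targets = tf.range(tf.shape(logits)[0])
            loss_rgb2audio = tf.keras.losses.sparse_categorical_crossentropy(
                targets, logits, from_logits=True)
            loss_audio2rgb = tf.keras.losses.sparse_categorical_crossentropy(
                targets, tf.transpose(logits), from_logits=True)
            loss = (tf.reduce_mean(loss_rgb2audio) + tf.reduce_mean(loss_audio2rgb)) / 2.0
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

model = CrossModalModel(rgb_encoder, audio_encoder)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4))
# model.fit(dataset, epochs=10)  # dataset yields ({'rgb': ..., 'audio': ...}, {'video_id': ...})
```

This works, but it drops all of the `SimilarityModel` conveniences (indexing, `single_lookup`, calibration), which is why I'd prefer a supported way to feed two input heads into the existing loss functions if one exists.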