! This paper is under review on the experimental track of the Journal of Visualization and Interaction.
Authors: @ginihumer
OC: @mjskay
AE: TBD
R1: TBD
R2: TBD
R3: TBD
Interactive article submitted to the Journal of Visualization and Interaction.
Introduction: Multi-modal contrastive learning models are trained to map data from two or more modalities to a shared embedding space. This latent data representation can then be used for zero- or few-shot classification, cross-modal data retrieval, or generation tasks. Although remarkable results have been reported when testing multi-modal models on these tasks, understanding the latent representations remains challenging. In particular, many multi-modal models exhibit a phenomenon called the “modality gap”: the embeddings of the different modalities occupy clearly separated regions of the shared latent space.
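To make the notion of a modality gap concrete, the sketch below quantifies it as the distance between the per-modality centroids in the shared embedding space, using synthetic, CLIP-like embeddings; the array shapes, the offset, and all numbers are invented purely for illustration and are not taken from the models discussed in this article.

```python
import numpy as np

# Synthetic, L2-normalized "image" and "text" embeddings standing in for the
# outputs of a CLIP-like encoder pair in a 128-dimensional shared space.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(500, 128))
text_emb = rng.normal(size=(500, 128)) + 1.5   # constant offset mimics a modality gap
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# One common way to quantify the modality gap: the distance between the
# centroids of the two modalities in the shared embedding space.
gap_vector = image_emb.mean(axis=0) - text_emb.mean(axis=0)
print(f"modality gap (centroid distance): {np.linalg.norm(gap_vector):.3f}")

# Cross-modal retrieval: for each image, rank all texts by cosine similarity.
similarity = image_emb @ text_emb.T            # shape: (n_images, n_texts)
best_text_per_image = similarity.argmax(axis=1)
```

With real encoders, `image_emb` and `text_emb` would be replaced by the L2-normalized outputs of the image and text encoders on paired data.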
Conclusion: This article introduces and compares three models trained on image-text pairs. We use these models and interactive visualizations to explain how the modality gap arises, how it can be closed, and why closing it is important. In the second part, we introduce “Amumo”, a framework we implemented for analyzing multi-modal models, and describe various analysis tasks it supports. In particular, Amumo can be used for (i) analyzing models, (ii) comparing models, and (iii) analyzing multi-modal datasets. We demonstrate Amumo’s capabilities and generalizability using image, text, audio, and molecule data in combination with several different models.
Implementation: To integrate smoothly into research workflows, Amumo is implemented as a Python package with Jupyter widgets. The interactive visualizations in this article are implemented in JavaScript with plotly.js.
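As a rough illustration of the notebook-based workflow, the sketch below wires a slider to an embedding-space measurement with ipywidgets. This is not Amumo’s actual API: the `modality_gap` helper and the synthetic embeddings are hypothetical, and the slider simply shifts the text embeddings along the gap vector to show the gap shrinking.

```python
# Minimal sketch (assumed names, not Amumo's API) of a Jupyter-widget workflow.
import ipywidgets as widgets
import numpy as np
from IPython.display import display

def modality_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Distance between the per-modality centroids (hypothetical helper)."""
    return float(np.linalg.norm(image_emb.mean(axis=0) - text_emb.mean(axis=0)))

# Synthetic embeddings standing in for a loaded multi-modal model.
rng = np.random.default_rng(42)
image_emb = rng.normal(size=(256, 64))
text_emb = rng.normal(size=(256, 64)) + 1.0

shift = widgets.FloatSlider(value=0.0, min=0.0, max=1.0, description="close gap")
output = widgets.Output()

def on_change(change):
    # Move the text centroid towards the image centroid and report the remaining gap.
    offset = (image_emb.mean(axis=0) - text_emb.mean(axis=0)) * change["new"]
    with output:
        output.clear_output()
        print(f"gap: {modality_gap(image_emb, text_emb + offset):.3f}")

shift.observe(on_change, names="value")
display(shift, output)
```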
Demonstration & Materials: A minimal usage demonstration of Amumo is deployed on MyBinder, and we also provide a demonstration of analyzing CLOOME with Amumo. The code for the Amumo Python package, along with guidelines on how to use it, can be found in the GitHub repository.