showlab / Show-o

Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
https://arxiv.org/abs/2408.12528
Apache License 2.0

Impact of Various Representations for Multimodal Understanding #13

Closed · Doctor-James closed this issue 3 weeks ago

Doctor-James commented 3 weeks ago

Thank you for your outstanding work. This paper follows the MAGVIT-v2 framework, utilizing a discrete tokenizer. However, the results from the ablation study and Table 1 indicate that continuous feature representation significantly outperforms discrete representation. Yet, the paper ultimately opts for discrete representation. What considerations led to this choice?

Sierkinhane commented 3 weeks ago

Hi, these results are for multimodal understanding, not generation. Besides, these continuous features come from CLIP-ViT, not the VAE typically used in diffusion models.

The discrepancy between continuous and discrete features for understanding may originate from the different scales of pre-training data. CLIP was pre-trained on around 400M image-text pairs, which is significantly larger than ours (35M). When the training data is at the same scale, as shown in exp4 and exp6 in Table 4, discrete tokens perform comparably to the continuous ones from MAGVIT-v2. Besides, the continuous features from CLIP-ViT are extracted from images at a resolution of 336x336, whereas the discrete tokens are extracted from images at 256x256.
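For concreteness, here is a minimal, hypothetical sketch (not the Show-o code) contrasting the two image representations discussed above. `VQTokenizer` and `ClipVisionEncoder` are toy stand-ins for a MAGVIT-v2-style discrete tokenizer and a CLIP-ViT feature extractor; the codebook size, patch sizes, and dimensions are illustrative, not the actual configuration.

```python
import torch
import torch.nn as nn

class VQTokenizer(nn.Module):
    """Toy MAGVIT-v2-style tokenizer: encode patches, then snap to the nearest codebook entry."""
    def __init__(self, codebook_size=8192, dim=16):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 256x256 -> 16x16 grid

    def forward(self, images):                          # images: (B, 3, 256, 256)
        z = self.encoder(images)                        # (B, dim, 16, 16)
        z = z.flatten(2).transpose(1, 2)                # (B, 256, dim)
        flat = z.reshape(-1, z.size(-1))                # (B*256, dim)
        dists = torch.cdist(flat, self.codebook.weight) # distance to every codebook entry
        return dists.argmin(dim=-1).view(z.size(0), -1) # (B, 256) integer token ids

class ClipVisionEncoder(nn.Module):
    """Toy CLIP-ViT stand-in: returns continuous patch embeddings instead of ids."""
    def __init__(self, dim=1024):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=14, stride=14)  # 336x336 -> 24x24

    def forward(self, images):                          # images: (B, 3, 336, 336)
        z = self.patch_embed(images)                    # (B, dim, 24, 24)
        return z.flatten(2).transpose(1, 2)             # (B, 576, dim) continuous features

discrete_ids   = VQTokenizer()(torch.randn(1, 3, 256, 256))        # discrete image tokens
continuous_fts = ClipVisionEncoder()(torch.randn(1, 3, 336, 336))  # continuous patch features
print(discrete_ids.shape, continuous_fts.shape)
```

The point of the comparison above is that the discrete path hands the transformer a sequence of integer ids drawn from a fixed vocabulary, while the continuous path hands it float vectors, and the two are also extracted at different resolutions (256x256 vs 336x336).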

The core reason we adopt discrete tokens is that the unified model can then be trained with a single, more unified learning objective: predicting discrete tokens for both text and images.
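To illustrate that point, here is a minimal sketch, assuming a shared vocabulary in which image codebook ids are simply offset past the text ids, so one transformer head and one cross-entropy loss cover both modalities. `TinyUnifiedTransformer`, the vocabulary sizes, and the plain next-token objective are illustrative assumptions; Show-o's actual architecture and the way image tokens are predicted differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_CODEBOOK = 32000, 8192
VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK            # image token ids are offset past text ids

class TinyUnifiedTransformer(nn.Module):
    def __init__(self, dim=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, VOCAB)      # one prediction head for both modalities

    def forward(self, ids):                    # ids: (B, L) mixed text + image tokens
        h = self.embed(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.backbone(h, mask=mask)        # causal attention over the mixed sequence
        return self.head(h)                    # (B, L, VOCAB) logits

# Toy sequence: text tokens followed by (offset) discrete image tokens.
text  = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(0, IMAGE_CODEBOOK, (1, 64)) + TEXT_VOCAB
seq   = torch.cat([text, image], dim=1)

model  = TinyUnifiedTransformer()
logits = model(seq[:, :-1])                    # predict the next token at every position
loss   = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
```

With continuous image features, the image positions would instead need a separate regression or contrastive objective; with discrete tokens, both modalities share the same prediction target and loss.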