showlab / Show-o

Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
https://arxiv.org/abs/2408.12528
Apache License 2.0

Impact of Various Representations for Multimodal Understanding #13

Closed · Doctor-James closed this issue 3 weeks ago

Doctor-James commented 3 weeks ago

Thank you for your outstanding work. This paper follows the MAGVIT-v2 framework, utilizing a discrete tokenizer. However, the results from the ablation study and Table 1 indicate that continuous feature representation significantly outperforms discrete representation. Yet, the paper ultimately opts for discrete representation. What considerations led to this choice?

Sierkinhane commented 3 weeks ago

Hi, these results are for multimodal understanding, not generation. Besides, these continuous features come from CLIP-ViT, not the VAE typically used in diffusion models.

The discrepancy between continuous and discrete features for understanding may originate from the different scales of pre-training data. CLIP was pre-trained on around 400M image-text pairs, which is significantly larger than ours (35M). When the training data is at the same scale, as shown in exp4 and exp6 in Table 4, discrete tokens perform comparably to the continuous ones from MAGVIT-v2. Besides, the continuous features from CLIP-ViT are extracted from images at a resolution of 336x336, whereas the discrete tokens are extracted from images at 256x256.
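For concreteness, here is a minimal, hypothetical sketch (not the Show-o code) contrasting the two image representations discussed above. `VQTokenizer` and `ClipVisionEncoder` are toy stand-ins for a MAGVIT-v2-style discrete tokenizer and a CLIP-ViT feature extractor; the codebook size, patch sizes, and dimensions are illustrative, not the actual configuration.

```python
import torch
import torch.nn as nn

class VQTokenizer(nn.Module):
    """Toy MAGVIT-v2-style tokenizer: encode patches, then snap to the nearest codebook entry."""
    def __init__(self, codebook_size=8192, dim=16):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 256x256 -> 16x16 grid

    def forward(self, images):                          # images: (B, 3, 256, 256)
        z = self.encoder(images)                        # (B, dim, 16, 16)
        z = z.flatten(2).transpose(1, 2)                # (B, 256, dim)
        flat = z.reshape(-1, z.size(-1))                # (B*256, dim)
        dists = torch.cdist(flat, self.codebook.weight) # distance to every codebook entry
        return dists.argmin(dim=-1).view(z.size(0), -1) # (B, 256) integer token ids

class ClipVisionEncoder(nn.Module):
    """Toy CLIP-ViT stand-in: returns continuous patch embeddings instead of ids."""
    def __init__(self, dim=1024):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=14, stride=14)  # 336x336 -> 24x24

    def forward(self, images):                          # images: (B, 3, 336, 336)
        z = self.patch_embed(images)                    # (B, dim, 24, 24)
        return z.flatten(2).transpose(1, 2)             # (B, 576, dim) continuous features

discrete_ids   = VQTokenizer()(torch.randn(1, 3, 256, 256))        # discrete image tokens
continuous_fts = ClipVisionEncoder()(torch.randn(1, 3, 336, 336))  # continuous patch features
print(discrete_ids.shape, continuous_fts.shape)
```

The point of the comparison above is that the discrete path hands the transformer a sequence of integer ids drawn from a fixed vocabulary, while the continuous path hands it float vectors, and the two are also extracted at different resolutions (256x256 vs 336x336).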

The core reason we adopt discrete tokens is that the unified model can then be trained with a single, more unified learning objective: predicting discrete tokens for both text and images.
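To illustrate that point, here is a minimal sketch, assuming a shared vocabulary in which image codebook ids are simply offset past the text ids, so one transformer head and one cross-entropy loss cover both modalities. `TinyUnifiedTransformer`, the vocabulary sizes, and the plain next-token objective are illustrative assumptions; Show-o's actual architecture and the way image tokens are predicted differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_CODEBOOK = 32000, 8192
VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK            # image token ids are offset past text ids

class TinyUnifiedTransformer(nn.Module):
    def __init__(self, dim=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, VOCAB)      # one prediction head for both modalities

    def forward(self, ids):                    # ids: (B, L) mixed text + image tokens
        h = self.embed(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.backbone(h, mask=mask)        # causal attention over the mixed sequence
        return self.head(h)                    # (B, L, VOCAB) logits

# Toy sequence: text tokens followed by (offset) discrete image tokens.
text  = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(0, IMAGE_CODEBOOK, (1, 64)) + TEXT_VOCAB
seq   = torch.cat([text, image], dim=1)

model  = TinyUnifiedTransformer()
logits = model(seq[:, :-1])                    # predict the next token at every position
loss   = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
```

With continuous image features, the image positions would instead need a separate regression or contrastive objective; with discrete tokens, both modalities share the same prediction target and loss.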