showlab / Show-o

Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
https://arxiv.org/abs/2408.12528
Apache License 2.0

Please can you elaborate on the experimental setups for Table 4? #30

Closed: zhaoyanpeng closed this issue 2 months ago

zhaoyanpeng commented 2 months ago

Hey thank you for your work and for open-sourcing the code!

I did not quite get the experimental setups used for Table 4: which pre-trained base model were you using with CLIP-ViT, what tuning data did you use for multimodal understanding, and in particular, what is the difference between "unified pre-train" vs. the non-unified?

Could you please elaborate on those?

Thanks,

Sierkinhane commented 2 months ago

Hi Yanpeng, we follow LLaVA-v1.5 to perform multimodal understanding (the dataset details are provided in Section 5.1). We use CLIP (openai/clip-vit-large-patch14-336) to extract visual representations. "Unified pre-train" refers to the Show-o weights after pre-training stages 1 & 2, while "non-unified" refers to the raw weights of Phi-1.5.
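
For anyone wanting to see what the CLIP feature-extraction step looks like in practice, here is a minimal sketch using Hugging Face transformers. This is not the repository's actual pipeline; the file name and variable names are illustrative, and the choice of the second-to-last hidden layer simply follows the common LLaVA-style convention.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# The vision tower mentioned above.
model_name = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(model_name)
vision_tower = CLIPVisionModel.from_pretrained(model_name).eval()

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs, output_hidden_states=True)

# Patch-level features, dropping the [CLS] token, taken from the
# second-to-last layer as in LLaVA-style setups. These would then be
# projected into the language model's embedding space.
patch_features = outputs.hidden_states[-2][:, 1:, :]
print(patch_features.shape)  # e.g. torch.Size([1, 576, 1024]) for 336px input
```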

zhaoyanpeng commented 2 months ago

Thank you for your clarification; it is much clearer now :)