Hi yanpeng, we follow LLaVA-v1.5 to perform multimodal understanding (the dataset details are provided in Section 5.1). We use CLIP (openai/clip-vit-large-patch14-336) to extract visual representations. "Unified pre-train" refers to the Show-o weights after pre-training stages 1 & 2, while "non-unified" uses the raw Phi-1.5 weights.
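For reference, here is a minimal sketch of how visual features could be extracted with the CLIP encoder named above, using the HuggingFace transformers API. The image path and the choice of the penultimate hidden layer are illustrative assumptions (the latter is the common LLaVA-style convention), not necessarily the exact code used for Table 4.

```python
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_name = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(model_name)
vision_encoder = CLIPVisionModel.from_pretrained(model_name)

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
outputs = vision_encoder(**inputs, output_hidden_states=True)

# Patch-level features (CLS token dropped); in LLaVA-style setups these are
# passed through a projector into the language model.
patch_features = outputs.hidden_states[-2][:, 1:, :]
print(patch_features.shape)  # (1, num_patches, hidden_dim)
```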
Thank you for the clarification; it is now much clearer :)
Hey, thank you for your work and for open-sourcing the code!
I did not quite get the experimental setup used for Table 4: which pre-trained base model did you use with CLIP-ViT, what tuning data did you use for multimodal understanding, and in particular, what is the difference between "unified pre-train" and the non-unified setting?
Could you please elaborate on those?
Thanks,