Question about SHOW-O's CLIP version

showlab / Show-o

Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.

https://arxiv.org/abs/2408.12528

Apache License 2.0

1.04k stars 44 forks

Open hills-code opened 2 months ago

hills-code commented 2 months ago

Thanks for your great work!

[Image: screenshot of the table referred to below]

I would like to know whether the show-o+ model in this table can generate images, or whether it only serves understanding tasks.

Sierkinhane commented 1 month ago

Hi, CLIP features are only used for understanding.

hills-code commented 1 month ago

So does this mean the model cannot generate images and can only do understanding tasks?

Sierkinhane commented 1 month ago

This model can also generate images. Generation does not need any image inputs and just uses [mask] tokens as input.
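
To make the last answer concrete, here is a minimal toy sketch of generation that starts from nothing but [mask] tokens. It is not Show-o's actual code: `MASK`, `toy_predictor`, `generate_from_masks`, and the cosine unmasking schedule are all assumptions made for illustration (a MaskGIT-style confidence schedule is assumed), and the predictor is a random stand-in for the transformer.

```python
import math
import random

MASK = -1  # hypothetical id standing in for the [mask] token


def toy_predictor(tokens, vocab_size, rng):
    """Random stand-in for the transformer (assumption, not Show-o's model).

    Returns one (predicted_token, confidence) pair per position.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def generate_from_masks(num_tokens=16, vocab_size=8, steps=4, seed=0):
    """Fill an all-[mask] sequence over several steps; no image input is used."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # the only input: [mask] tokens

    for step in range(1, steps + 1):
        preds = toy_predictor(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]

        # Cosine schedule (assumed): how many positions stay masked after this step.
        keep_masked = int(num_tokens * math.cos(math.pi / 2 * step / steps))
        num_to_fill = max(0, len(masked) - keep_masked)

        # Commit the most confident predictions among the still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]

    return tokens


if __name__ == "__main__":
    print(generate_from_masks())
```

In a real system the resulting discrete tokens would be passed to an image tokenizer's decoder to obtain pixels. The point of the sketch is only that the loop never reads an input image, which is why CLIP features being understanding-only does not prevent generation.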