Does Show-o directly complete generation in pixel space?

showlab / Show-o

Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.

https://arxiv.org/abs/2408.12528

Apache License 2.0


Open Delicious-Bitter-Melon opened 2 months ago

Delicious-Bitter-Melon commented 2 months ago

Thanks for your excellent work.

Does Show-o generate images directly in pixel space, or does it generate in a latent space through a VAE?

Sierkinhane commented 2 months ago

Hi, generation operates in the discrete token space, using MAGVIT-v2 as the tokenizer.

Delicious-Bitter-Melon commented 2 months ago

> Hi, the generation is operated on the discrete token space through MAGVIT-v2.

Thanks for your reply. Does MAGVIT-v2 tokenize directly from pixel space?

Sierkinhane commented 2 months ago

Exactly.
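
For readers following the thread, here is a minimal toy sketch of what "tokenizing directly from pixel space" and "generating in discrete token space" can look like. It is not the MAGVIT-v2 architecture and not code from the Show-o repository: the codebook, the patch size, and the `encode`/`decode` helpers are all hypothetical, and a learned tokenizer would replace the hand-written nearest-value rule used here.

```python
# Toy illustration only: NOT the MAGVIT-v2 architecture or the Show-o code.
# CODEBOOK, PATCH, encode and decode are hypothetical names for this sketch.

# Hypothetical codebook: token id -> representative grayscale intensity.
CODEBOOK = [0, 85, 170, 255]
PATCH = 2  # hypothetical patch side length, in pixels


def encode(image):
    """Map a 2D grid of grayscale pixels to a flat list of discrete token ids."""
    tokens = []
    for r in range(0, len(image), PATCH):
        for c in range(0, len(image[0]), PATCH):
            patch = [image[r + i][c + j] for i in range(PATCH) for j in range(PATCH)]
            mean = sum(patch) / len(patch)
            # The nearest codebook entry becomes the token for this patch.
            tokens.append(min(range(len(CODEBOOK)), key=lambda k: abs(CODEBOOK[k] - mean)))
    return tokens


def decode(tokens, height, width):
    """Map token ids back to pixels by filling each patch with its codebook value."""
    image = [[0] * width for _ in range(height)]
    cols = width // PATCH
    for idx, tok in enumerate(tokens):
        r0, c0 = (idx // cols) * PATCH, (idx % cols) * PATCH
        for i in range(PATCH):
            for j in range(PATCH):
                image[r0 + i][c0 + j] = CODEBOOK[tok]
    return image


image = [
    [0, 10, 250, 255],
    [5, 0, 240, 255],
    [90, 80, 160, 170],
    [85, 85, 180, 170],
]
tokens = encode(image)                 # pixels -> discrete tokens
reconstructed = decode(tokens, 4, 4)   # discrete tokens -> pixels
print(tokens)
```

In this picture, a generative model would predict the token ids rather than pixel values, and the decoder would map the predicted ids back to an image.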