Celia0u0 closed this issue 1 year ago
I am currently attempting to replicate the super-resolution results, and the paper says that HDVILA is used to extract multimodal text information. Would you happen to have any further details that could clarify this part?
Hi @Celia0u0, you can find more details in Section D.3 (Text-to-Visual Generation) in the supplementary material of this paper: https://arxiv.org/abs/2111.10337. As shown in Figure 9 of the supplementary, HDVILA is used to encode the input image and text. The resulting visual embedding and text embedding are then used to edit latent vectors based on StyleGAN.
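To make the flow concrete, here is a minimal sketch of the pipeline described above (encode image and text, then use the embeddings to shift a StyleGAN latent). All encoders, dimensions, and the mapping layer are hypothetical stand-ins, not the actual HDVILA or StyleGAN implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (assumptions, not from the paper).
EMB_DIM = 512   # visual/text embedding dimensionality
W_DIM = 512     # StyleGAN latent (w) dimensionality

def encode_image(image: np.ndarray) -> np.ndarray:
    """Placeholder for an HDVILA-style image encoder: random projection of pixels."""
    flat = image.reshape(-1)
    proj = rng.standard_normal((EMB_DIM, flat.size)) / np.sqrt(flat.size)
    return proj @ flat

def encode_text(tokens: list) -> np.ndarray:
    """Placeholder for an HDVILA-style text encoder: mean of token embeddings."""
    table = rng.standard_normal((1000, EMB_DIM))
    return table[np.asarray(tokens) % 1000].mean(axis=0)

def edit_latent(w: np.ndarray, vis: np.ndarray, txt: np.ndarray) -> np.ndarray:
    """Shift the StyleGAN latent w by a learned (here: random) map of the fused embeddings."""
    fused = np.concatenate([vis, txt])                      # (2 * EMB_DIM,)
    mapper = rng.standard_normal((W_DIM, fused.size)) / np.sqrt(fused.size)
    return w + mapper @ fused                               # edited latent

image = rng.random((64, 64, 3))
w = rng.standard_normal(W_DIM)
w_edited = edit_latent(w, encode_image(image), encode_text([12, 7, 431]))
print(w_edited.shape)
```

In the real system, `encode_image`/`encode_text` would be the pretrained HDVILA encoders and `mapper` a trained editing module; the sketch only shows how the two embeddings feed into a latent edit.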
Which model are you referring to?