microsoft / XPretrain

Multi-modality pre-training

How to use HD-VILA as multimodal TextEncoder? #12

Closed Celia0u0 closed 1 year ago

bei21 commented 1 year ago

Which model do you refer to?

Celia0u0 commented 1 year ago

I am currently attempting to replicate the super-resolution results, and I found that the paper says HD-VILA is used to extract multimodal text information. Would you happen to have any further details that could clarify this part?

zengyh1900 commented 1 year ago

hi @Celia0u0 , you can check more details in Section D.3 "Text-to-Visual Generation" in the supplementary material of this paper: https://arxiv.org/abs/2111.10337. As shown in Figure 9 of the supplementary, HD-VILA is used to encode the input image and text. Once you have the visual embedding and the text embedding, they are used to edit latent vectors based on StyleGAN.
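To illustrate the flow described above, here is a minimal, purely hypothetical sketch. The real HD-VILA encoders and StyleGAN synthesis network are replaced by random numpy stand-ins (`W_img`, `W_txt`, and the embedding/latent vectors are all placeholders), so it only shows the shape of the computation: encode image and text, derive an edit direction, and shift the StyleGAN latent code.

```python
import numpy as np

# Hypothetical stand-ins: in practice the embeddings would come from the
# HD-VILA visual and text encoders, and the latent code w would feed a
# StyleGAN generator. Random projections keep this sketch runnable.
rng = np.random.default_rng(0)
D_EMB, D_LATENT = 512, 512

W_img = rng.standard_normal((D_EMB, D_LATENT)) * 0.01  # image embedding -> latent edit
W_txt = rng.standard_normal((D_EMB, D_LATENT)) * 0.01  # text embedding  -> latent edit

def edit_latent(w, img_emb, txt_emb, alpha=1.0):
    """Shift a latent vector w by a direction derived from the image and
    text embeddings (illustrative only, not the paper's exact method)."""
    delta = img_emb @ W_img + txt_emb @ W_txt
    return w + alpha * delta

w = rng.standard_normal(D_LATENT)     # original StyleGAN latent code (placeholder)
img_emb = rng.standard_normal(D_EMB)  # would come from HD-VILA's visual encoder
txt_emb = rng.standard_normal(D_EMB)  # would come from HD-VILA's text encoder

w_edited = edit_latent(w, img_emb, txt_emb, alpha=0.5)
print(w_edited.shape)
```

The edited latent `w_edited` would then be passed to the StyleGAN generator to produce the modified image; see the supplementary for the actual architecture.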