Closed sumin1125 closed 2 years ago
I am also looking for the answer to this question :D
The text encoder is a BiLSTM that has been jointly trained with the DAMSM loss [1], as in many GAN-based models. You can also try other text encoders (BERT, CLIP).
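For anyone curious what such an encoder looks like in code, here is a minimal PyTorch sketch of a DAMSM-style BiLSTM text encoder. The class name, dimensions, and forward interface are illustrative assumptions, not the repo's actual implementation; see the AttnGAN code for the real one. The encoder returns per-word features (for the word-level DAMSM matching) and a sentence feature (the concatenated final hidden states of both directions).

```python
import torch
import torch.nn as nn

class RNNTextEncoder(nn.Module):
    """Hypothetical sketch of a DAMSM-style BiLSTM text encoder."""

    def __init__(self, vocab_size=5000, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Each direction contributes hidden_dim // 2 features, so the
        # concatenated word features have size hidden_dim.
        self.rnn = nn.LSTM(emb_dim, hidden_dim // 2,
                           batch_first=True, bidirectional=True)

    def forward(self, captions):
        # captions: (batch, seq_len) integer token ids
        emb = self.embed(captions)
        words, (h, _) = self.rnn(emb)          # words: (batch, seq_len, hidden_dim)
        # Sentence feature: last hidden state of each direction, concatenated.
        sent = torch.cat((h[0], h[1]), dim=1)  # (batch, hidden_dim)
        return words, sent

enc = RNNTextEncoder()
tokens = torch.randint(0, 5000, (4, 12))
word_feats, sent_feat = enc(tokens)
print(word_feats.shape, sent_feat.shape)
# → torch.Size([4, 12, 128]) torch.Size([4, 128])
```

During DAMSM pretraining these word/sentence features are matched against image region/global features from a CNN, which is what ties the text embedding space to the image space before GAN training.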
[1] Xu T, Zhang P, Huang Q, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 1316-1324.
What did you predict during pretraining, and how was the loss computed?