KCGD opened this issue 3 years ago
I might not be completely right, as I'm still working to truly understand the inner workings, but to my understanding it uses two pre-trained models, BigGAN and CLIP. CLIP has been trained to associate text with images, and BigGAN is trained to generate images. Putting them together, the loop looks like:
text -> CLIP text encoding (the fixed target) -> BigGAN generates an image from a latent vector -> CLIP encodes that image and scores how well it matches the text -> gradient descent nudges the latent to raise the score, and the generate/score/nudge loop repeats until the image fits the text well.
Someone else could probably explain it better, but that's my understanding at an abstract level.
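Concretely, here's a minimal sketch of that loop, assuming the `pytorch-pretrained-biggan` and OpenAI `clip` packages. The prompt, step count, and learning rate are illustrative choices of mine, not this repo's actual code:

```python
# A rough sketch of CLIP-guided BigGAN generation.
import torch
import torch.nn.functional as F
import clip
from pytorch_pretrained_biggan import BigGAN, truncated_noise_sample

device = "cuda" if torch.cuda.is_available() else "cpu"

gan = BigGAN.from_pretrained("biggan-deep-256").to(device).eval()
clip_model, _ = clip.load("ViT-B/32", device=device, jit=False)
clip_model = clip_model.float().eval()  # fp32 so gradients flow cleanly

# Encode the prompt once; this embedding is the fixed target.
tokens = clip.tokenize(["a cozy cabin in a snowy forest"]).to(device)
with torch.no_grad():
    text_emb = F.normalize(clip_model.encode_text(tokens), dim=-1)

# The trainable parameters are the GAN's *inputs*: a latent noise
# vector and soft class logits. No network weights are updated.
noise = torch.tensor(truncated_noise_sample(truncation=0.4, batch_size=1),
                     device=device, requires_grad=True)
class_logits = torch.zeros(1, 1000, device=device, requires_grad=True)
opt = torch.optim.Adam([noise, class_logits], lr=0.05)

# CLIP's expected input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073],
                    device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711],
                   device=device).view(1, 3, 1, 1)

for step in range(200):
    opt.zero_grad()
    img = gan(noise, torch.softmax(class_logits, dim=-1), 0.4)  # in [-1, 1]
    # Resize to CLIP's 224x224 input and apply its normalization.
    img = F.interpolate((img + 1) / 2, size=224, mode="bilinear",
                        align_corners=False)
    img_emb = F.normalize(clip_model.encode_image((img - mean) / std), dim=-1)
    loss = -(img_emb * text_emb).sum()  # maximize cosine similarity
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step}: similarity {-loss.item():.3f}")
```

Note that neither model's weights change here; the optimizer only moves the latent input until CLIP says the picture matches the words, so there is no fixed ground-truth image being trained toward.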
Usually AIs are trained towards a tangible, absolute target output, but this seems to do the complete opposite. How does that work?