KCGD opened this issue 3 years ago
I might not be completely right, as I'm still working to truly understand the inner workings, but to my understanding it uses two pre-trained models, BigGAN and CLIP. CLIP has been trained to associate text with images, and BigGAN is trained to generate images. Putting them together, the loop looks like:
text -> CLIP text encoding (the fixed target) -> BigGAN generates an image from a latent vector -> CLIP encodes that image and scores how well it matches the text -> gradient descent nudges the latent to raise the score, and the generate/score/nudge loop repeats until the image fits the text well.
Someone else could probably explain it better, but that's my understanding at an abstract level.
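Concretely, here's a minimal sketch of that loop, assuming the `pytorch-pretrained-biggan` and OpenAI `clip` packages. The prompt, step count, and learning rate are illustrative choices of mine, not this repo's actual code:

```python
# A rough sketch of CLIP-guided BigGAN generation.
import torch
import torch.nn.functional as F
import clip
from pytorch_pretrained_biggan import BigGAN, truncated_noise_sample

device = "cuda" if torch.cuda.is_available() else "cpu"

gan = BigGAN.from_pretrained("biggan-deep-256").to(device).eval()
clip_model, _ = clip.load("ViT-B/32", device=device, jit=False)
clip_model = clip_model.float().eval()  # fp32 so gradients flow cleanly

# Encode the prompt once; this embedding is the fixed target.
tokens = clip.tokenize(["a cozy cabin in a snowy forest"]).to(device)
with torch.no_grad():
    text_emb = F.normalize(clip_model.encode_text(tokens), dim=-1)

# The trainable parameters are the GAN's *inputs*: a latent noise
# vector and soft class logits. No network weights are updated.
noise = torch.tensor(truncated_noise_sample(truncation=0.4, batch_size=1),
                     device=device, requires_grad=True)
class_logits = torch.zeros(1, 1000, device=device, requires_grad=True)
opt = torch.optim.Adam([noise, class_logits], lr=0.05)

# CLIP's expected input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073],
                    device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711],
                   device=device).view(1, 3, 1, 1)

for step in range(200):
    opt.zero_grad()
    img = gan(noise, torch.softmax(class_logits, dim=-1), 0.4)  # in [-1, 1]
    # Resize to CLIP's 224x224 input and apply its normalization.
    img = F.interpolate((img + 1) / 2, size=224, mode="bilinear",
                        align_corners=False)
    img_emb = F.normalize(clip_model.encode_image((img - mean) / std), dim=-1)
    loss = -(img_emb * text_emb).sum()  # maximize cosine similarity
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step}: similarity {-loss.item():.3f}")
```

Note that neither model's weights change here; the optimizer only moves the latent input until CLIP says the picture matches the words, so there is no fixed ground-truth image being trained toward.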
Usually AIs are trained towards a tangible, absolute target output, but this seems to do the complete opposite. How does that work?