lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch
MIT License

Why not using CLIP? #328

Closed: axel588 closed this issue 1 year ago

axel588 commented 1 year ago

I managed to train on 700,000 images. I trained Lafite, which doesn't give aesthetically great results, but it has a lot of creativity and it respects the prompt well, mostly because it uses CLIP. With Imagen, using t5-base with cond=6.0 when sampling, I mostly get either an image that is already in the dataset or an incoherent image (about 50% of the time). Would CLIP-guided text-to-image be a good feature to add?
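
For reference, a minimal sampling sketch with this repository, following the constructor arguments shown in its README and assuming the "cond=6.0" above refers to the `cond_scale` argument of `Imagen.sample` (the classifier-free guidance scale); check the exact signatures against the version you are running:

```python
import torch
from imagen_pytorch import Unet, Imagen

# a single small base unet, following the README's constructor arguments
unet = Unet(
    dim = 128,
    cond_dim = 512,
    dim_mults = (1, 2, 4),
    num_resnet_blocks = 2,
    layer_attns = (False, False, True),
    layer_cross_attns = (False, True, True)
)

# t5-base text conditioning, as described in the comment above
imagen = Imagen(
    unets = (unet,),
    text_encoder_name = 't5-base',
    image_sizes = (64,),
    timesteps = 1000,
    cond_drop_prob = 0.1   # conditioning dropout, needed for classifier-free guidance at sampling time
)

# ... training elided ...

# sampling with a high guidance scale, analogous to the reported cond = 6.0
images = imagen.sample(
    texts = ['a carrot that looks like a coin'],
    cond_scale = 6.0
)
```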

lucidrains commented 1 year ago

@axel588 I'm just being faithful to the paper. The paper's claim is that scaling T5 (a popular brand of transformers) is enough and that CLIP is not needed.

If you want something that involves CLIP, you can try training with this repository instead.

axel588 commented 1 year ago

@lucidrains Thanks for your answer, but integrating the option of using CLIP would be a big improvement. Text conditioning with T5 is quite weak for text-to-image: T5 is good at text-to-text generation but not at text-to-image, where it acts pretty much like a classifier and really lacks creativity. With CLIP I can ask for a carrot that looks like a coin and it draws me a circular carrot; with T5 it will just generate either a carrot or a coin, but it isn't capable of interpolating between the two.

DALL-E 2 doesn't give me great results on my dataset, but Imagen does!

lthilnklover commented 6 months ago

@axel588 Have you integrated a CLIP encoder with Imagen?
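
No answer was posted in the thread, but a rough sketch of what such an integration could look like: encode each caption with a CLIP text model to get per-token embeddings and pass them to Imagen as precomputed `text_embeds`, bypassing the built-in T5 encoder. This assumes `Imagen` accepts a `text_embed_dim` argument and that its forward and `sample` calls accept `text_embeds` directly (recent versions of imagen-pytorch appear to, but verify against yours); the CLIP checkpoint and hyperparameters below are illustrative placeholders.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel
from imagen_pytorch import Unet, Imagen

# hypothetical choice of CLIP checkpoint; its text hidden size (768) must match text_embed_dim below
clip_name = 'openai/clip-vit-large-patch14'
tokenizer = CLIPTokenizer.from_pretrained(clip_name)
text_encoder = CLIPTextModel.from_pretrained(clip_name).eval()

@torch.no_grad()
def clip_text_embeds(texts):
    # per-token embeddings (last hidden state), not the pooled vector,
    # since the unets cross-attend over a sequence of text tokens
    tokens = tokenizer(texts, padding = 'max_length', truncation = True, return_tensors = 'pt')
    return text_encoder(**tokens).last_hidden_state   # (batch, 77, 768)

unet = Unet(
    dim = 128,
    cond_dim = 512,
    dim_mults = (1, 2, 4),
    num_resnet_blocks = 2,
    layer_attns = (False, False, True),
    layer_cross_attns = (False, True, True)
)

imagen = Imagen(
    unets = (unet,),
    image_sizes = (64,),
    timesteps = 1000,
    text_embed_dim = 768   # tell Imagen to expect external text embeddings of this width
)

# training step: pass precomputed CLIP embeddings instead of raw texts
images = torch.randn(4, 3, 64, 64)
embeds = clip_text_embeds(['a carrot that looks like a coin'] * 4)
loss = imagen(images, text_embeds = embeds, unet_number = 1)
loss.backward()

# sampling with the same kind of embeddings
samples = imagen.sample(
    text_embeds = clip_text_embeds(['a carrot that looks like a coin']),
    cond_scale = 6.0
)
```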