@axel588 I'm just being faithful to the paper. The paper's claim is that scaling T5 (a popular family of text-to-text transformers) is enough and that CLIP is not needed.
If you want something that involves CLIP, you can try training with this repository instead.
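Roughly, the default path here looks like the sketch below (double-check the README for the exact, current arguments): the text encoder is just a frozen T5 selected by name, so "scaling T5" means pointing `text_encoder_name` at a larger checkpoint.

```python
# Minimal sketch of the default T5-conditioned setup in imagen-pytorch.
# Argument names follow the README and may lag behind the current code.
from imagen_pytorch import Unet, Imagen

unet = Unet(dim = 32, cond_dim = 512, dim_mults = (1, 2, 4, 8))

imagen = Imagen(
    unets = (unet,),
    text_encoder_name = 'google/t5-v1_1-base',  # scale up by swapping in a larger T5
    image_sizes = (64,),
    timesteps = 100,               # kept small for the sketch
    cond_drop_prob = 0.1           # enables classifier-free guidance at sampling time
)

# after training, sample with a guidance scale
images = imagen.sample(texts = ['a carrot that looks like a coin'], cond_scale = 3.)
```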
@lucidrains Thanks for your answer, but adding the option to use CLIP would be a big improvement. Text conditioning with T5 works poorly for text-to-image: T5 is good at text-to-text generation, but for text-to-image it pretty much acts like a classifier and really lacks creativity. With CLIP I can ask for a carrot that looks like a coin and it draws me a circular carrot; with T5 it will just generate either a carrot or a coin, but it isn't capable of interpolating between the two.
DALL-E 2 doesn't give me great results on my dataset, but Imagen does!
@axel588 Have you integrated a CLIP encoder with Imagen?
I managed to train on 700,000 images. I trained Lafite, which doesn't give aesthetically great results, but it has a lot of creativity and it respects the prompt well, mostly because it uses CLIP. On Imagen, using t5-base with cond_scale = 6.0 at sampling time mostly gives either an image that is essentially already in the dataset or an incoherent image (about 50% of the time). Could CLIP-guided text-to-image be a good feature to add?
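For what it's worth, since imagen-pytorch accepts precomputed `text_embeds` (per its README), a CLIP text tower can in principle be plugged in without changes to the repo itself. A rough sketch of the idea follows; argument names like `text_embed_dim`, `text_masks` and `cond_scale` are taken from my reading of the README and should be verified against the current code.

```python
# Sketch: condition imagen-pytorch on CLIP text-token embeddings instead of T5.
# Not an officially supported path; assumes text_embed_dim / text_embeds /
# text_masks / cond_scale match the current imagen-pytorch API.
import torch
from transformers import CLIPTokenizer, CLIPTextModel
from imagen_pytorch import Unet, Imagen

# CLIP ViT-L/14 text tower: per-token hidden states of size 768
tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14')
clip_text = CLIPTextModel.from_pretrained('openai/clip-vit-large-patch14').eval()

@torch.no_grad()
def clip_text_embeds(prompts):
    tokens = tokenizer(prompts, padding = True, truncation = True, return_tensors = 'pt')
    out = clip_text(**tokens)
    return out.last_hidden_state, tokens.attention_mask.bool()  # (b, n, 768), (b, n)

unet = Unet(
    dim = 32,
    text_embed_dim = 768,   # match CLIP's text hidden size instead of T5's
    cond_dim = 512,
    dim_mults = (1, 2, 4, 8)
)

imagen = Imagen(
    unets = (unet,),
    text_embed_dim = 768,   # bypass the built-in T5 encoder
    image_sizes = (64,),
    timesteps = 100,
    cond_drop_prob = 0.1
)

# training step: pass the precomputed CLIP embeddings instead of raw texts
images = torch.randn(4, 3, 64, 64)
embeds, masks = clip_text_embeds(['a carrot that looks like a coin'] * 4)
loss = imagen(images, text_embeds = embeds, text_masks = masks, unet_number = 1)
loss.backward()

# sampling with classifier-free guidance, e.g. the cond_scale = 6.0 mentioned above
samples = imagen.sample(text_embeds = embeds, text_masks = masks, cond_scale = 6.0)
```

This only swaps the text encoder; it is not the same as full CLIP guidance at sampling time, but it would test whether CLIP's text space conditions the diffusion model better than T5's on this dataset.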