lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

DALLE trained on FashionGen Dataset RESULTS 💯 #443

Open · alexriedel1 opened this issue 2 years ago

alexriedel1 commented 2 years ago

DALLE on FashionGen

Text to image generation and re-ranking by CLIP

Best 16 of 48 generations ranked by CLIP

Generations from the training set (including their Groundtruths)

[five sample images]

Generations based on custom prompts (without their Groundtruths)

[five sample images]
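For reference, a minimal sketch of the generate-and-rerank step described above, using this repo's `DALLE.generate_images` together with OpenAI's CLIP. The `dalle` instance, the `prompt` string, and the device handling are assumptions for illustration, not part of the original post.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import torchvision.transforms.functional as TF
from dalle_pytorch.tokenizer import tokenizer

# Assumes `dalle` is a trained DALLE instance and `prompt` is a caption string.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dalle = dalle.to(device).eval()
clip_model, clip_preprocess = clip.load('ViT-B/32', device=device)

# Generate 48 candidate images for the same prompt.
text_tokens = tokenizer.tokenize([prompt] * 48, dalle.text_seq_len).to(device)
images = dalle.generate_images(text_tokens)  # (48, 3, H, W), values in [0, 1]

# Score every candidate against the prompt with CLIP and keep the best 16.
clip_input = torch.stack([
    clip_preprocess(TF.to_pil_image(img.cpu())) for img in images
]).to(device)
with torch.no_grad():
    logits_per_image, _ = clip_model(clip_input, clip.tokenize([prompt]).to(device))
best_16 = images[logits_per_image.squeeze(-1).topk(16).indices]
```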

Model specifications

VAE
Trained VQGAN for 1 epoch on the FashionGen dataset
Embeddings: 1024
Batch size: 5

DALLE
Trained DALLE for 1 epoch on the FashionGen dataset
dim = 312
text_seq_len = 80
depth = 36
heads = 12
dim_head = 64
reversible = 0
attn_types = ('full', 'axial_row', 'axial_col', 'conv_like')

Optimization
Optimizer: Adam
Learning rate: 4.5e-4
Gradient clipping: 0.5
Batch size: 7
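The configuration above maps onto DALLE-pytorch roughly as follows. This is a sketch, not the author's actual training script: the VQGAN paths are placeholders (the `VQGanVAE` loading arguments vary by library version), and `num_text_tokens = 49408` is the default BPE vocab size discussed later in the thread.

```python
import torch
from dalle_pytorch import DALLE, VQGanVAE

# Placeholder paths for the VQGAN trained for 1 epoch on FashionGen
# (codebook/embedding size 1024).
vae = VQGanVAE(vqgan_model_path='vqgan.ckpt', vqgan_config_path='vqgan.yaml')

dalle = DALLE(
    vae = vae,
    num_text_tokens = 49408,  # default tokenizer vocab size
    dim = 312,
    text_seq_len = 80,        # the shared checkpoint actually uses 120, see below
    depth = 36,
    heads = 12,
    dim_head = 64,
    reversible = False,
    attn_types = ('full', 'axial_row', 'axial_col', 'conv_like'),
)

opt = torch.optim.Adam(dalle.parameters(), lr = 4.5e-4)
# During training, gradients are clipped each step:
# torch.nn.utils.clip_grad_norm_(dalle.parameters(), 0.5)
```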


Asthestarsfalll commented 1 year ago

Hi, can you share the Colab link and checkpoints?

alexriedel1 commented 1 year ago

> Hi, can you share the Colab link and checkpoints?

You'll find the trained DALL-E weights here: https://drive.google.com/uc?id=1kEHTTZH2YbbHZjY6fTWuPb5_D-7nQ866
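A sketch of loading these weights, assuming the checkpoint follows the save format of this repo's train_dalle.py (a dict with 'hparams', 'vae_params', and 'weights' keys); the local filename is a placeholder for the Drive download.

```python
import torch
from dalle_pytorch import DALLE, VQGanVAE

ckpt = torch.load('dalle_fashiongen.pt', map_location='cpu')
print(ckpt['hparams'])  # inspect the stored config (e.g. text_seq_len)

# Rebuild the model from the stored hyperparameters, then load the weights.
dalle = DALLE(vae=VQGanVAE(), **ckpt['hparams'])
dalle.load_state_dict(ckpt['weights'])
```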

Asthestarsfalll commented 1 year ago

@alexriedel1 Thank you! I'm also wondering which vocab you used; I only have the bpe_simple_vocab_16e6 supplied by OpenAI.

Asthestarsfalll commented 1 year ago

I downloaded the weights, but it seems that its parameters are different.

alexriedel1 commented 1 year ago

Yes, that's right: the text sequence length is 120. Is this a problem for you?

Asthestarsfalll commented 1 year ago

No, it's just different from the model description above. I'm wondering which BPE file you used, and why num_text_tokens is so large.

alexriedel1 commented 1 year ago

I also used the default tokenizer in this project, which uses the bpe_simple_vocab_16e6 byte pair encoder: https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/tokenizer.py. It has a vocabulary size of 49408 by default.

I increased the text sequence length to 120 because the FashionGen dataset uses quite long text descriptions for its images.
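To illustrate the defaults being discussed, a short check of the bundled tokenizer; the vocab size and sequence length match the values in this thread, while the caption string is a made-up placeholder.

```python
from dalle_pytorch.tokenizer import tokenizer  # default SimpleTokenizer with bpe_simple_vocab_16e6

print(tokenizer.vocab_size)  # 49408, the default num_text_tokens

caption = 'a long FashionGen-style product description ...'  # placeholder text
tokens = tokenizer.tokenize([caption], context_length=120, truncate_text=True)
print(tokens.shape)  # torch.Size([1, 120])
```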

Asthestarsfalll commented 1 year ago

Thanks a lot!

killah-t-cell commented 1 month ago

Hi, do you still have access to the FashionGen dataset? I can't seem to find a good link for it.