lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License

What value should the loss of DiffusionPrior go down to? #78

Closed chinoll closed 2 years ago

chinoll commented 2 years ago

When I train DiffusionPrior, the loss decreases to 0.37 and then stops decreasing. In which range will the loss satisfy the requirement?

nousr commented 2 years ago

I can't speak to any specific value of loss that would "satisfy the requirement", but I can say that I've seen values around ~0.15 after many hours of training with L2 loss

rom1504 commented 2 years ago

https://wandb.ai/laion/diffusion-prior/runs/1blxu24j?workspace=user-rom1504 — this is a run with the latest version of the code. Trained for 500M samples. Validation loss reaching about 0.3, train loss 0.17.

However, you should probably check the other metrics rather than the loss (like cosine similarity between the text and predicted image embeddings, reaching 0.26 here).

We don't yet know what the best metric to evaluate on is. The best ideas for now are retrieval metrics and CLIP-guided generation (check #29 to learn more).
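For reference, the cosine-similarity metric mentioned above can be sketched in plain Python (a hypothetical helper for illustration, not the repo's actual evaluation code):

```python
import math

def cosine_similarity(a, b):
    # a, b: embedding vectors as plain lists of floats
    # (hypothetical helper for illustration, not the repo's eval code)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# identical directions score 1.0; orthogonal directions score 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # → 1.0
```

A value around 0.26 between predicted image embeddings and text embeddings is in the same ballpark CLIP itself produces for matching text/image pairs, which is why it is a useful sanity check alongside the raw loss.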

lucidrains commented 2 years ago

@nousr the current diffusion prior training runs, are they with text encoding + mask or without?

nousr commented 2 years ago

> @nousr the current diffusion prior training runs, are they with text encoding + mask or without?

@lucidrains I took the weekend off to hangout with family 😄 just getting back into the swing of things today

@rom1504 @krish240574 can you confirm are we still doing embedding only? I see condition_on_text_encodings=false in the wandb linked above.

rom1504 commented 2 years ago

Yeah indeed still embedding only.

We could try text + embedding, but I need to do a bit of work in embedding-reader to support that well (the idea is to read the npy files where the embeddings are and the parquet files where the texts are at the same time, which is the npy_parquet format of embedding-reader, but it needs a bit of work to be performant).

rom1504 commented 2 years ago

https://github.com/rom1504/embedding-reader/pull/24/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R55 interface looks like that if you want to try it sooner rather than later

The performance improvement won't change the API

lucidrains commented 2 years ago

> @nousr the current diffusion prior training runs, are they with text encoding + mask or without?

> @lucidrains I took the weekend off to hangout with family 😄 just getting back into the swing of things today

> @rom1504 @krish240574 can you confirm are we still doing embedding only? I see condition_on_text_encodings=false in the wandb linked above.

yeah same! (plus sending out some resumes to companies around the area :laughing:)

ah ok, it looks like text encodings aren't present yet, but we could always just train it slowly with CLIP passed into the DiffusionPrior instance
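For context, conditioning the prior on text encodings is controlled at construction time. A configuration sketch along the lines of the repo's README follows; exact argument names and defaults may differ across versions of DALLE2-pytorch, so treat this as illustrative:

```python
import torch
from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior, OpenAIClipAdapter

# CLIP is passed into the prior so text encodings can be derived on the fly
clip = OpenAIClipAdapter()

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
)

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2,
    condition_on_text_encodings = True  # the flag discussed in this thread
)
```

With `condition_on_text_encodings = False` (as in the wandb run linked above), the prior trains on CLIP text embeddings only, which is faster but leaves out the per-token conditioning described in the paper.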

lucidrains commented 2 years ago

from what i gathered in the paper, the text encodings helped for the diffusion prior, but not for the decoder (but it wouldn't hurt to have them present for both)

lucidrains commented 2 years ago

> Yeah indeed still embedding only.
>
> We could try text + embedding, but I need to do a bit of work in embedding-reader to support that well (the idea is to read the npy files where the embeddings are and the parquet files where the texts are at the same time, which is the npy_parquet format of embedding-reader, but it needs a bit of work to be performant).

yeah, it's tricky because the text encodings also need an associated boolean mask (variable-length encodings) :cry:
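The padding + mask construction being discussed could look something like this (a hypothetical helper in plain Python for illustration; the actual training code would build tensors instead of lists):

```python
def pad_with_mask(token_seqs, pad_id=0):
    # token_seqs: list of variable-length token-id lists (hypothetical input)
    # returns sequences padded to a common length, plus the boolean mask
    # marking which positions hold real tokens vs. padding
    max_len = max(len(seq) for seq in token_seqs)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in token_seqs]
    mask = [[True] * len(seq) + [False] * (max_len - len(seq)) for seq in token_seqs]
    return padded, mask

tokens, mask = pad_with_mask([[5, 3], [7, 1, 9]])
print(tokens)  # → [[5, 3, 0], [7, 1, 9]]
print(mask)    # → [[True, True, False], [True, True, True]]
```

The mask is what lets attention layers ignore the padded positions, which is why storing raw embeddings alone isn't enough once variable-length text encodings enter the picture.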

lucidrains commented 2 years ago

@rom1504 i can always start working on some memmapped solution on my end

rom1504 commented 2 years ago

Well, my current implementation in the reader is slow, but I mean slow as in 100k samples/s, whereas it should be 10M samples/s.

You could adapt the prior training script to add the option now
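Adding such an option to the training script could be as simple as an opt-in flag; the flag name below is hypothetical, and the real script may spell it differently:

```python
import argparse

parser = argparse.ArgumentParser()
# hypothetical flag, off by default so embedding-only training stays the fast path
parser.add_argument(
    "--condition-on-text-encodings",
    action="store_true",
    help="also feed text encodings + mask to the prior instead of embeddings only",
)

args = parser.parse_args([])  # parse an empty arg list for this sketch
print(args.condition_on_text_encodings)  # → False
```

Keeping the flag off by default matches the recommendation later in this thread to preserve the embedding-only path.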

lucidrains commented 2 years ago

@rom1504 ohh got it! yup, plan on adding to the current prior training script for sure

rom1504 commented 2 years ago

https://github.com/rom1504/embedding-reader/pull/24/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R55 yeah, so you just have to use that in the training script (that PR doesn't need to be merged to use it; it's just docs)

rom1504 commented 2 years ago

However, I'd recommend keeping the option to not use the text, as I figure training with the text will be much slower.

lucidrains commented 2 years ago

@rom1504 yeah, we should definitely keep the option, but we should probably strive for text encodings to be included in diffusion prior training. it seems necessary from the paper (and plus Katherine has it)

lucidrains commented 2 years ago

@chinoll anyways, to answer your question, join the Laion discord!