lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch

Question about diffusion prior #265

Closed: xiaotingxuan closed this issue 1 year ago

xiaotingxuan commented 2 years ago

Hi, in my project I only train the diffusion prior network.

I use `train_prior_config.example.json`, which is provided here; the only hyperparameter I change is `"use_ema"`, which I set to `false`. I train on the MS-COCO dataset (Karpathy split). During training, the loss type is MSE loss, and the training loss goes down to about 0.2.
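For reference, the flag change is the only edit I make to the example config (a minimal sketch; the surrounding structure follows my copy of `train_prior_config.example.json` and may differ from yours):

```json
{
  "train": {
    "use_ema": false
  }
}
```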

For inference, I use `diffusion_prior.sample(tokenized_text, n_samples_per_batch=2, cond_scale=1.0)` to generate a CLIP image embedding.

I think the generated CLIP image embedding should be similar to the ground-truth CLIP image embedding. I find that the cosine similarity between them is about 0.7. Is this normal?
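This is roughly how I compute that number (a minimal sketch: the image path, caption, and CLIP variant are placeholders, `diffusion_prior` is my trained `DiffusionPrior` instance, and the check assumes the prior was trained on embeddings from the same CLIP variant):

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Ground-truth CLIP image embedding for the paired image.
clip_model, preprocess = clip.load("ViT-L/14", device=device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    true_image_embed = clip_model.encode_image(image).float()

# Predicted image embedding sampled from the trained prior
# (diffusion_prior is the DiffusionPrior instance loaded from my checkpoint).
tokenized_text = clip.tokenize(["a photo of a dog"]).to(device)
predicted_image_embed = diffusion_prior.sample(
    tokenized_text, n_samples_per_batch=2, cond_scale=1.0
)

# Cosine similarity between prediction and ground truth; I see ~0.7 here.
sim = F.cosine_similarity(predicted_image_embed.float(), true_image_embed, dim=-1)
print(sim.item())
```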

Lixin-Liu commented 2 years ago

Hi, I have a question. How long did it take you to train the model?

xiaotingxuan commented 2 years ago

About 70 minutes for one epoch, using one NVIDIA TITAN RTX.

Lixin-Liu commented 2 years ago

Thank you!

vae1207 commented 2 years ago

How do you process the dataset? I don't know why my loss value won't go down; it looks strange.

cxhermagic commented 2 years ago

I want to train the diffusion prior model. How do I get the `PriorEmbeddingDataset`? Thank you.

xiaotingxuan commented 2 years ago

If you mean getting the CLIP text embeddings and CLIP image embeddings for training the prior, I think you can read ClipCap's code for reference: extract the CLIP features first, then create your `PriorEmbeddingDataset`.
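In case it helps, here is a minimal sketch of that feature-extraction step (using OpenAI's `clip` package; `EmbeddingPairs` is a hypothetical stand-in, not the repo's actual `PriorEmbeddingDataset` interface, which I believe reads precomputed embedding files instead):

```python
import clip  # OpenAI CLIP
import torch
from PIL import Image
from torch.utils.data import Dataset

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-L/14", device=device)

@torch.no_grad()
def embed_pair(image_path, caption):
    # Precompute one (text, image) embedding pair with frozen CLIP,
    # in the same spirit as ClipCap's feature-extraction script.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([caption]).to(device)
    text_embed = clip_model.encode_text(tokens).squeeze(0).float().cpu()
    image_embed = clip_model.encode_image(image).squeeze(0).float().cpu()
    return text_embed, image_embed

class EmbeddingPairs(Dataset):
    # Hypothetical stand-in for PriorEmbeddingDataset: yields precomputed
    # (text_embed, image_embed) pairs for training the diffusion prior.
    def __init__(self, pairs):  # pairs: list of (image_path, caption) tuples
        self.items = [embed_pair(path, caption) for path, caption in pairs]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]
```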