Closed xiankgx closed 2 years ago
Hello, do you have a trained model now? I would like to ask about the specific values of the Unet parameters. The results I am getting so far are not good.
I can provide you the params, but they probably won't make sense for you, because I'm working in latent space rather than image space. Instead of providing and predicting image pixels, I'm providing and predicting the latent feature map of the fixed autoencoder from Stable/Latent Diffusion.
But here it is anyway:

```
--image_encoder_name "vggface2" \
--num_image_tokens 8 \
--dim 320 \
--dim_mults "1,1,2,3" \
--cond_dim 512 \
--layer_attns "0,0,1,1" \
--layer_cross_attns "0,0,1,1" \
```
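These comma-separated flags are presumably parsed into tuples before being handed to the Unet config. A minimal sketch of such parsing (the helper names `int_tuple` and `bool_tuple` are mine, not from the repo):

```python
import argparse

def int_tuple(s):
    """Parse e.g. "1,1,2,3" into (1, 1, 2, 3)."""
    return tuple(int(x) for x in s.split(","))

def bool_tuple(s):
    """Parse e.g. "0,0,1,1" into (False, False, True, True)."""
    return tuple(bool(int(x)) for x in s.split(","))

parser = argparse.ArgumentParser()
parser.add_argument("--image_encoder_name", default="vggface2")
parser.add_argument("--num_image_tokens", type=int, default=8)
parser.add_argument("--dim", type=int, default=320)
parser.add_argument("--dim_mults", type=int_tuple, default=(1, 1, 2, 3))
parser.add_argument("--cond_dim", type=int, default=512)
parser.add_argument("--layer_attns", type=bool_tuple, default=(False, False, True, True))
parser.add_argument("--layer_cross_attns", type=bool_tuple, default=(False, False, True, True))

args = parser.parse_args(["--dim_mults", "1,1,2,3", "--layer_attns", "0,0,1,1"])
print(args.dim_mults)    # (1, 1, 2, 3)
print(args.layer_attns)  # (False, False, True, True)
```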
Thank you for your reply. I still have some questions: do you use only one Unet network? And what is the role of the parameter num_image_tokens? I only saw its definition in the file imagen_pytorch.py and did not see where it is used.
In the original implementation, the conditioning model is the T5 text encoder. In my model, I have modified it to take the embedding from a vggface2 feature extractor (facenet_pytorch), which outputs a 512-dim embedding for each image.
For example, if I have a batch size of 8, the size of this tensor would be (8, 512). However, the original model accepts the text condition in the form (batch_size, sequence_length, embedding_dim). Hence, I applied a linear layer to go from (8, 512) to (8, num_image_tokens(=8) * 512), and then reshaped to (8, 8, 512). This is then used as the text conditioning in imagen (text_embeds).
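The projection described above can be sketched shape-for-shape as follows (a numpy stand-in for the actual nn.Linear, with random weights; the variable names are mine):

```python
import numpy as np

batch_size, embed_dim, num_image_tokens = 8, 512, 8

# (8, 512): one vggface2 embedding per image in the batch (random stand-in)
image_embeds = np.random.randn(batch_size, embed_dim)

# Linear layer: 512 -> num_image_tokens * 512 (random stand-in for learned weights)
weight = np.random.randn(embed_dim, num_image_tokens * embed_dim) * 0.02
projected = image_embeds @ weight  # (8, 8 * 512) = (8, 4096)

# Reshape into (batch_size, sequence_length, embedding_dim), the shape
# imagen expects for text_embeds
text_embeds = projected.reshape(batch_size, num_image_tokens, embed_dim)
print(text_embeds.shape)  # (8, 8, 512)
```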
Hence, you need not concern yourself with this parameter.
Thanks
This is the code used for generating the images with the model.