tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images from an image prompt.
Apache License 2.0

Preserving human face likeness #5

Open blistick opened 1 year ago

blistick commented 1 year ago

First off, congratulations on this project and thank you so much for your work!

I'm testing your SDXL demo and am generally getting good results. However, my use case is really human "personalization", as with DreamBooth. I've tried your multimodal prompt with some of my images, and the likeness of the face is not quite what I'd like: the face shows variation where I wish it stayed truer to the original input.

Do you have any suggestions on settings or values that I could tweak to try and improve this? Or other ideas?

Again, thanks for everything!

haofanwang commented 1 year ago

I don't think it can. This is highly restricted by the CLIP embedding.

zhangjun001 commented 1 year ago

The point from @haofanwang is correct. The high-level CLIP image embedding may not extract enough facial detail for "personalization". Within that limitation, we can still enhance the face by increasing the proportion of the image that the face occupies. Even so, the capability is bounded by the input size (224×224) of the CLIP image encoder. As we discussed in our paper, "it can only generate images that resemble the reference images in content and style." DreamBooth and other fine-tuning techniques can encode the common attributes (e.g., the face) of the training images into the network. We'll also continue to look at how to improve person/object consistency.
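As a concrete illustration of "increasing the proportion of the face in the image", here is a minimal preprocessing sketch. The `face_box` argument and the margin factor are placeholders (assume they come from any face detector); nothing here is part of the IP-Adapter API.

```python
# Sketch: enlarge the face's share of the image prompt before CLIP encoding.
# face_box is assumed to come from any face detector; values are illustrative.
from PIL import Image

def crop_face_for_clip(image_path, face_box, margin=0.3):
    """Crop around (x1, y1, x2, y2) with extra margin, then resize to
    224x224, the input resolution of the CLIP image encoder."""
    img = Image.open(image_path).convert("RGB")
    x1, y1, x2, y2 = face_box
    w, h = x2 - x1, y2 - y1
    # Expand the box so some context around the face survives the crop.
    box = (max(0, int(x1 - margin * w)),
           max(0, int(y1 - margin * h)),
           min(img.width, int(x2 + margin * w)),
           min(img.height, int(y2 + margin * h)))
    return img.crop(box).resize((224, 224), Image.BICUBIC)

# Usage: pass the returned crop to the pipeline as the image prompt, e.g.
# face_img = crop_face_for_clip("portrait.jpg", face_box=(120, 80, 360, 400))
```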

blistick commented 1 year ago

@zhangjun001 Thank you for your reply, and again for your fine work. I will try making the subject's face as large as possible to see if I achieve a better likeness, and I also look forward to future improvements.

scarbain commented 1 year ago

> The point from @haofanwang is correct. The high-level CLIP image embedding may not extract enough facial detail for "personalization". Within that limitation, we can still enhance the face by increasing the proportion of the image that the face occupies. Even so, the capability is bounded by the input size (224×224) of the CLIP image encoder. As we discussed in our paper, "it can only generate images that resemble the reference images in content and style." DreamBooth and other fine-tuning techniques can encode the common attributes (e.g., the face) of the training images into the network. We'll also continue to look at how to improve person/object consistency.

@zhangjun001 What about using embeddings from a facial recognition model instead of CLIP embeddings? For example, this paper used embeddings from ArcFace: https://github.com/junshutang/3DFaceShop

I've tried this technique with ControlNet, removing the last two fully connected layers from VGGFace to get spatial facial feature maps. I modified the ControlNet tiny network and convolution layers to accommodate the new input shape, but without any success. The model doesn't seem to be learning anything, though that might be because the architecture I chose for the tiny network is poor (I'm a beginner at ML) and/or it hasn't trained enough. I'm currently at 25K steps, which isn't much, but in my previous ControlNet trainings I could always see some convergence well before that many steps.

Do you think it could work with your architecture?
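For reference, a minimal sketch of pulling ArcFace-style identity embeddings with the insightface package (standard usage of that library, nothing specific to this repo), which could serve as the conditioning signal discussed above:

```python
# Sketch: extract a 512-d ArcFace identity embedding with insightface,
# as a candidate replacement for (or complement to) the CLIP image embedding.
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")   # bundles a face detector + ArcFace recognizer
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread("portrait.jpg")       # insightface expects a BGR numpy array
faces = app.get(img)
if not faces:
    raise ValueError("no face detected")

id_embed = faces[0].normed_embedding   # shape (512,), L2-normalized identity vector
print(id_embed.shape)
```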

zhangjun001 commented 1 year ago

@scarbain I think face-specific embeddings should help with face generation. However, if the embeddings are not well aligned with the text space, it will take more resources to train (more training time) and to align them (more parameters may be needed before cross attention).
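To make the "more parameters before cross attention" point concrete, here is a hypothetical alignment head (names and sizes are my own illustration, not IP-Adapter's actual code) that projects a 512-d identity embedding into a few tokens of the text-embedding dimension before they enter cross attention:

```python
# Hypothetical sketch of an alignment head: project a 512-d identity embedding
# into N extra "tokens" of the cross-attention dimension. Sizes are illustrative.
import torch
import torch.nn as nn

class FaceProj(nn.Module):
    def __init__(self, embed_dim=512, cross_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(embed_dim, cross_dim * num_tokens)
        self.norm = nn.LayerNorm(cross_dim)

    def forward(self, face_embed):
        # (B, 512) -> (B, num_tokens, cross_dim), ready to attend alongside
        # the text tokens in the UNet's cross-attention layers.
        x = self.proj(face_embed)
        x = x.reshape(-1, self.num_tokens, x.shape[-1] // self.num_tokens)
        return self.norm(x)

tokens = FaceProj()(torch.randn(1, 512))
print(tokens.shape)  # torch.Size([1, 4, 768])
```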

jiqizaisikao commented 1 year ago

[images: WechatIMG54, WechatIMG55] This is a WeChat mini-program I built, 换影CP: from a single image it does one-click head swapping and face-shape changing. I'm hoping to find partners to research and develop it together. If you're interested, feel free to test it.

jiqizaisikao commented 1 year ago

These are results from a model I trained myself, similar to Adobe's Generative Fill. The latest WeChat mini-program, 换影CP, takes one photo and one template and does one-click head swapping and face-shape changing. However, training takes considerable compute, and I may not have enough resources to keep optimizing it. I'm hoping to find collaborators for joint research and development.

ykk648 commented 1 year ago

@jiqizaisikao What's the advantage over face swap?

jiqizaisikao commented 1 year ago

> @jiqizaisikao What's the advantage over face swap?

This is one-click head swapping and face-shape changing, not one-click swapping of facial features or the whole face; in some ways that's a big difference.

hkunzhe commented 1 year ago

> These are results from a model I trained myself, similar to Adobe's Generative Fill. The latest WeChat mini-program, 换影CP, takes one photo and one template and does one-click head swapping and face-shape changing. However, training takes considerable compute, and I may not have enough resources to keep optimizing it. I'm hoping to find collaborators for joint research and development.

Compared to 妙鸭 (Miaoya Camera), does this mean users don't need to upload multiple portrait photos? Does your "training" refer to a face LoRA?

jiqizaisikao commented 1 year ago

Only one image is needed. By "training" I mean synthesizing with a single image as the reference.

h3clikejava commented 1 year ago

Alibaba has open-sourced a free alternative.

kilimchoi commented 9 months ago

> [images: WechatIMG54, WechatIMG55] This is a WeChat mini-program I built, 换影CP: from a single image it does one-click head swapping and face-shape changing. I'm hoping to find partners to research and develop it together. If you're interested, feel free to test it.

Would you be able to share how you did it with a single reference image?