microsoft / DiscoFaceGAN

Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning (CVPR 2020 Oral)
MIT License

Demo code for expression transfer on real images. #6

Open zgxiangyang opened 4 years ago

zgxiangyang commented 4 years ago

Could you please provide demo code for transferring the expression of a real image to a generated image?

zgxiangyang commented 4 years ago

The extracted coefficients have shape (257,), while the output of z_to_lambda_mapping has shape (254,). What is the relationship between them?

YuDeng commented 4 years ago

During face image generation we eliminate the 3D world translation of a face (we assume it to be zero). So the input to our generator consists only of identity, expression, pose, and lighting coefficients, 254 dimensions in total, while the coefficients extracted by the 3D reconstruction network also contain translation (3 dimensions), giving a total of 257.
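For reference, here is a minimal sketch of how the two vectors relate, assuming the coefficient layout of the Deep3DFaceReconstruction network that R_Net is based on (the exact split and ordering are assumptions, not taken from this repo's code):

```python
import numpy as np

# Assumed layout of the 257-dim coefficient vector produced by the
# reconstruction network (R_Net), following Deep3DFaceReconstruction:
# id(80) + exp(64) + tex(80) + rotation(3) + lighting(27) + translation(3).
def split_coeff(coeff):
    assert coeff.shape == (257,)
    return {
        "id":    coeff[0:80],     # identity (face shape)
        "exp":   coeff[80:144],   # expression
        "tex":   coeff[144:224],  # texture (albedo)
        "rot":   coeff[224:227],  # pose / rotation angles
        "gamma": coeff[227:254],  # spherical-harmonics lighting
        "trans": coeff[254:257],  # 3D world translation, assumed zero by the generator
    }

coeff = np.random.randn(257)      # stand-in for an R_Net output
parts = split_coeff(coeff)

# Dropping translation leaves the 254 dims the generator works with.
# (The actual ordering inside z_to_lambda_mapping's output may differ.)
lambda_vec = np.concatenate([parts["id"], parts["exp"], parts["tex"],
                             parts["rot"], parts["gamma"]])
assert lambda_vec.shape == (254,)
```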

haibo-qiu commented 4 years ago

Hi YuDeng

Impressive work!

I am trying to transfer the non-identity factors (expression, illumination, and pose) of real images, but I get unexpected results that do not preserve the identity information. Specifically, my goal is to manipulate the source image with the expression, illumination, and pose of the reference image while keeping the identity unchanged. Here are my source and reference images:

[source and reference images]

My process is kind of simple and naive (a runnable sketch follows the list):

  1. Utilizing R_Net to extract the coefficients (257) of both the source and reference images, and discarding the last three elements (leaving 254).
  2. Combining the identity coefficients of the source image with the other three factors' coefficients of the reference image to form a new coefficient vector for face generation.
  3. Adding random noise to the above coefficients and utilizing truncate_generation to obtain the manipulated results.
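Here is a minimal sketch of these three steps. `extract_coeff` is a hypothetical wrapper around R_Net, `truncate_generation` stands in for the repo's generation function, and the identity/texture slice positions follow the layout assumed earlier; both helpers are stubbed so the sketch runs standalone:

```python
import numpy as np

# Hypothetical stand-ins for the actual entry points: in practice
# extract_coeff would run R_Net on an image, and truncate_generation
# would run the DiscoFaceGAN generator with truncation.
def extract_coeff(image_path):
    return np.random.randn(257)          # placeholder for R_Net's 257-dim output

def truncate_generation(lambda_vec):
    return np.zeros((256, 256, 3))       # placeholder for a generated image

# Step 1: extract coefficients and drop the 3 translation dims (257 -> 254).
src = extract_coeff("source.png")[:254]
ref = extract_coeff("reference.png")[:254]

# Step 2: keep the source identity, take everything else from the reference.
# Slices assume the layout above (id 0:80, exp 80:144, tex 144:224,
# rot 224:227, gamma 227:254); texture is kept from the source since it
# is identity-related as well.
mixed = ref.copy()
mixed[0:80] = src[0:80]
mixed[144:224] = src[144:224]

# Step 3: perturb with random noise and generate.
noise = np.random.randn(254) * 0.1       # noise scale is an arbitrary choice
result = truncate_generation(mixed + noise)
```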

Results with different noises:

[generated results]

It looks like the generated images capture the expression, illumination, and pose of the reference image, but they do not preserve the identity of the source image. Does the gap between the coefficients from R_Net and the coefficients learned by your network cause this problem?

Or do I have to do the manipulation in W+ space, as you mentioned in your paper? If so, how? I have noticed your answer here (https://github.com/microsoft/DisentangledFaceGAN/issues/9#issuecomment-658024542). Does it mean that if I want to manipulate an image, I first need to use backpropagation to obtain its latent code, and then vary that code to achieve the manipulation? If so, do you have any convenient method to accelerate this process?

YuDeng commented 3 years ago

@haibo-qiu Hi, sorry that I did not notice this issue until now. I hope my answer is not too late.

As you mentioned, there is a gap between the identity coefficients extracted by R_Net and the identity coefficients the generator receives during training. The generator is trained with coefficients sampled from a VAE, which does not faithfully capture the original distribution of real identity coefficients. As a result, if you extract the identity of a real image and give it to the generator, the generator will most likely produce an image with a different identity.

To alleviate this problem, a better way is to embed the real image into the W+ space and modify it there, which is what we did in our paper. However, the optimization takes several minutes per image, which is quite slow. Besides, embedding a real image into the W+ space is risky in that the W+ space does not guarantee the disentanglement between different semantic factors that the W and Z spaces do.
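For concreteness, here is a minimal sketch of optimization-based W+ embedding, written in PyTorch with a stub generator. Everything named here (the stub, the dimensions, the plain pixel L2 loss) is an assumption for illustration; the actual pipeline in the paper uses its own generator and loss terms:

```python
import torch

# Stub generator: stands in for a pretrained synthesis network mapping a
# W+ code of shape (1, num_layers, w_dim) to an image. Purely illustrative.
class StubGenerator(torch.nn.Module):
    def __init__(self, num_layers=14, w_dim=512, img_size=64):
        super().__init__()
        self.fc = torch.nn.Linear(num_layers * w_dim, 3 * img_size * img_size)
        self.img_size = img_size

    def forward(self, w_plus):
        x = self.fc(w_plus.flatten(1))
        return x.view(-1, 3, self.img_size, self.img_size)

def embed_in_wplus(target, generator, num_layers=14, w_dim=512,
                   steps=500, lr=0.01):
    """Optimize one latent per synthesis layer (the W+ space) to reconstruct `target`."""
    w_plus = torch.zeros(1, num_layers, w_dim, requires_grad=True)
    opt = torch.optim.Adam([w_plus], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Pixel L2 reconstruction loss; real pipelines typically add a perceptual term.
        loss = ((generator(w_plus) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return w_plus.detach()

G = StubGenerator()
target = torch.rand(1, 3, 64, 64)        # stand-in for a real face image
w = embed_in_wplus(target, G)
```

The per-image optimization loop is exactly why the embedding takes minutes per image, as noted above.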

Currently there are some methods that try to embed real images into a GAN's latent space in a way that faithfully reconstructs the input while still providing reasonable semantic editing, for example In-Domain GAN Inversion for Real Image Editing. I think these papers might help in your case.

haibo-qiu commented 2 years ago

Hi @YuDeng,

Thanks for your kind reply : )

When I realized the gap in the identity coefficients and the risk of the W+ space, I changed direction and explored using synthetically generated face images for face recognition.

With your DiscoFaceGAN model, I can control several factors, such as expression, illumination, and pose, when generating face images, and then study their impact on recognition performance. Besides, I also proposed identity mixup, which operates at the identity-coefficient level to alleviate the gap between models trained on natural and synthetic images. Our work, named SynFace, was accepted to ICCV 2021.
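For readers curious about the idea, here is a minimal sketch of identity mixup as a linear interpolation of two identity coefficient vectors. The interpolation form and the 80-dim coefficient size are assumptions here; see the SynFace paper for the actual formulation:

```python
import numpy as np

def identity_mixup(id_a, id_b, lam):
    """Blend two identity coefficient vectors with mixing ratio lam in [0, 1]."""
    return lam * id_a + (1.0 - lam) * id_b

# Identity coefficients for two subjects (80 dims assumed, as in the layout above).
id_a, id_b = np.random.randn(80), np.random.randn(80)
lam = np.random.uniform(0.0, 1.0)          # mixing ratio, e.g. sampled per batch
id_mix = identity_mixup(id_a, id_b, lam)   # used in place of a single identity when generating
```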

Many thanks for your work, which really inspired me :+1: