zgxiangyang opened this issue 4 years ago
The extracted coefficients have shape (257,), while the output of z_to_lambda_mapping has shape (254,). What is the relationship between them?
In face image generation we eliminate the 3D world translation of a face (we assume it to be zero). So the input to the generator consists only of identity, expression, pose, and lighting coefficients, 254 dimensions in total, while the coefficients extracted from the 3D reconstruction network also contain the translation (3 dimensions), which gives a shape of 257.
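For reference, this is how the two shapes relate numerically, assuming the usual Deep3DFaceReconstruction coefficient layout; the slice boundaries below are my own reading and should be checked against the repo's preprocessing code:

```python
import numpy as np

# A 257-d coefficient vector from the 3D reconstruction network (R_Net).
# The split below assumes the usual Deep3DFaceReconstruction layout;
# treat the exact boundaries as an assumption, not verified against this repo.
coef = np.random.randn(257).astype(np.float32)

parts = {
    "id_shape":    coef[0:80],     # identity (shape), 80-d
    "expression":  coef[80:144],   # expression, 64-d
    "texture":     coef[144:224],  # texture / albedo, 80-d
    "angles":      coef[224:227],  # pose (pitch, yaw, roll), 3-d
    "lighting":    coef[227:254],  # SH lighting (gamma), 27-d
    "translation": coef[254:257],  # 3D world translation, 3-d
}

# The generator never sees translation (it is assumed zero), so its lambda
# input has 257 - 3 = 254 dimensions:
# identity (shape + texture, 160) + expression (64) + lighting (27) + pose (3).
lambda_dim = sum(v.size for k, v in parts.items() if k != "translation")
assert lambda_dim == 254
```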
Hi YuDeng
Impressive work!
I am trying to transfer the non-identity factors (expression, illumination, and pose) of real images, but I get unexpected results that do not preserve the identity information. Specifically, my goal is to manipulate a source image with the expression, illumination, and pose of a reference image while keeping the identity unchanged. Here are my source and reference images.
My process is kind of simple and naive.
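Roughly, it looks like the sketch below. I am paraphrasing my pipeline here, and the helper calls `extract_coefficients` / `generate_image` are placeholders for the actual R_Net inference and DiscoFaceGAN generator calls, not real function names from this repo:

```python
import numpy as np

def swap_non_identity(source_coef, reference_coef):
    """Keep the source identity; take expression, pose, and lighting from the
    reference. Both inputs are 257-d R_Net coefficient vectors; the slice
    boundaries assume the usual Deep3DFaceReconstruction layout
    (80 id + 64 exp + 80 tex + 3 angles + 27 gamma + 3 translation)."""
    mixed = source_coef.copy()
    mixed[80:144]  = reference_coef[80:144]   # expression
    mixed[224:227] = reference_coef[224:227]  # pose angles
    mixed[227:254] = reference_coef[227:254]  # lighting (gamma)
    # Drop translation to get a 254-d lambda. The generator may expect a
    # different component ordering, so the vector is reordered to match its
    # preprocessing before being fed in.
    return mixed[:254]

# Placeholder calls (not real functions from this repo):
# source_coef    = extract_coefficients("source.png")     # (257,) via R_Net
# reference_coef = extract_coefficients("reference.png")  # (257,) via R_Net
# lam = swap_non_identity(source_coef, reference_coef)
# for seed in range(4):  # "results with different noises"
#     noise = np.random.RandomState(seed).randn(1, 512)
#     image = generate_image(noise, lam[None, :])  # feed lambda directly to the generator
```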
Results with different noises
It looks like the generated images capture the expression, illumination, and pose of the reference image, but they do not preserve the identity information of the source image. Does the gap between the coefficients extracted by R_Net and the coefficients your network was trained with cause this problem?
Or do I have to do the manipulation in W+ space, as you mentioned in your paper? If so, how? I have noticed your answer here (https://github.com/microsoft/DisentangledFaceGAN/issues/9#issuecomment-658024542). Does it mean that if I want to manipulate an image, I first need to obtain its latent code via backpropagation and then vary this code to achieve the manipulation? If so, do you have any convenient method to accelerate this process?
@haibo-qiu Hi, sorry that I did not notice this issue until now. I hope my answer is not too late.
As you mentioned, there is a gap between the identity coefficients extracted by R_Net and the identity coefficients the generator receives during training. The generator is trained with coefficients sampled from a VAE, which does not faithfully capture the original distribution of real identity coefficients. As a result, if you extract the identity of a real image and feed it to the generator, the generator will most likely produce an image with a different identity.
To alleviate this problem, a better way is to embed a real image into the W+ space and modify it there, which is what we did in our paper. However, the optimization takes several minutes per image, which is quite slow. Besides, embedding a real image into W+ space is risky in that W+ space does not guarantee disentanglement between different semantic factors as well as the W and Z spaces do.
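For reference, the embedding itself is just per-image gradient descent on the W+ code. Below is a minimal sketch of the idea (PyTorch-style for brevity, not our exact code); `generator_synthesis` and `perceptual_loss` are placeholders for a StyleGAN-style synthesis network and an LPIPS/VGG feature loss:

```python
import torch

def embed_in_w_plus(target_image, generator_synthesis, perceptual_loss,
                    num_layers=14, w_dim=512, steps=1000, lr=0.01):
    # One 512-d w per synthesis layer gives the W+ code; in practice it is
    # better to initialize from the average w rather than from zeros.
    w_plus = torch.zeros(1, num_layers, w_dim, requires_grad=True)
    opt = torch.optim.Adam([w_plus], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator_synthesis(w_plus)              # (1, 3, H, W)
        loss = (perceptual_loss(recon, target_image)
                + torch.nn.functional.mse_loss(recon, target_image))
        loss.backward()
        opt.step()
    return w_plus.detach()  # edit this code, then re-synthesize the image
```

The several-minutes cost per image comes from running this optimization loop from scratch for every input.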
Currently there are some methods that embed real images into a GAN's latent space so that the input is faithfully reconstructed while still allowing reasonable semantic editing, for example In-Domain GAN Inversion for Real Image Editing. I think these papers might help in your case.
Hi @YuDeng,
Thanks for your kind reply : )
After I realized the gap in the identity coefficients and the risk of the W+ space, I shifted my focus to exploring synthetically generated face images for face recognition.
With your DiscoFaceGAN model, I can control several factors such as expression, illumination, and pose when generating face images, and then study their impact on recognition performance. Besides, I also proposed identity mixup, which operates on the identity coefficients to alleviate the gap between models trained with natural and synthetic images. Our work is named SynFace and was accepted at ICCV 2021.
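In case it is useful to others, identity mixup is essentially a convex combination of two identity coefficient vectors before they are fed to the generator. A minimal sketch of the idea (training details such as label handling and how the mixing coefficient is sampled are omitted):

```python
import numpy as np

def identity_mixup(id_coef_a, id_coef_b, phi=None):
    """Convex combination of two identity coefficient vectors.

    Sketch of the idea only. Inputs are assumed to be the 160-d identity
    part (shape + texture) of the DiscoFaceGAN lambda.
    """
    if phi is None:
        phi = np.random.uniform(0.0, 1.0)
    return phi * id_coef_a + (1.0 - phi) * id_coef_b

# Usage: mix two sampled identities, keep the other factors fixed, and feed
# the resulting lambda to the generator to obtain an "in-between" identity.
```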
Many thanks for your work, which really inspired me :+1:
Could you please provide demo code for transferring the expression of a real image to a generated image?