johannwyh / HifiFace

This is the official project website of HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping

[Recommended] Questions on Re-implementation #3

Open quqixun opened 2 years ago

quqixun commented 2 years ago

Hello, I am re-implementing this paper and have run into the following issues:

  1. When cropping faces from the Celebrity-Asian and VGGFace2 datasets, many of the cropped faces are blurry. How did you handle these blurry samples, and roughly how much data did you end up using for training? The paper does not describe this in detail;

  2. In the last paragraph of the Feature-Level section:

    After the feature-level fusion, we generate Ilow to compute auxiliary loss for better disentangling the identity and attributes. Then we use a 4× Upsample Module Fup which contains several res-blocks to better fuse the feature maps. Based on Fup, it's convenient for our HifiFace to generate even higher resolution results (e.g., 512 × 512).

    In the Experiments section:

    For our more precise model (i.e., Ours-512), we adopt a portrait enhancement network [Li et al., 2020] to improve the resolution of the training images to 512×512 as supervision, and also correspondingly add another res-block in Fup of SFF compared to Ours-256.

    Combining this with your answer in issue #2, could you check whether my understanding below is correct:

    2.1 Both Ours-256 and Ours-512 take a 256×256 It as input;
    2.2 In Ours-256, after obtaining z_fuse, it first passes through two AdaIN res-blocks and then through Fup, which consists of two res-blocks with upsampling;
    2.3 Compared with Ours-256, the only difference in Ours-512 is that its Fup module has one additional res-block with upsampling;

  3. Other questions:

    3.1 What is the structure of the output layers used to obtain Mlow, Mr, Ilow, and Ir?
    3.2 After obtaining the facial landmarks from the 3D face model, their coordinates lie in [0, 224]; do you rescale them to [0, 1]?
    3.3 When computing Lcyc, the cycle pass outputs G(Ir, It); is Ir detached? During training I found Lcyc to be very low, one to two orders of magnitude smaller than the other losses.
    3.4 How do you sample data during training?
    3.5 The discriminator also uses res-blocks, and the res-blocks use InstanceNorm2d; could you confirm that InstanceNorm2d is used in the discriminator?
    3.6 Could you list the detailed structure of SFF?
    3.7 HRNet's face segmentation results are rather poor; did you apply any additional optimization?
    3.8 Is training done in stages, or is the discriminator involved from the very beginning?
    3.9 When the source and target face shapes differ a lot, the generated result shows a double-chin artifact; is this artifact suppressed by the discriminator?

I have quite a few questions; looking forward to your reply. Thank you.

johannwyh commented 2 years ago

Thank you for your interest in HifiFace; this project is still going through the open-source approval process, so thank you for your patience. We also sincerely hope you can raise your questions again in English so that this information can be shared with more of the community. I will answer your questions here. Feel free to keep in touch.

  1. We stated our data cleaning process in the Implementation Details part of Section 4, Experiments:

    For our model with resolution 256 (i.e., Ours-256), we remove images with either size smaller than 256 for better image quality.

The size of Celebrity-Asian is roughly 680k images, while that of VGGFace2 is 640k.

    1. Exactly. Ours-256 and Ours-512 both use 256×256 input.
    2. Exactly (a rough sketch of this decoder tail follows below).
    3. Exactly. By the way, we use an enhancement model to create pairs for same-identity samples.
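
If it helps other readers, here is a rough PyTorch sketch of how I read 2.2/2.3: z_fuse passes through two AdaIN res-blocks and then through Fup, a stack of upsampling res-blocks, with Ours-512 adding one more block. The block internals and channel widths below are placeholders of my own, not the authors' implementation.

```python
import torch.nn as nn

class UpResBlock(nn.Module):
    """Placeholder residual block with 2x nearest-neighbour upsampling."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        x = self.up(x)
        return self.body(x) + self.skip(x)

def build_fup(base_ch: int = 256, ours_512: bool = False) -> nn.Sequential:
    """F_up: two upsampling res-blocks for Ours-256; Ours-512 adds one more."""
    blocks = [UpResBlock(base_ch, base_ch // 2), UpResBlock(base_ch // 2, base_ch // 4)]
    if ours_512:
        blocks.append(UpResBlock(base_ch // 4, base_ch // 4))
    return nn.Sequential(*blocks)
```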

Feedback on the questions in part 3 is given below.

johannwyh commented 2 years ago

For questions in part 3,

  1. We generate I and M from feature maps of the corresponding size, where (a minimal sketch of these output heads follows after this list):

    • for I, the feature map goes through a LeakyReLU(0.2) activation and then a Conv layer; the output ranges in [-1, 1]
    • for M, the feature map goes through a Conv layer and then a Sigmoid; the output ranges in [0, 1]
  2. We do not transform them. The loss is balanced by its weighting parameter.

  3. Your findings are correct. We do not detach I_r, and the loss value is relatively small.
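
For anyone re-implementing 3.1, a minimal PyTorch sketch of such output heads could look like the following; the channel counts and the final tanh used to bound I in [-1, 1] are my assumptions, since only the LeakyReLU/Conv and Conv/Sigmoid orderings are stated above.

```python
import torch
import torch.nn as nn

class ImageHead(nn.Module):
    """Produces an RGB image I from a feature map: LeakyReLU(0.2) -> Conv."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(in_ch, 3, kernel_size=3, padding=1),
            nn.Tanh(),  # assumed; the thread only states that the output ranges in [-1, 1]
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.head(feat)

class MaskHead(nn.Module):
    """Produces a single-channel mask M from a feature map: Conv -> Sigmoid."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.head(feat)
```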

johannwyh commented 2 years ago

For questions in part 3,

  1. In every mini-batch, we put 50% pairs of the same identity and 50% of different identities (see the sampling sketch after this list).
  2. In the discriminator, we set normalize=False when building the ResBlks, so there are actually no normalization layers in the discriminator.
  3. You can refer to the supplementary materials of our arXiv version and issue #2 to find the information you need.
  4. Our HRNet is trained on face data; you can use any face segmentation model that performs well.
  5. Our entire model is trained end-to-end, and the discriminator starts training from the first epoch. It is worth mentioning that it may take 1000-5000 iterations for training to warm up, before which G generates nothing meaningful.
  6. When the source face is much thinner than the target, it is the most challenging case for face-shape-preserving swapping. Our SFF is designed to handle this but is still not a perfect solution. You are sincerely welcome to discuss your ideas on this with us; we are also continually working on producing more robust and impressive results.
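
As a reference for point 1, here is a minimal sampling sketch of my own (not the authors' code), assuming an identity_to_images mapping built from the cleaned dataset:

```python
import random

def sample_pair(identity_to_images: dict, same_prob: float = 0.5):
    """Sample a (source, target) image pair; roughly 50% share an identity.

    identity_to_images: dict mapping identity id -> list of image paths.
    Returns the two paths and a flag marking same-identity pairs, which can be
    used to mask identity-dependent losses per sample.
    """
    ids = list(identity_to_images.keys())
    if random.random() < same_prob:
        pid = random.choice(ids)
        src = random.choice(identity_to_images[pid])
        tgt = random.choice(identity_to_images[pid])
        same = True
    else:
        sid, tid = random.sample(ids, 2)
        src = random.choice(identity_to_images[sid])
        tgt = random.choice(identity_to_images[tid])
        same = False
    return src, tgt, same
```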
quqixun commented 2 years ago

Thanks a lot for your response. It helps a lot in understanding the paper better.

Some questions about the dataset:

Some questions about training:

Some questions about the implementation:

johannwyh commented 2 years ago

About Section 5

  1. Yes
  2. I am sorry that I cannot make direct suggestions about the situation you encountered, but I can give you two hints that might ease your stress (see the sketch after this list):
    • Our Lshape is clamped to (0, 10); otherwise, at early stages, an extreme face-shape difference might cause training to collapse.
    • At early stages (roughly the first 5000 iterations), the generator will "learn" to generate an image exactly the same as the target. This lasts for some iterations, and then the losses force the generator to give it up.
    • I hope this information helps.
  3. Some losses only make sense when the source and target have the same identity:
    • Llpips can only be calculated when the source and target are the same person.
    • Lcyc can be applied to different-identity pairs; you can work through the generation logic yourself to see why.
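
To make the clamping in point 2 and the identity-conditional losses in point 3 concrete, here is a hedged PyTorch sketch; the L1 landmark distance, the lpips_fn interface, and the masking scheme are illustrative choices of my own, not confirmed details of the official code.

```python
import torch
import torch.nn.functional as F

def shape_loss(lmk_fused: torch.Tensor, lmk_result: torch.Tensor, max_value: float = 10.0):
    """L_shape sketch: L1 over reconstructed landmarks, clamped so that extreme
    face-shape differences cannot blow up early training."""
    loss = F.l1_loss(lmk_fused, lmk_result)
    return loss.clamp(max=max_value)

def identity_conditional_losses(i_result, i_target, same_id_mask, lpips_fn):
    """Apply reconstruction/perceptual terms only on same-identity pairs.

    same_id_mask: bool tensor of shape (B,), True where source == target identity.
    lpips_fn: any perceptual distance callable, e.g. lpips.LPIPS(net='alex').
    """
    if same_id_mask.any():
        rec = F.l1_loss(i_result[same_id_mask], i_target[same_id_mask])
        perc = lpips_fn(i_result[same_id_mask], i_target[same_id_mask]).mean()
    else:
        rec = i_result.new_zeros(())
        perc = i_result.new_zeros(())
    return rec, perc
```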

About Section 6

johannwyh commented 2 years ago

About Section 4

yfji commented 2 years ago

(quoting johannwyh's answers to the part-3 questions above)

What is the effect on the generator and the synthesis result of using InstanceNorm in the discriminator or not? I have had this question for a long time. In my own re-implementation, I found the discriminator loss was always very high without InstanceNorm, and the synthesis was terrible. When I added InstanceNorm, the result was OK. Besides, have you tried PatchGAN?

johannwyh commented 2 years ago

(quoting yfji's question above about InstanceNorm in the discriminator and PatchGAN)

Sorry for the late reply.

It is quite odd that you encounter this issue. In fact, in our implementation, we do not use InstanceNorm in the discriminator. In the generator, besides AdaIN, we use InstanceNorm in the encoder and bottleneck parts.

I wonder whether your backpropagation of the D loss is correct. Remember to detach the generator output when backpropagating the D loss (a minimal sketch follows below).
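
In case it helps other re-implementers, here is a minimal sketch of a discriminator step with the generated image detached (plain PyTorch; the specific loss form is illustrative, not necessarily the official one):

```python
import torch
import torch.nn.functional as F

def d_step(discriminator, d_optimizer, real_img, fake_img):
    """One discriminator update. fake_img is the generator output; detaching it
    ensures the D loss does not backpropagate into the generator."""
    d_optimizer.zero_grad()
    real_logits = discriminator(real_img)
    fake_logits = discriminator(fake_img.detach())  # detach G's output here
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    )
    d_loss.backward()
    d_optimizer.step()
    return d_loss.item()
```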

yfji commented 2 years ago

(quoting the exchange above about InstanceNorm in the discriminator)

Yeah, I fixed a few things and the discriminator without IN works now, but there seems to be no difference from the discriminator using IN. I am still curious about the effect of IN on the discriminator, and about a single output vs. PatchGAN. I'd be very happy to hear your opinion! Besides, I find that the synthesis in videos is not as realistic as in single images. In videos, the faces often suffer from color jittering and shaking between consecutive frames (I already use alpha filtering in face alignment and color re-normalization). Do you have any suggestions? Thank you very much!

johannwyh commented 2 years ago

HifiFace is mainly researched on image-based swapping rather than video, so it is normal that the model does not perform perfectly on video; we did not tune for it.

If you need any help with video synthesis, feel free to drop me an email, and perhaps we can provide an official generation for you to compare with your result.

yfji commented 2 years ago

(quoting the reply above about video synthesis)

Hi, I've sent an email to hififace.youtu@gmail.com; looking forward to your reply! Thanks!

gitlabspy commented 2 years ago

Hi, according to what you mentioned above, M_{low} is generated by passing z_{dec} through a conv layer with a sigmoid activation, and I_{low} is generated by passing z_{dec} through a LeakyReLU and a conv layer?

taotaonice commented 2 years ago

Your work is great and really nice. My re-implementation result seems to be OK, but I have some questions about the re-implementation.

  1. Are the adversarial or LPIPS losses applied to the 64x64-level images?
  2. Is the GAN loss implemented as WGAN-GP?
  3. Is the adversarial loss applied to I_cyc?

Also, in my experiments, PatchGAN seems to perform better. Hoping for your kind answer!

johannwyh commented 2 years ago

(quoting gitlabspy's question above)

Exactly

johannwyh commented 2 years ago

(quoting taotaonice's questions above)

  1. Neither is applied to the 64x64 images in our implementation.
  2. The GAN loss is implemented as the raw GAN loss with the log D trick (we simply follow the setting of StarGAN v2). For the discriminator loss, we apply a gradient penalty (see the sketch below).
  3. No. I_cyc is only used in the cycle loss.

Thank you very much for your advice; we will definitely try more SOTA backbones for better results!
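
For point 2, here is a hedged sketch of a StarGAN v2-style adversarial loss together with an R1 gradient penalty on real images; reading "gradient penalty" as R1 (as in StarGAN v2) and the exact weighting are my assumptions.

```python
import torch
import torch.nn.functional as F

def adv_loss(logits: torch.Tensor, target: int) -> torch.Tensor:
    """Non-saturating GAN loss with the log D trick, as used in StarGAN v2."""
    targets = torch.full_like(logits, float(target))
    return F.binary_cross_entropy_with_logits(logits, targets)

def r1_penalty(d_real_logits: torch.Tensor, real_img: torch.Tensor) -> torch.Tensor:
    """R1 gradient penalty on real samples; real_img must have requires_grad=True."""
    grad = torch.autograd.grad(
        outputs=d_real_logits.sum(), inputs=real_img, create_graph=True
    )[0]
    return 0.5 * grad.pow(2).reshape(grad.size(0), -1).sum(1).mean()

# Typical usage in the D step (lambda_r1 is a hypothetical weight):
#   real_img.requires_grad_()
#   real_logits = D(real_img)
#   d_loss = adv_loss(real_logits, 1) + adv_loss(D(fake_img.detach()), 0) \
#            + lambda_r1 * r1_penalty(real_logits, real_img)
```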

gitlabspy commented 2 years ago

Thanks! I still have some questions:

  1. For the face model built from the coefficients in the Lshape loss, does q_fuse (or q_r) correspond to face_shape in these lines of code? https://github.com/sicxu/Deep3DFaceRecon_pytorch/blob/master/models/bfm.py#L86-L99
  2. You mentioned that the LPIPS loss only makes sense with two images of the same person, so is all the training data paired with the same person (i.e., source and target are the same person)?
johannwyh commented 2 years ago

(quoting gitlabspy's follow-up questions above)
  1. q_fuse is the set of landmarks reconstructed from a fused embedding using the face reconstruction model. "Landmarks" here means the 17 facial contour points among the 68 landmark points, and the "fused embedding" is an embedding that uses the source's shape (id) coefficients and the target's expression and pose coefficients. The reconstruction process does not use textures (see the sketch after this list).
  2. See the Implementation Details: 50% of the training pairs are of the same identity, while the others are of different identities. For different-identity pairs, simply do not apply these losses to their results.
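
For readers wiring up Lshape, here is a hedged sketch of how the fused-coefficient contour landmarks could be computed with Deep3DFaceRecon_pytorch; the coefficient layout and the compute_for_render call are taken from that repo's bfm.py, but treat the exact slicing and helper names as assumptions to verify against the code you use.

```python
import torch

CONTOUR_IDX = torch.arange(17)  # the 17 jaw/contour points of the 68-landmark convention

def fuse_coeff(src_coeff: torch.Tensor, tgt_coeff: torch.Tensor) -> torch.Tensor:
    """Fused 3DMM coefficient: source shape (id) + target expression/pose.
    Assumes the layout id[0:80] | exp[80:144] | tex[144:224] | angle[224:227]
    | gamma[227:254] | trans[254:257] used by Deep3DFaceRecon_pytorch."""
    fused = tgt_coeff.clone()
    fused[:, :80] = src_coeff[:, :80]  # take the shape identity from the source
    return fused

def contour_landmarks(face_model, coeff: torch.Tensor) -> torch.Tensor:
    """Reconstruct the 68 projected landmarks and keep the 17 contour points;
    the texture/color outputs of compute_for_render are simply discarded."""
    _, _, _, lmk68 = face_model.compute_for_render(coeff)  # (B, 68, 2)
    return lmk68[:, CONTOUR_IDX]

# L_shape could then be, e.g., an L1 distance between the contour landmarks of the
# fused coefficient and those reconstructed from the swapped result's coefficient.
```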
dypromise commented 2 years ago

Hi author, thanks for your great work! I have some questions about the implementation details:

  1. You mentioned that Mask_gt in loss_G_seg is dilated from the mask of the It image. How big is the dilation kernel?
  2. About loss_rec and loss_lpips (taking loss_rec as an example): you mentioned it is only enabled when Is and It come from the same identity. In theory it should be like this, but I have re-implemented papers such as FaceShifter and SimSwap, whose papers say the same thing; in practice, however, keeping loss_rec on regardless of whether I_t and I_s share an identity gives better results, especially for preserving the attributes of the I_t face. So, in your implementation, do you really apply loss_rec only when the pair comes from the same identity, or is the loss always on?
  3. Did you normalize the 3D coefficients before concatenating them with the ArcFace embedding, or normalize both kinds of coefficients and then concatenate?
Continue7777 commented 2 years ago

I tried to re-implement the paper but ran into some trouble; asking for help. First, after training for a long time, the mask tends to become all empty (all 0) and the rec loss stays near 0. To work around this, I use only the segmentation loss, and I find that the high-resolution seg loss works well while the low-resolution seg loss is ineffective. I think the high-resolution seg loss tends to force the low-resolution seg output toward empty for a better overall result: it establishes a short path for the high-resolution loss and a longer, more difficult path for the low-resolution one.

Continue7777 commented 2 years ago

(image attachment)

xuehy commented 1 year ago

(quoting the exchange above about InstanceNorm in the discriminator, video synthesis, and color re-normalization)

How did you apply color re-normalization? Is there any reference article or code?