harlanhong / CVPR2022-DaGAN

Official code for CVPR2022 paper: Depth-Aware Generative Adversarial Network for Talking Head Video Generation
https://harlanhong.github.io/publications/dagan.html

How do I train the network with my own data? #43

Open pcmdrg opened 1 year ago

pcmdrg commented 1 year ago

Hi, First I want to thank you for providing the code. DaGAN works like magic.

Here is my issue: I'd like to create a video of a guy with a strong emotion, like screaming. I have the driving video, but the clip generated by DaGAN doesn't show the strong emotion of the driving video: the mouth only opens slightly, unlike the wide-open mouth in the driving video.

I thought it was a dataset problem: there aren't many strong emotions in the VoxCeleb dataset, which consists of interview videos. So I set out to train the model from scratch on my driving video (about 1500 face images). I used your ResNet-50 depth encoder/decoder pretrained weights and trained my own generator, keypoint detector, and discriminator. However, the results are horrible: the face doesn't even change expression.

My questions are: 1. Should I train from scratch, or just fine-tune your model with my driving video? 2. When I train the network, I just input a bunch of face images of the same person with different expressions/head poses. Is this right? Do the "driving" and "source" frames have to be close together in the video (only slight expression/pose change)?

Thanks a lot!

harlanhong commented 1 year ago

1. You could just fine-tune my model, and it should work.
2. Your input is correct: the source and driving faces should be the same person during the training stage, and we don't impose any constraints between source and driving (e.g. on expression or pose).
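
For point 1, a minimal sketch of how I would warm-start fine-tuning from the released checkpoint. The key names below assume the FOMM-style checkpoint layout (`generator`, `kp_detector`, `discriminator`); adjust them to whatever your checkpoint file actually contains:

```python
import torch
from torch import nn

def load_pretrained(checkpoint_path: str,
                    generator: nn.Module,
                    kp_detector: nn.Module,
                    discriminator: nn.Module) -> None:
    """Warm-start fine-tuning from a released checkpoint.

    Assumes the checkpoint is a dict holding 'generator', 'kp_detector'
    and 'discriminator' state dicts (FOMM-style naming).
    """
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    generator.load_state_dict(ckpt["generator"])
    kp_detector.load_state_dict(ckpt["kp_detector"])
    discriminator.load_state_dict(ckpt["discriminator"])
```

A smaller learning rate than the from-scratch setting usually helps keep the pretrained behaviour intact while adapting to your data.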

Would you like to share your trained results here, so that I can get more information and give more appropriate suggestions?

pcmdrg commented 1 year ago

Thanks for your reply!

I fine-tuned the model (SPADEDepth...pth.tar) over the weekend with my own data. The results are still not good. I'm not sure how to share a video here, but the output frames are basically copies of the source image with slight deformation; the mouth doesn't open at all. If I directly use SPADEDepth...pth.tar, the mouth/pose are correct; the only problem is that the mouth doesn't open widely enough to show strong emotion.

A few more questions:

1. Do I need multiple GPUs to train? For training I use one 3090 GPU with a batch size of 2 -- the largest batch size that fits within the memory limit.
2. I put all the face images into a folder, for example tom_cruise/train/1, and set tom_cruise as the root_dir in the config file. Is this right? (A sketch of the layout I am assuming is shown after this list.)
3. What will happen if the face images are from the same person but the lighting is different (night/day)? Does the network consider them two different people?
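
For reference, this is the directory layout I am assuming (folder and file names are just my example; each numbered sub-folder holds the extracted frames of one clip, and I'm guessing a test/ split is also expected, as in the FOMM-style frames datasets):

```
tom_cruise/
├── train/
│   ├── 1/        # 000001.png, 000002.png, ... frames of clip 1
│   └── 2/        # frames of clip 2
└── test/
    └── 1/        # a held-out clip for evaluation
```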

Thanks again for your help!

harlanhong commented 1 year ago

1. The batch size is an important hyper-parameter of the network; it affects the batch-norm layers.
2. You can check the output of the dataset to see whether the fetched data is correct.
3. The network attempts to detect human faces and output their keypoints. If the lighting makes the faces too hard to detect, the performance will be degraded. Actually, we don't consider the identity of the input during training; we use two images with the same identity as input because we can then treat the driving image as the ground truth to supervise the network's learning.
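
For point 2, a quick way to eyeball what the data loader is feeding the model is to dump a few fetched pairs to disk. A minimal sketch -- the "source" and "driving" keys follow the FOMM-style dataset convention, so swap in whatever keys your dataset actually returns:

```python
import torch
from torch.utils.data import DataLoader
from torchvision.utils import save_image

def dump_batches(dataset, out_prefix="debug", num_batches=4):
    """Save a few fetched source/driving pairs so you can inspect
    resolution, value range and cropping before training."""
    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    for i, batch in enumerate(loader):
        if i >= num_batches:
            break
        # Expect float tensors in [0, 1] with shape (B, 3, 256, 256).
        print(batch["source"].shape, batch["source"].min(), batch["source"].max())
        save_image(torch.cat([batch["source"], batch["driving"]], dim=0),
                   f"{out_prefix}_{i}.png", nrow=2)
```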

pcmdrg commented 1 year ago

Thanks for your input! I double-checked the output of the dataset and figured out the culprit: I forgot to resize the input images to 256 x 256! After I resized the images, the results improved a lot.
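
In case anyone else hits the same problem, this is roughly the preprocessing step I added (plain Pillow; the paths and PNG extension are just placeholders from my setup):

```python
from pathlib import Path
from PIL import Image

def resize_frames(src_dir: str, dst_dir: str, size=(256, 256)) -> None:
    """Resize every extracted frame to the 256x256 input the model expects."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for frame in sorted(Path(src_dir).glob("*.png")):
        Image.open(frame).convert("RGB").resize(size, Image.LANCZOS).save(dst / frame.name)
```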

I have three more questions on the dataset.

1. If I have two interview clips of Tom Cruise with different lighting conditions (but I believe both lightings are good enough for face detection), should I put the faces into two separate folders under "train/", or into one folder? I ask because in my experiment I put all the images into one folder, and the output video flickers (the shading on the face changes) as the head pose changes. I suspect this is because the algorithm "treats the driving image as the ground truth to supervise the network learning": if the source and driving images have different lighting, and since DaGAN works by "warping" the source image using the motion field, the algorithm will learn to change the lighting along with the head pose.

2. For the same reason that DaGAN works by "warping" the source image using the motion field, I was wondering whether the background affects the final results. Since the VoxCeleb dataset is composed of interview clips, most of the background is fixed between frames, so the network only needs to learn to "warp" the face region. However, if the face dataset comes from a movie with a changing background, the network will also learn to "warp" both the face and the background, which may harm the quality of the generated face. Should I use a mask to mask out the face region? (A rough sketch of what I mean is after this list.)

3. Can I include face images where the face is turned to one side (profile view)?
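
To illustrate question 2, this is the kind of preprocessing I had in mind: keep only a box around the detected face and blank out the rest. OpenCV's Haar cascade here is just the simplest detector I could think of, not necessarily what you would recommend:

```python
import cv2
import numpy as np

# Haar cascade shipped with OpenCV; any face detector would do.
_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mask_background(frame: np.ndarray, margin: float = 0.3) -> np.ndarray:
    """Zero out everything outside an enlarged box around the first detected face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return frame  # keep the frame unchanged if no face is found
    x, y, w, h = faces[0]
    pad_w, pad_h = int(w * margin), int(h * margin)
    y0, y1 = max(0, y - pad_h), min(frame.shape[0], y + h + pad_h)
    x0, x1 = max(0, x - pad_w), min(frame.shape[1], x + w + pad_w)
    masked = np.zeros_like(frame)
    masked[y0:y1, x0:x1] = frame[y0:y1, x0:x1]
    return masked
```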

Thanks for the help!

harlanhong commented 1 year ago

Will get back to you soon. May I know your institution? Please do not use our project for commercial purposes without our permission.