shrubb / latent-pose-reenactment

The authors' implementation of the "Neural Head Reenactment with Latent Pose Descriptors" (CVPR 2020) paper.
https://shrubb.github.io/research/latent-pose-reenactment/
Apache License 2.0

Question about preprocessing Voxceleb2 #27

Closed · khlee369 closed 3 years ago

khlee369 commented 3 years ago

Hello shrubb!

Thanks for sharing your research and code. I'm researching neural talking heads based on your work, and it would be helpful if you could give me some advice.

I think this will be an extension of #6. When I downloaded the video files provided by the official VoxCeleb2 site (https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html), I ran into a problem where the forehead was cropped out, as follows.

[image: example frame with the forehead cropped off]

#6 showed the following results.

[image: example result from #6]

In the paper (https://arxiv.org/pdf/2004.12000.pdf), the data preprocessing is described as follows:

Our training dataset is a collection of YouTube videos from VoxCeleb2 [4]. There are on the order of 100,000 videos of about 6,000 people. We sampled 1 of every 25 frames from each video, leaving around seven million of total training images. In each image, we re-cropped the annotated face by first capturing its bounding box with the S3FD detector [43], then making that box square by enlarging the smaller side, growing the box’s sides by 80% keeping the center, and finally resizing the cropped image to 256 × 256.
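
For reference, here is my reading of that procedure as a short sketch (my own code, not from this repository; the format of the S3FD box and the padding behavior when the grown box leaves the frame are my assumptions):

```python
import cv2

def crop_as_in_paper(image, box, grow=0.8, size=256):
    """Sketch of the crop rule quoted above; not the repository's actual code."""
    x1, y1, x2, y2 = box  # face box from S3FD, assumed (x1, y1, x2, y2)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # 1) make the box square by enlarging the smaller side
    side = max(x2 - x1, y2 - y1)
    # 2) grow the box's sides by 80%, keeping the center
    side *= 1 + grow
    x1, y1 = int(round(cx - side / 2)), int(round(cy - side / 2))
    x2, y2 = int(round(cx + side / 2)), int(round(cy + side / 2))
    # When the grown box leaves the frame it must be padded somehow;
    # the padding mode is my guess (reflection padding, for instance,
    # would mirror border pixels, e.g. around the forehead)
    pad_t, pad_l = max(0, -y1), max(0, -x1)
    pad_b, pad_r = max(0, y2 - image.shape[0]), max(0, x2 - image.shape[1])
    image = cv2.copyMakeBorder(image, pad_t, pad_b, pad_l, pad_r,
                               cv2.BORDER_REFLECT)
    crop = image[y1 + pad_t : y2 + pad_t, x1 + pad_l : x2 + pad_l]
    # 3) resize the square crop to 256 x 256
    return cv2.resize(crop, (size, size))
```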

When using the VoxCeleb2 data, did you download the videos directly from their YouTube URLs and preprocess them yourself, instead of using the video files provided by the official site? If so, could you please share the data?

shrubb commented 3 years ago

Hi,

#6 showed the following results.

Looks correct. Yes, there's a problem with forehead reflections; we had it too. It still happened even when we used the original videos (though it wasn't as severe as in your picture).

When using Voxceleb2 data, did you download the video directly through the youtube url and pre-process it instead of using the video file provided by the official?

Yes.

If so, could you please share the data?

Sorry, I don't have access to that data anymore. Even if I did, it's so huge that I don't know how I could share it anyway... However, you can download most of it from YouTube yourself.
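
If it helps, something along these lines should fetch most of the source videos (a sketch using the third-party yt-dlp tool, not part of this repo; I'm assuming you have collected the YouTube IDs, which if I remember correctly appear as directory names in the VoxCeleb2 metadata, into a video_ids.txt):

```python
import subprocess
from pathlib import Path

out_dir = Path("voxceleb2_videos")
out_dir.mkdir(exist_ok=True)

# One YouTube ID per line; building this list from the
# VoxCeleb2 metadata is up to you
for video_id in Path("video_ids.txt").read_text().split():
    target = out_dir / f"{video_id}.mp4"
    if target.exists():
        continue  # already downloaded
    # Some videos are deleted or private by now, so tolerate failures
    subprocess.run(
        ["yt-dlp", "-f", "mp4", "-o", str(target),
         f"https://www.youtube.com/watch?v={video_id}"],
        check=False,
    )
```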

khlee369 commented 3 years ago

Thank you for your reply! It's really helpful. I have two more questions about preprocessing.

I think that when you use a face detector (S3FD) to preprocess VoxCeleb2, it has the side effect of roughly aligning the face (and I guess this is why the images shake, as shown in #3).

However, when I check FFHQFaceCropper in utils/crop_as_in_dataset.py, there is an explicit face-alignment step, yet the default in utils/preprocess_dataset.sh is to leave FFHQ alignment set to False.

So my questions are,

  1. When you preprocessed the VoxCeleb2 data after downloading the videos directly from YouTube, did you do face alignment like FFHQFaceCropper? (I cannot find anything about face alignment in the paper.)

  2. Do you have any idea how face alignment affects the performance of neural talking heads?

shrubb commented 3 years ago

did you do face alignment like FFHQFaceCropper?

No, we used LatentPoseFaceCropper.

do you have any idea how face alignment affects the performance of neural talking heads?

FFHQFaceCropper may be a liiiiittle bit better, because some parts of the face stay at roughly the same pixels in the image across various poses. On the other hand, LatentPoseFaceCropper's shakiness brings more robustness, because it can be seen as a form of data augmentation. So, it's a broad question for future research.
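
To illustrate the data-augmentation view: the detector's shakiness acts roughly like randomly jittering an otherwise stable crop box on every frame, something like this sketch (my illustration, not code from the repo):

```python
import random

def jitter_box(box, max_shift=0.05, max_scale=0.05):
    """Randomly shift and scale a square (x1, y1, x2, y2) crop box,
    roughly imitating frame-to-frame detector noise."""
    x1, y1, x2, y2 = box
    side = x2 - x1
    # random translation, proportional to box size
    cx = (x1 + x2) / 2 + random.uniform(-max_shift, max_shift) * side
    cy = (y1 + y2) / 2 + random.uniform(-max_shift, max_shift) * side
    # random zoom in or out
    half = side * (1 + random.uniform(-max_scale, max_scale)) / 2
    return (cx - half, cy - half, cx + half, cy + half)
```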

khlee369 commented 3 years ago

Thank you, shrubb!