shaoanlu / faceswap-GAN

A denoising autoencoder + adversarial losses and attention mechanisms for face swapping.

Deep Video Portraits #101

Closed. Neltherion closed this issue 6 years ago.

Neltherion commented 6 years ago

I was wondering what your opinion is about Deep Video Portraits?

iperov commented 6 years ago

The paper will be released in August. I don't think it is easy to implement from scratch without a bunch of pretrained estimator models, which don't exist in free access. Also, I think the minimum GPU VRAM requirement for such a result is 16 GB.

Jack29913 commented 6 years ago

The paper has been released: https://arxiv.org/pdf/1805.11714.pdf

A GeForce GTX Titan Xp (12 GB VRAM) was used.

iperov commented 6 years ago

@Apollo122 do you understand the implementation?

Jack29913 commented 6 years ago

Haven't had a chance to study it, but it looks like the network has 8 downsample and 8 upsample modules. What I like most is that it doesn't need a large amount of frames to train. If this is true, then color correction is no longer an issue. Also, 256x256 took 10 hours and 512x512 took 42 hours to train. From the paper:

Typically, two thousand video frames, i.e., about one minute of video footage, are sufficient to train our network (see Section 7). ... We train our networks using the TensorFlow [Abadi et al. 2015] deep learning framework. The gradients for back-propagation are obtained using Adam [Kingma and Ba 2015]. We train for 31,000 iterations with a batch size of 16 (approx. 250 epochs for a training corpus of 2,000 frames) using a base learning rate of 0.0002 and first momentum of 0.5; all other parameters have their default value. We train our networks from scratch, and initialize the weights based on a Normal distribution N(0, 0.2).
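For context, those hyperparameters translate directly into a standard TensorFlow/Keras setup. Here is a minimal sketch of an 8-downsample / 8-upsample generator using the quoted optimizer and weight-init settings; the filter counts, kernel sizes, and skip connections are assumptions, not taken from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, initializers

# Weight init quoted above: Normal distribution N(0, 0.2).
init = initializers.RandomNormal(mean=0.0, stddev=0.2)

def build_generator(size=256, in_ch=3, out_ch=3):
    """Sketch of an 8-downsample / 8-upsample encoder-decoder.
    Filter widths and skip connections are guesses, not from the paper."""
    inp = layers.Input((size, size, in_ch))
    x, skips = inp, []
    filters = [64, 128, 256, 512, 512, 512, 512, 512]
    for f in filters:                          # 8 downsample modules: 256 -> 1
        x = layers.Conv2D(f, 4, strides=2, padding="same",
                          kernel_initializer=init)(x)
        x = layers.LeakyReLU(0.2)(x)
        skips.append(x)
    # 7 upsample modules with skip connections, plus the output layer = 8 total
    for f, skip in zip(reversed(filters), reversed(skips[:-1])):
        x = layers.Conv2DTranspose(f, 4, strides=2, padding="same",
                                   kernel_initializer=init)(x)
        x = layers.ReLU()(x)
        x = layers.Concatenate()([x, skip])
    x = layers.Conv2DTranspose(out_ch, 4, strides=2, padding="same",
                               activation="tanh", kernel_initializer=init)(x)
    return Model(inp, x)

# Quoted settings: Adam, base learning rate 2e-4, first momentum (beta_1) 0.5.
# 31,000 iterations at batch size 16 over 2,000 frames is the ~250 epochs mentioned.
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
```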

iperov commented 6 years ago

Not from scratch. From the paper:

First, we track the source and target actor using a state-of-the-art monocular face reconstruction approach that uses a parametric face and illumination model.

Parametric Face Representation. We represent the space of facial identity based on a parametric head model [Blanz and Vetter 1999], and the space of facial expressions via an affine model.

Diffuse skin reflectance is modeled similarly by a second affine model r ∈ R^{3N} that stacks the diffuse per-vertex albedo.

The geometry basis {b_geo_k}_{k=1}^{N_α} has been computed by applying principal component analysis (PCA) to 200 high-quality face scans [Blanz and Vetter 1999].

The reflectance basis {b_ref_k}_{k=1}^{N_β} has been obtained in the same manner.

For dimensionality reduction, the expression basis {b_exp_k}_{k=1}^{N_δ} has been computed using PCA, starting from the blendshapes of Alexander et al. [2010] and Cao et al. [2014b].

Their blendshapes have been transferred to the topology of Blanz and Vetter [1999] using deformation transfer [Sumner and Popović 2004].
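The quoted section describes a standard linear (affine) 3DMM: geometry is the mean shape plus weighted sums of the PCA identity and expression bases, and diffuse reflectance is a second affine model over a reflectance basis. A minimal sketch of that synthesis step; the vertex count and basis sizes are illustrative (the quote only fixes that geometry and reflectance live in R^{3N}):

```python
import numpy as np

# Illustrative shapes; 80/80/64 basis sizes are typical for this model
# family but are assumptions, not quoted from the paper.
N = 5000                             # number of mesh vertices
v_mean = np.zeros(3 * N)             # average face geometry
r_mean = np.zeros(3 * N)             # average diffuse per-vertex albedo
B_geo = np.zeros((3 * N, 80))        # {b_geo_k}, k = 1..N_alpha (PCA of 200 scans)
B_ref = np.zeros((3 * N, 80))        # {b_ref_k}, k = 1..N_beta
B_exp = np.zeros((3 * N, 64))        # {b_exp_k}, k = 1..N_delta (PCA of blendshapes)

def synthesize(alpha, beta, delta):
    """Affine 3DMM synthesis: identity + expression for geometry,
    a second affine model for diffuse reflectance."""
    v = v_mean + B_geo @ alpha + B_exp @ delta   # geometry in R^{3N}
    r = r_mean + B_ref @ beta                    # reflectance in R^{3N}
    return v.reshape(N, 3), r.reshape(N, 3)
```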

Looks like it requires at least four estimator models. So this paper is nothing.

Neltherion commented 6 years ago

No wonder 8 people from different countries worked on this!

iperov commented 6 years ago

Because a NN cannot recognize head pose, facial expressions, diffuse skin, illumination, etc. from scratch without a human pointing it at such parameters. So IMHO, estimator models are required.

iperov commented 6 years ago

The maximum I got from a non-GAN model: https://coub.com/view/1954x3

iperov commented 6 years ago

wow @shaoanlu

shaoanlu commented 6 years ago

This is not an issue and is also unrelated to the faceswap-GAN project.

mrgloom commented 5 years ago

As I understand it, it needs some high-quality 3DMM fit as input (maybe https://github.com/cleardusk/3DDFA can be used) as a coarse approximation. Their correspondence image looks like PNCC: https://raw.githubusercontent.com/cleardusk/3DDFA/master/samples/demo_pncc_paf.jpg Also, a segmentation model is needed for the eyes (Eye and Gaze Map).

As I understand it, the result will be highly dependent on the quality of the 3DMM fit.
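On the PNCC correspondence image mentioned above: the idea (from the 3DDFA line of work) is to render the fitted mesh with each vertex colored by its min-max-normalized mean-shape coordinate, giving the network a dense, pose-independent correspondence input. A rough point-splat sketch, assuming you already have the mean shape and the projected 2D vertex positions; the function name and shapes are illustrative, not 3DDFA's API:

```python
import numpy as np

def pncc_image(mean_shape, proj_xy, height, width):
    """Rasterize a PNCC-style correspondence map: every projected vertex is
    colored by its min-max-normalized mean-shape (x, y, z) coordinate.

    mean_shape: (N, 3) mean 3DMM vertices; proj_xy: (N, 2) pixel coordinates.
    A point-splat stand-in for the proper triangle rasterization with
    z-buffering that a real implementation would need."""
    # Normalize mean-shape coordinates to [0, 1] per axis -> stable RGB code.
    lo, hi = mean_shape.min(axis=0), mean_shape.max(axis=0)
    ncc = (mean_shape - lo) / (hi - lo)

    img = np.zeros((height, width, 3), dtype=np.float32)
    xs = np.clip(np.round(proj_xy[:, 0]).astype(int), 0, width - 1)
    ys = np.clip(np.round(proj_xy[:, 1]).astype(int), 0, height - 1)
    img[ys, xs] = ncc        # splat vertex colors (no occlusion handling)
    return img
```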