The paper will be released in August. I don't think it is easy to implement from scratch without a bunch of pretrained estimator models, which don't exist in free access. Also, I think the minimum GPU VRAM requirement for such a result is 16 GB.
The paper has been released: https://arxiv.org/pdf/1805.11714.pdf
A GeForce GTX Titan Xp (12 GB VRAM) was used.
@Apollo122 do you understand the implementation?
Haven't had a chance to study it yet, but it looks like the network has 8 downsample and 8 upsample modules. What I like most is that it doesn't need a large number of frames to train. If this is true, then color correction is no longer an issue. Also, 256x256 took 10 hours and 512x512 took 42 hours to train. From the paper:
Typically, two thousand video frames, i.e., about one minute of video footage, are sufficient to train our network (see Section 7). ... We train our networks using the TensorFlow [Abadi et al. 2015] deep learning framework. The gradients for back-propagation are obtained using Adam [Kingma and Ba 2015]. We train for 31,000 iterations with a batch size of 16 (approx. 250 epochs for a training corpus of 2000 frames) using a base learning rate of 0.0002 and first momentum of 0.5; all other parameters have their default value. We train our networks from scratch, and initialize the weights based on a Normal distribution N(0, 0.2).
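For reference, here is a minimal tf.keras sketch of those quoted training settings, assuming a pix2pix-style encoder-decoder; the 8-down/8-up layout is my own reading of their figures, not the authors' code:

```python
import tensorflow as tf
from tensorflow.keras import Model, initializers, layers, optimizers

# Init and optimizer settings are quoted from the paper; the rest is a guess.
init = initializers.RandomNormal(mean=0.0, stddev=0.2)   # weights ~ N(0, 0.2)

def build_translation_net(size=256, base_ch=64):
    inp = layers.Input((size, size, 3))
    x, skips, ch = inp, [], base_ch
    for _ in range(8):                                   # 8 downsample modules
        x = layers.Conv2D(min(ch, 512), 4, strides=2, padding="same",
                          kernel_initializer=init)(x)
        x = layers.LeakyReLU(0.2)(x)
        skips.append(x)
        ch *= 2
    for skip in reversed(skips[:-1]):                    # 7 ups with skip links...
        x = layers.Conv2DTranspose(skip.shape[-1], 4, strides=2,
                                   padding="same", kernel_initializer=init)(x)
        x = layers.ReLU()(x)
        x = layers.Concatenate()([x, skip])
    return Model(inp, layers.Conv2DTranspose(            # ...plus a final up = 8
        3, 4, strides=2, padding="same", activation="tanh",
        kernel_initializer=init)(x))

model = build_translation_net(256)
opt = optimizers.Adam(learning_rate=2e-4, beta_1=0.5)    # lr 0.0002, beta1 0.5
# 31,000 iterations at batch size 16 ~= 250 epochs over 2,000 frames
```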
Not from scratch, though:
First, we track the source and target actor using a state-of-the-art monocular face reconstruction approach that uses a parametric face and illumination model
Parametric Face Representation. We represent the space of facial identity based on a parametric head model [Blanz and Vetter 1999], and the space of facial expressions via an affine model.
Diffuse skin reflectance is modeled similarly by a second affine model r ∈ R^{3N} that stacks the diffuse per-vertex albedo.
The geometry basis {b^geo_k}_{k=1}^{N_α} has been computed by applying principal component analysis (PCA) to 200 high-quality face scans [Blanz and Vetter 1999].
The reflectance basis {b^ref_k}_{k=1}^{N_β} has been obtained in the same manner.
For dimensionality reduction, the expression basis {b^exp_k}_{k=1}^{N_δ} has been computed using PCA, starting from the blendshapes of Alexander et al. [2010] and Cao et al. [2014b].
Their blendshapes have been transferred to the topology of Blanz and Vetter [1999] using deformation transfer [Sumner and Popović 2004].
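In other words, the whole face model in that passage is just two affine combinations over stacked per-vertex vectors. A toy NumPy sketch of that structure (the dimensions and random bases here are placeholders, not the actual scan/blendshape data from the paper):

```python
import numpy as np

# Toy dimensions; the paper builds its bases via PCA on 200 face scans and
# transferred blendshapes -- none of that data is reproduced here.
N = 1000                                     # number of mesh vertices
N_alpha, N_beta, N_delta = 80, 80, 64        # basis sizes (placeholders)

a_geo = np.random.randn(3 * N)               # mean geometry (stacked x,y,z)
a_ref = np.random.randn(3 * N)               # mean diffuse albedo (stacked r,g,b)
B_geo = np.random.randn(3 * N, N_alpha)      # geometry PCA basis
B_ref = np.random.randn(3 * N, N_beta)       # reflectance PCA basis
B_exp = np.random.randn(3 * N, N_delta)      # expression PCA basis

def face_model(alpha, beta, delta):
    """Affine 3DMM: geometry and reflectance as linear combos of the bases."""
    v = a_geo + B_geo @ alpha + B_exp @ delta   # per-vertex positions, in R^{3N}
    r = a_ref + B_ref @ beta                    # per-vertex albedo,    in R^{3N}
    return v, r

v, r = face_model(np.zeros(N_alpha), np.zeros(N_beta), np.zeros(N_delta))
```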
Looks like it requires at least four estimator models. So this paper gives us nothing we can use directly.
No wonder 8 people from different countries worked on this!
A NN cannot recognize head pose, facial expressions, diffuse skin, illumination, etc. from scratch, without a human pointing it at such parameters. So IMHO, estimator models are required.
The best I got from a non-GAN model: https://coub.com/view/1954x3
wow @shaoanlu
This is not an issue, and it's also unrelated to the faceswap-GAN project.
As I understand it, it needs some high-quality 3DMM fit as input (maybe https://github.com/cleardusk/3DDFA can be used) as a coarse approximation.
Their correspondence image looks like PNCC:
https://raw.githubusercontent.com/cleardusk/3DDFA/master/samples/demo_pncc_paf.jpg
Also, a segmentation model is needed for the eyes (the eye and gaze map).
As I understand it, the result will be highly dependent on the quality of the 3DMM fit.
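For what it's worth, a PNCC-style correspondence image like the 3DDFA demo above can be approximated by colouring each vertex with its normalized mean-shape coordinate and splatting the projected vertices. A rough NumPy sketch (`mean_shape` and `projected` are placeholders for whatever the 3DMM fitter returns; a real renderer would rasterize triangles with a z-buffer):

```python
import numpy as np

def ncc_colors(mean_shape):
    """Normalized Coordinate Code: scale each axis of the mean shape to [0, 1].

    mean_shape: (3, N) array of mean-shape vertex coordinates (placeholder).
    Returns a (3, N) per-vertex RGB colouring.
    """
    mn = mean_shape.min(axis=1, keepdims=True)
    mx = mean_shape.max(axis=1, keepdims=True)
    return (mean_shape - mn) / (mx - mn)

def splat_pncc(projected, colors, h=256, w=256):
    """Nearest-pixel splat of coloured vertices (no z-buffer, just a sketch).

    projected: (2, N) image-space vertex positions from the 3DMM fit.
    """
    img = np.zeros((h, w, 3))
    xs = np.clip(projected[0].round().astype(int), 0, w - 1)
    ys = np.clip(projected[1].round().astype(int), 0, h - 1)
    img[ys, xs] = colors.T
    return img
```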
I was wondering what your opinion is about Deep Video Portraits?