shaoanlu / faceswap-GAN

A denoising autoencoder + adversarial losses and attention mechanisms for face swapping.

Universal face encoder, more resilient decoders? #32

Open gradient-dissenter opened 6 years ago

gradient-dissenter commented 6 years ago

First: thanks for this useful tool! I'm in the process of learning ML, but reading through this project has helped greatly.

Please correct me if I'm wrong, but I see that each encoder model is intended to be specific to a pair of faces. In my experience a given encoder adapts very rapidly to the addition of a new face set. And decoders can be reused for a given face, since they are only ever trained on data that is already known.
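
For anyone skimming, here is a minimal sketch of the shared-encoder / per-identity-decoder setup I'm referring to. This is my rough reading of the architecture, not the repo's exact layers; sizes and layer counts are purely illustrative:

```python
from tensorflow.keras import layers, Model

def make_encoder(img_size=64, latent_dim=512):
    inp = layers.Input((img_size, img_size, 3))
    x = inp
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(latent_dim)(x)                      # shared face embedding
    x = layers.Dense(4 * 4 * 512, activation="relu")(x)
    x = layers.Reshape((4, 4, 512))(x)
    return Model(inp, x, name="encoder")

def make_decoder(name):
    inp = layers.Input((4, 4, 512))
    x = inp
    for filters in (256, 128, 64, 32):
        x = layers.UpSampling2D()(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)
    return Model(inp, out, name="decoder_" + name)

encoder = make_encoder()
decoder_A, decoder_B = make_decoder("A"), make_decoder("B")
# Train encoder -> decoder_A on A's faces and encoder -> decoder_B on B's faces;
# swapping A -> B at inference time is decoder_B(encoder(face_A)).
```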

Is there any effort to make a universal face encoder, a very large, well-trained model that could be shared publicly? This could cut down on training time and produce arbitrary A->B swaps where the generator for B is already trained. Perhaps this would require more layers or training time, but would it be possible to leverage features extracted from the layers of existing, frozen image-classification networks, or from networks specifically trained to detect facial orientation or expressions?

One example I'm thinking of is MSG-Net (https://github.com/zhanghang1989/MSG-Net), which extracts VGG-16 based features to train a model for a large set of artistic styles, but also includes a separate 'inspiration' layer for a given style.
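
Very roughly, the frozen-backbone idea might look something like this (assuming tf.keras and VGG16 purely for illustration; the tapped layer and the small trainable head are guesses on my part, not anything from this repo):

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

backbone = VGG16(include_top=False, input_shape=(128, 128, 3))
backbone.trainable = False                               # keep ImageNet features frozen

feat = backbone.get_layer("block3_conv3").output         # mid-level features, less abstract
x = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(feat)
x = layers.Flatten()(x)
latent = layers.Dense(512, name="face_embedding")(x)     # small trainable head on top
shared_encoder = Model(backbone.input, latent)
```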

As for the decoders: with current techniques I understand that the interpretation of the abstract face vector, and the resulting transformations for a particular face set, must be baked into the weights of a decoder model. I've seen some quirky decoder behavior when faces A and B are very different, though. In that case, is it possible to tweak the parameters related to the distortion / warping of the training images? While generating the training data for face A, it might also be interesting if there were some way to automatically swap in eyes, mouths, etc. (using opencv, face_recognition), drawing from a large training set of these features from other faces, so the decoder for face A isn't just practicing with variations of A's eyes, for instance.
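
By "parameters related to the distortion / warping" I mean knobs along these lines (a hypothetical helper to show the idea, not the repo's actual augmentation code):

```python
import cv2
import numpy as np

def augment_face(face, rotation_range=10, zoom_range=0.05, shift_range=0.05):
    """Random rotation / zoom / shift; shrinking these ranges is the kind of
    tweak that might help when faces A and B are very different."""
    h, w = face.shape[:2]
    angle = np.random.uniform(-rotation_range, rotation_range)
    scale = np.random.uniform(1 - zoom_range, 1 + zoom_range)
    tx = np.random.uniform(-shift_range, shift_range) * w
    ty = np.random.uniform(-shift_range, shift_range) * h
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += (tx, ty)
    return cv2.warpAffine(face, M, (w, h), borderMode=cv2.BORDER_REPLICATE)
```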

Maybe it's not helpful to throw ideas out without concrete action toward implementation, but please let us know if there are any helpful experiments we can run.

shaoanlu commented 6 years ago

I've tried VGGFace ResNet50 as the encoder, and convergence is faster. But I also found that features extracted from high-level layers are too abstract, e.g., not aware of nuanced facial expressions (smiling with the mouth open vs. closed).
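
A rough sketch of what I mean, tapping an intermediate layer of VGGFace ResNet50 instead of the final, too-abstract features (this assumes the keras_vggface package; the layer name below is just a placeholder, not what I actually used):

```python
from keras.models import Model
from keras_vggface.vggface import VGGFace

base = VGGFace(model='resnet50', include_top=False, input_shape=(224, 224, 3))
# Earlier stages keep more spatial/expression detail (e.g. mouth open vs. closed)
# than the final pooled features; pick the stage by inspecting base.summary().
mid_layer_name = 'activation_22'          # placeholder name, choose from base.summary()
mid_features = Model(base.input, base.get_layer(mid_layer_name).output)
```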

For partial face swapping, I have some preliminary results below:

However, the most difficult part lies in finding a proper face B image that has similar attributes to a given face A, so that the swapped eyes/mouth look natural. I used the encoder's embedding as the KNN input to find the most similar face B image, but the output quality is not satisfying.
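
Roughly, the kind of KNN-over-embeddings retrieval I mean looks like this (a sketch with sklearn; `encoder`, `faces_B`, and `face_A` are placeholders for an embedding model and aligned face crops, not repo code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Index all candidate face B crops by their encoder embeddings.
emb_B = np.stack([encoder.predict(img[None])[0].ravel() for img in faces_B])
nn = NearestNeighbors(n_neighbors=1).fit(emb_B)

# Query with a face A crop and take the closest face B image.
query = encoder.predict(face_A[None])[0].ravel()
_, idx = nn.kneighbors(query[None])
best_match_B = faces_B[idx[0, 0]]
```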

shaoanlu commented 6 years ago

I've completed further experiments on partial face swapping as a data augmentation. Unfortunately, the model also learns to generate artifacts, e.g., sharp edges around the eyes/nose and weirdly warped faces. These artifacts are similar to those that appear in imperfect augmented images (caused by false landmarks and bad perspective warping).

gradient-dissenter commented 6 years ago

Interesting, thanks for sharing these results, which help develop our intuition. It makes sense that VGGNet on its own doesn't extract an ideal set of features for generating new faces, in contrast with your encoder, which is specifically trained to do just that. Now I'm wondering whether a pretrained head-orientation estimation network might be helpful as (1) an additional input to the trainable encoder or (2) something concatenated with the trainable encoder's output. Could this cut down on the convergence time or help 'nudge' the training in the right direction?
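
To make (2) concrete, something along these lines (a pure sketch; `encoder` is the trainable encoder, `pose_net` is a hypothetical frozen yaw/pitch/roll regressor, and all shapes are illustrative):

```python
from tensorflow.keras import layers, Model

img_in = layers.Input((64, 64, 3))
latent = encoder(img_in)                         # trainable encoder output, e.g. (4, 4, 512)
pose = pose_net(img_in)                          # frozen, hypothetical (yaw, pitch, roll) output
pose_map = layers.Dense(4 * 4 * 8)(pose)
pose_map = layers.Reshape((4, 4, 8))(pose_map)
conditioned = layers.Concatenate()([latent, pose_map])   # what the decoders would consume
conditioned_encoder = Model(img_in, conditioned)
```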

Also, to overcome the 128x128 resolution limit and blurry close-ups, have you considered or tried existing super-resolution networks? Maybe it would hurt performance, but I bet a super-resolution network trained specifically on the datasets (e.g. 128x128 => 512x512) would yield good results, since the training images wouldn't differ much from the GAN output. https://github.com/tetrachrome/subpixel / https://github.com/alexjc/neural-enhance

shaoanlu commented 6 years ago

I've tried adding landmarks as a 4th input channel, but it deteriorated the occlusion masking (results and descriptions can be found here).
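
For concreteness, the kind of 4th-channel input I mean looks roughly like this (a sketch using the face_recognition package; the heatmap construction here is illustrative, not my actual code):

```python
import numpy as np
import face_recognition

def add_landmark_channel(face_rgb):
    """Append a sparse landmark heatmap as a 4th channel (H x W x 4)."""
    h, w = face_rgb.shape[:2]
    heatmap = np.zeros((h, w, 1), dtype=face_rgb.dtype)
    for landmarks in face_recognition.face_landmarks(face_rgb):
        for points in landmarks.values():
            for x, y in points:
                if 0 <= x < w and 0 <= y < h:
                    heatmap[y, x, 0] = 255      # mark each detected landmark pixel
    return np.concatenate([face_rgb, heatmap], axis=-1)
```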

I haven't followed super-resolution (SR) much, so I'm not sure if it is viable for refining the output face. But I think a transfer-learning approach (using CelebA pre-trained SR models for refinement) is worth a try. On the other hand, I'm also worried about the memory usage of SR networks, i.e., I don't know if we can run the GAN model and an SR model at the same time without hitting an OOM error. Splitting the video-making pipeline into two stages seems too cumbersome to me.
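
For completeness, the two-stage route might look like the sketch below, loading the models sequentially so they never share GPU memory (the helper names and overall layout are hypothetical, just to show the shape of the pipeline):

```python
import gc
from keras import backend as K

swap_model = load_swap_model()                        # stage 1: face swap at 128x128
swapped = [swap_model.predict(f[None])[0] for f in input_faces]

del swap_model
K.clear_session()                                     # free GPU memory before stage 2
gc.collect()

sr_model = load_sr_model()                            # stage 2: 128x128 -> 512x512 refinement
refined = [sr_model.predict(s[None])[0] for s in swapped]
```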