shaoanlu / faceswap-GAN

A denoising autoencoder + adversarial losses and attention mechanisms for face swapping.

Are these GAN masks normal in FaceSwap? #50

Closed: iperov closed this issue 1 year ago

iperov commented 6 years ago

Dots and an oversharpened image: is this normal, or did the faceswap developers adapt your plugin with errors?

[screenshot: 36072789-8173ea42-0f3f-11e8-8bc2-b2b8d8e9803b]

shaoanlu commented 6 years ago

Jesus man, these masks are awesome. They are like dot art. Please tell me how you got this, seriously. What's the training data? How many iters. did you train for? I would really like to reproduce the result.

In terms of faceswap, I did not look into the details of the recent GAN update, but I believe it's the same as my implementation.

I have not seen such masks before, but I guess it might have something to do with a lack of diversity as well as insufficient training data. However, the sharpness can be reduced to a certain extent by a GaussianBlur during video making, so it's not a big deal in my opinion.
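For reference, a minimal sketch of that blurring step, assuming OpenCV; the kernel size and sigma below are illustrative choices, not the values used in the notebooks:

```python
import cv2
import numpy as np

def soften_mask(mask, ksize=7, sigma=3):
    # Blur the predicted mask so its sharp edges blend smoothly when the
    # swapped face is composited back into the frame; ksize must be odd.
    return cv2.GaussianBlur(mask, (ksize, ksize), sigma)

# toy usage with a random placeholder mask
mask = (np.random.rand(64, 64) > 0.5).astype(np.float32)
soft_mask = soften_mask(mask)
```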

iperov commented 6 years ago

I just built https://github.com/deepfakes/faceswap from the latest release plus pull request 155.

I built a Windows version with cmd batch files for myself; you can download it from here (test footage as in the picture is included in the torrent): https://github.com/deepfakes/faceswap-playground/issues/39

shaoanlu commented 6 years ago

I've downloaded the zip file you mentioned above through the torrent. Looking through it, I only found 2 mp4 videos (to be extracted as training data?), each < 30 sec. So I think such awesome masks are caused by too little training data. The model is heavily overfit, such that one of the preview mask columns is purely white (which means the model believes it can reconstruct the input image with 100% accuracy).

iperov commented 6 years ago

Are 652 photos not enough for a GAN? Then how many?

iperov commented 6 years ago

Actually, the white column is not purely white: [gif: 2018-02-11_17-01-54]

shaoanlu commented 6 years ago

LOL, love this gif. It shows that what we see is actually not what we see.

Do these 652 images come from the 2 videos (data_dst.mp4 and data_src.mp4)? If yes, then it's definitely not enough, since the extracted faces will all look the same (they are under the same lighting conditions).

In my experiments, it's better to have more than 1k images from various video sources (at least 3), and to extract at fps < 5 (i.e., extract < 5 faces/s from the video) so that there are not too many duplicate images.

However, you can still try a technique called transfer learning even if you have little training data. To be specific, we first train our model on other face datasets (e.g., Emma Watson, Donald Trump, CelebA, etc., whatever you can find). Once the pre-trained model shows good results (say, after 20k iters), we then train it on these 652 images.
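A minimal sketch of that two-stage workflow, assuming a Keras-style model with save_weights/load_weights; the toy autoencoder, file names, and random arrays below are placeholders, not the actual faceswap-GAN training code:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(size=64):
    # Toy stand-in for the real encoder/decoder, just to show the workflow.
    inp = layers.Input((size, size, 3))
    x = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid")(x)
    return keras.Model(inp, out)

model = build_autoencoder()
model.compile(optimizer=keras.optimizers.Adam(1e-4), loss="mae")

# Stage 1: pre-train on a large generic face dataset (random data as a placeholder).
generic_faces = np.random.rand(512, 64, 64, 3).astype("float32")
model.fit(generic_faces, generic_faces, epochs=1, batch_size=32)
model.save_weights("pretrained.h5")

# Stage 2: fine-tune on the small target set (e.g. the 652 extracted faces),
# typically with a lower learning rate to avoid forgetting the pre-trained features.
target_faces = np.random.rand(128, 64, 64, 3).astype("float32")
model.load_weights("pretrained.h5")
model.compile(optimizer=keras.optimizers.Adam(1e-5), loss="mae")
model.fit(target_faces, target_faces, epochs=1, batch_size=16)
```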

But really, have you tried making a video with your current model? I imagine it produces faces with a lot of jitter. But considering how much facial detail it generates in the previews, it might produce interesting results in video.

iperov commented 6 years ago

data_src.mp4 - 654 images

So you mean the GAN model does not like a lot of similar images?

shaoanlu commented 6 years ago

Ahh, that makes sense. A 27-second-long video at an average of 25 fps: 27 x 25 is about 654. You probably want to add more images from other sources.

I found extraction at < 3 fps to be more effective. The information these 654 images provide to the model is probably not much different from fewer than 100 images.
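For illustration, a minimal low-fps extraction sketch assuming OpenCV; the paths and target fps are placeholders, and face detection/alignment is assumed to happen in a separate step:

```python
import cv2

def extract_frames(video_path, out_dir, target_fps=3):
    # Sample frames at roughly target_fps to avoid near-duplicate training images.
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(src_fps / target_fps)), 1)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# e.g. extract_frames("data_src.mp4", "extracted_src", target_fps=3)
```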

iperov commented 6 years ago

Result at the current stage: [gif: 2018-02-11_17-30-07]

shaoanlu commented 6 years ago

I think the result can be much better if the face bounding box is more stable. As we can see, the bbox size and position in the gif above are not smooth, which results in severe jitter. We can smooth the bbox position by taking a moving average over the previous frames (see the sketch below). However, as far as I know there is no such functionality in faceswap (or maybe it's still a WIP).

What I want to say is: don't be discouraged by the current result. It can be improved if the proper tricks are introduced.
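A minimal sketch of that moving-average bbox smoothing; the window size is an illustrative choice:

```python
from collections import deque
import numpy as np

class BBoxSmoother:
    """Smooth face bounding boxes with a moving average over the last N frames
    to reduce jitter in the output video."""
    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, bbox):
        # bbox = (x, y, w, h) detected in the current frame
        self.history.append(np.asarray(bbox, dtype=np.float32))
        return np.mean(self.history, axis=0)

smoother = BBoxSmoother(window=5)
# per frame: smoothed_bbox = smoother.update(detected_bbox)
print(smoother.update((100, 80, 60, 60)))
print(smoother.update((104, 78, 62, 61)))
```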

iperov commented 6 years ago

So there is no point in training the GAN model in https://github.com/deepfakes/faceswap?

shaoanlu commented 6 years ago

The GAN (and non-GAN as well) will work if we have enough training data.

I'll try your dataset and do some experiments tomorrow; hopefully I can make a breakthrough.

iperov commented 6 years ago

The non-GAN model works much better with the same footage.

[gif: 2018-02-11_18-02-54]

So maybe deepfakes/faceswap did something wrong when porting your model?

shaoanlu commented 6 years ago

I don't think there is much difference between my implementation and faceswap's. Perhaps the GAN is just not as data-efficient as the non-GAN model.

dfaker commented 6 years ago

@shaoanlu I think the GAN + perceptual loss is very promising, but I've never been convinced that the masking doesn't degrade down to edge detection.

Have you considered training against a mask derived from the convex hull of the landmarks, with the mask scaled down or multiplied by the rgb-face-loss*sigmoid, to allow the network to reduce the mask only if it's beneficial to face reproduction?
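For reference, deriving such a ground-truth mask from landmarks might look like this minimal sketch, assuming OpenCV; the landmark points below are made up:

```python
import cv2
import numpy as np

def convex_hull_mask(landmarks, image_shape):
    # Fill the convex hull of the (N, 2) landmark points to get a binary face mask.
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(np.asarray(landmarks, dtype=np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    return mask

# toy example with a few made-up landmark points on a 96x96 crop
pts = np.array([[20, 30], [60, 25], [70, 60], [40, 80], [15, 55]])
mask = convex_hull_mask(pts, (96, 96))
```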

shaoanlu commented 6 years ago

The landmarks will be detected whether there is occlusion on the face or not. My understanding is that if we use a convex-hull mask as the supervised ground truth, the model will overfit so that it can no longer handle occlusion. In contrast, I believe the mask is able to learn more or less semantic segmentation, as shown in LR-GAN. But we have to find a better architecture and loss function for masking in the face swapping task.

dfaker commented 6 years ago

The landmarks will be detected whether there is occlusion on the face or not.

Yeah, hence the need to give the network a 'loophole' by biasing the loss using rgb-face-loss*sigmoid: if the face is being generated well, the loss will be near zero and we won't care about adjustments to the masking layer.

If the generation is poor, however, the rgb-face-loss*sigmoid is high and we do rigidly enforce the ground truth mask.

shaoanlu commented 6 years ago

Can you elaborate more on the rgb-face-loss*sigmoid?

dfaker commented 6 years ago

So we have the rgb-face-loss, which represents purely the accuracy of the reproduction of the unmasked face.

The output of that is simply passed through some mapping function, perhaps a sigmoid, perhaps something harder, to force high losses towards 0 and low losses towards 1. Then it becomes suitable for multiplication with the mask loss: enforcing the full loss against the ground truth mask when the face is accurate, but allowing the mask to take on any value when the face is inaccurate.

If an occlusion appears, it returns a high rgb-face-loss, since an unexpected feature appears in the face. This is forced low by the rgb-face-loss*sigmoid, which, when multiplied with the mask loss, biases it towards zero and allows the mask output to move away from the ground-truth mask.

For normal areas of the face without occlusion, the rgb-face-loss is low, which is forced high, passing the full loss back to the mask.
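A minimal sketch of that weighting, written in TensorFlow; the per-pixel L1 terms and the k and threshold values are illustrative choices, not values from either repo:

```python
import tensorflow as tf

def weighted_mask_loss(face_true, face_pred, mask_true, mask_pred,
                       k=20.0, threshold=0.05):
    # Per-pixel rgb reconstruction loss of the unmasked face.
    rgb_loss = tf.reduce_mean(tf.abs(face_true - face_pred), axis=-1, keepdims=True)
    # Sigmoid mapping: low rgb loss -> weight near 1 (enforce the hull mask),
    # high rgb loss (e.g. occlusion) -> weight near 0 (mask may deviate).
    weight = tf.sigmoid(k * (threshold - rgb_loss))
    mask_l1 = tf.abs(mask_true - mask_pred)
    return tf.reduce_mean(weight * mask_l1)

# toy usage with random tensors (batch of 2, 64x64 crops)
f_t = tf.random.uniform((2, 64, 64, 3)); f_p = tf.random.uniform((2, 64, 64, 3))
m_t = tf.random.uniform((2, 64, 64, 1)); m_p = tf.random.uniform((2, 64, 64, 1))
print(weighted_mask_loss(f_t, f_p, m_t, m_p))
```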

shaoanlu commented 6 years ago

I think this is viable. It would be good if someone could run some experiments as a proof of concept.

shaoanlu commented 6 years ago

Here are the results I got after training for ~10k iters (4 hrs on a K80) with perceptual loss. The masked faces are fairly good. So what I've learned is that, as long as perceptual loss is introduced during training, we probably don't have to prepare as much training data as I thought.
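For readers unfamiliar with the term, a perceptual loss compares real and generated faces in the feature space of a pretrained network rather than in pixel space. A minimal sketch is below; VGG16/ImageNet features and the chosen layers are only stand-ins (the actual implementation may use different features), and input preprocessing is omitted for brevity:

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

# Frozen feature extractor used to define the perceptual loss.
vgg = VGG16(include_top=False, weights="imagenet", input_shape=(112, 112, 3))
feat_layers = ["block2_conv2", "block3_conv3"]
feature_model = tf.keras.Model(
    vgg.input, [vgg.get_layer(name).output for name in feat_layers])
feature_model.trainable = False

def perceptual_loss(real, fake):
    # L1 distance between feature maps of real and generated faces.
    real_feats = feature_model(real)
    fake_feats = feature_model(fake)
    return tf.add_n([tf.reduce_mean(tf.abs(r - f))
                     for r, f in zip(real_feats, fake_feats)])

# toy usage with random tensors
real = tf.random.uniform((2, 112, 112, 3))
fake = tf.random.uniform((2, 112, 112, 3))
print(perceptual_loss(real, fake))
```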

[previews: 0218_masked_pl, 0218_mask_pl]

However, the generated mask cannot deal with hard samples, e.g., the faces with motion blur at the bottom left of the figures above. As a result, the output video quality on non-stationary faces is sub-optimal.

[gif: shia_downey]

We can see there is one brief moment where it fails to transform the face into Downey. On the other hand, the mouth is preserved, so we can still tell that he is shouting "just do it!".

[gif: shia_downey_ccr]

But it is somehow eliminated after color correction. (Update: it's not eliminated, rather it just becomes less discernible.)

iperov commented 6 years ago

How can I get such masks in faceswap?

shaoanlu commented 6 years ago

Here is another set of results, trained for 15k iters without perceptual loss. The model is not exactly the same as the v2 model, but the modifications should theoretically have little impact on output quality.

[previews: just_do_it_masked_wopl, just_do_it_mask_wopl]

In light of my recent experiments, I think the sharp masks shown in your post are due to training for too long, so that the model is heavily overfit.

Jack29913 commented 6 years ago

Is perceptual loss implemented in faceswap?

iperov commented 6 years ago

no

MisterGenerosity commented 6 years ago

Hey shaoanlu, I noticed you mentioned color correction above

But it is somehow eliminated after color correction.

What do you do to produce the color correction? Meaning, what software or, perhaps, what code? As I'm getting into this, I'm finding my subject faces have very different skin tones, and I'm concerned it'll look like rubbish when I'm done.

shaoanlu commented 6 years ago

Color correction is implemented through histogram matching. You can find the relevant code in the v2_test_video notebooks.
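For illustration, a generic per-channel histogram matching sketch in NumPy; the notebook's exact implementation may differ:

```python
import numpy as np

def hist_match_channel(source, template):
    # Map the intensities of `source` so its histogram matches `template`.
    src_shape = source.shape
    source, template = source.ravel(), template.ravel()
    s_values, s_idx, s_counts = np.unique(source, return_inverse=True, return_counts=True)
    t_values, t_counts = np.unique(template, return_counts=True)
    s_quantiles = np.cumsum(s_counts).astype(np.float64) / source.size
    t_quantiles = np.cumsum(t_counts).astype(np.float64) / template.size
    interp_values = np.interp(s_quantiles, t_quantiles, t_values)
    return interp_values[s_idx].reshape(src_shape)

def color_correct(fake_face, target_face):
    # Match each color channel of the generated face to the target frame.
    return np.stack([hist_match_channel(fake_face[..., c], target_face[..., c])
                     for c in range(3)], axis=-1)

# toy usage with random images
fake = np.random.randint(0, 255, (64, 64, 3)).astype(np.float32)
target = np.random.randint(0, 255, (64, 64, 3)).astype(np.float32)
corrected = color_correct(fake, target)
```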

The reason I did not include perceptual loss in faceswap is that it requires Keras > 2.1.1 and more VRAM. I worried that it would lead to lots of issue posts on the repo, so I decided to leave it out.

iperov commented 6 years ago

Is histogram matching implemented in faceswap?

shaoanlu commented 6 years ago

Probably not; I added it recently. However, I have not followed faceswap updates closely, so maybe some similar work has been done.

Jack29913 commented 6 years ago

Nope. Repo is pretty idle now.

mrgloom commented 5 years ago

About checkerboard artifacts https://distill.pub/2016/deconv-checkerboard/