Jesus man, these masks are awesome. They are like dot art. Please tell me how you got them, seriously. What's the training data? How many iterations did you train for? I would really like to reproduce the result.
As for faceswap, I did not look into the details of the recent GAN update, but I believe it's the same as my implementation.
I have not seen such masks before, but I guess it might have something to do with a lack of diversity as well as not enough training data. However, the sharpness can be reduced to a certain level by applying GaussianBlur during video making, so that's not a big deal in my opinion.
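For what it's worth, a minimal compositing sketch of what I mean (not the actual merging code): soften the hard mask edges with a Gaussian blur before alpha-blending the generated face into the target frame. File names and kernel size are placeholders.

```python
import cv2
import numpy as np

# The face, frame and mask are assumed to have the same resolution.
face = cv2.imread("generated_face.png").astype(np.float32)
frame = cv2.imread("target_frame.png").astype(np.float32)
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0

# Blur the binary mask so its edges fade out instead of cutting sharply.
soft_mask = cv2.GaussianBlur(mask, (15, 15), 0)[..., np.newaxis]

# Alpha-blend the generated face into the frame using the softened mask.
blended = soft_mask * face + (1.0 - soft_mask) * frame
cv2.imwrite("blended_frame.png", blended.astype(np.uint8))
```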
I just built the latest release of https://github.com/deepfakes/faceswap plus pull request 155.
I built a Windows version with cmd batches for myself; you can download it from here. Test footage as in the picture is included in the torrent: https://github.com/deepfakes/faceswap-playground/issues/39
I've downloaded the zip file you mentioned above through the torrent. Looking through it, I only found 2 mp4 videos (to be extracted as training data?), each < 30 sec. So I think such awesome masks are caused by too little training data. The model is heavily overfit, such that one of the preview mask columns is purely white (which means the model believes it can reconstruct the input image with 100% accuracy).
652 photos are not enough for a GAN? Then how many?
Actually the white column is not purely white:
LOL, love this gif. It shows that what we see is actually not what we see.
Do these 652 images come from the 2 videos (data_dst.mp4 and data_src.mp4)? If yes, then it's definitely not enough, since the extracted faces will look the same (they are under the same lighting conditions).
In my experiments, it's better to have more than 1k images from various (at least 3) video sources, and to extract at fewer than 5 fps (i.e., extract < 5 faces/s from the video) so that there will not be too many duplicate images.
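An illustrative helper for this kind of low-fps extraction (this is a sketch, not faceswap's extractor; names and paths are placeholders):

```python
import cv2

# Grab frames from a video at a reduced rate, e.g. 3 fps,
# so the saved faces are not near-duplicates of each other.
def extract_frames(video_path, out_dir, target_fps=3):
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if fps is unreadable
    step = max(int(round(src_fps / target_fps)), 1)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite("{}/frame_{:05d}.png".format(out_dir, saved), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# e.g. extract_frames("data_src.mp4", "extracted_src", target_fps=3)
```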
However, you can still try a technique called transfer learning even if you have little training data. To be specific, we first train our model on other face datasets (e.g., Emma Watson, Donald Trump, celebA, etc., whatever you can find). Once the pre-trained model shows good results (say, after 20k iters), we then train it on these 652 images.
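Roughly speaking, the two-stage training looks like the Keras sketch below. The tiny stand-in autoencoder, the random placeholder arrays, and the weight file names are for illustration only, not the actual faceswap-GAN code.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, Conv2DTranspose

# A minimal stand-in autoencoder; the real faceswap models are much larger.
def build_model():
    model = Sequential([
        Conv2D(32, 5, strides=2, padding="same", activation="relu",
               input_shape=(64, 64, 3)),
        Conv2D(64, 5, strides=2, padding="same", activation="relu"),
        Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu"),
        Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mae")
    return model

model = build_model()

# Stage 1: pre-train on a large, easy-to-collect face dataset
# (celebA, other celebrity faces, etc.); faces are assumed to be
# (N, 64, 64, 3) arrays in [0, 1] loaded elsewhere.
pretrain_faces = np.random.rand(100, 64, 64, 3)  # placeholder data
model.fit(pretrain_faces, pretrain_faces, epochs=20, batch_size=32)
model.save_weights("pretrained.h5")

# Stage 2: fine-tune on the small target set (e.g. the 652 images),
# starting from the pre-trained weights instead of from scratch.
target_faces = np.random.rand(100, 64, 64, 3)  # placeholder data
model.load_weights("pretrained.h5")
model.fit(target_faces, target_faces, epochs=50, batch_size=32)
```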
But really, have you tried to make a video using your current model? I imagine it produces faces with a lot of jitter. But considering how much facial detail it generates in the previews, it might produce interesting results in video.
data_src.mp4 - 654 images
so you mean the GAN model does not like a lot of similar images?
Ahh, that makes sense. A 27-second video at roughly 25 fps: 27 x 25 is about 654. You probably want to add more images from other sources.
I found extraction at < 3 fps to be more effective. The information these 654 images provide to the model is probably not much different from what fewer than 100 images would provide.
Result at the current stage:
I think the result could be much better if the face bounding box were more stable. As we can see, the bbox size and position in the gif above are not smooth, which results in severe jitter. We can smooth the bbox position by taking a moving average over previous frames. However, as far as I know there is no such functionality in faceswap (or maybe it's still a WIP).
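Something as simple as the sketch below would probably help: an exponential moving average over the detected boxes. The smoothing factor is just a guess to be tuned.

```python
# `bboxes` is assumed to be one (x, y, w, h) tuple per frame.
def smooth_bboxes(bboxes, alpha=0.3):
    smoothed, prev = [], None
    for box in bboxes:
        if prev is None:
            prev = box
        else:
            # Blend the current detection with the running average.
            prev = tuple(alpha * c + (1 - alpha) * p for c, p in zip(box, prev))
        smoothed.append(prev)
    return smoothed
```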
What I want to say is: don't be desperate about the current result. It can be improved if the proper tricks are introduced.
so there is no point in training the GAN model in https://github.com/deepfakes/faceswap ?
The GAN (and non-GAN as well) will work if we have enough training data.
I'll try your dataset and do some experiments tomorrow; hope I can make some breakthrough.
non-GAN works much better with the same footage
So maybe deepfakes/faceswap did something wrong when porting your model?
I don't think there is much difference between my implementation and faceswap's. Perhaps the GAN is just not as data-efficient as the non-GAN.
@shaoanlu I think the GAN + perceptual loss is very promising, but I've never been convinced that the masking doesn't degrade down to edge detection.
Have you considered training against a mask derived from the convex hull of the landmarks, with the mask scaled down or multiplied by the rgb-face-loss*sigmoid to allow the network to reduce the mask only if it's beneficial to face reproduction?
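For reference, a hull mask of this kind could be built roughly like this (a sketch, not code from either repo; `landmarks` is assumed to be an (N, 2) array of (x, y) points, e.g. the usual 68-point set):

```python
import cv2
import numpy as np

def hull_mask(landmarks, image_shape):
    # Binary mask covering the convex hull of the face landmarks.
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(landmarks.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    return mask
```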
The landmarks will be detected whether or not there is occlusion on the face. My understanding is that if we use a convex hull mask as the supervised ground truth, the model will overfit so that it cannot handle occlusion anymore. In contrast, I believe the mask is able to learn more or less semantic segmentation, as shown in LR-GAN. But we have to find a better architecture and loss function for masking in the face-swapping task.
The landmarks will be detected whether or not there is occlusion on the face.
Yeah, hence the need to give the network a 'loophole' by biasing the loss using rgb-face-loss*sigmoid: if the face is being generated well, the loss will be near zero and we won't care about adjustments to the masking layer.
If the generation is poor, however, the rgb-face-loss*sigmoid is high and we do rigidly enforce the ground truth mask.
Can you elaborate more on the rgb-face-loss*sigmoid?
So we have the rgb-face-loss which represents purely the accuracy of the reproduction of the unmasked face.
The output of that is simply passed through some mapping function (perhaps a sigmoid, perhaps something harder) to force high losses towards 0 and low losses towards 1. It then becomes suitable for multiplication with the mask loss: enforcing the full loss against the ground truth mask when the face is accurate, but allowing the mask to take on any value when the face is inaccurate.
If an occlusion appears, it produces a high rgb-face-loss, since an unexpected feature appears in the face. This is forced low by the sigmoid mapping, which, when multiplied with the mask loss, biases it towards zero and allows the mask output to move away from the ground-truth mask.
For normal areas of the face without occlusion, the rgb-face-loss is low, which is forced high, passing the full loss back to the mask.
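If I understand the idea correctly, it would look something like the sketch below. The sigmoid steepness `k` and `threshold` are arbitrary guesses, and `rgb_face_loss` is assumed to be a per-sample tensor computed elsewhere; this is not a tested implementation.

```python
import keras.backend as K

# `y_true_mask` is the convex-hull ground truth, `y_pred_mask` is the network's
# mask output, `rgb_face_loss` is the per-sample reconstruction loss of the
# unmasked face.
def biased_mask_loss(y_true_mask, y_pred_mask, rgb_face_loss, k=10.0, threshold=0.1):
    # Low face loss -> weight near 1 (rigidly enforce the hull mask);
    # high face loss -> weight near 0 (let the mask deviate, e.g. around occlusions).
    weight = K.sigmoid(k * (threshold - rgb_face_loss))
    mask_l1 = K.mean(K.abs(y_true_mask - y_pred_mask), axis=[1, 2, 3])
    return weight * mask_l1
```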
I think this is viable. It would be good if someone could run some experiments as a proof of concept.
Here are the results I got after training for ~10k iters (4 hrs on a K80) with perceptual loss. The masked faces are fairly good. So what I've learned is that, as long as perceptual loss is introduced during training, we probably don't have to prepare as much training data as I thought.
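For anyone curious, a generic perceptual loss looks roughly like the sketch below. Using ImageNet VGG16 and the `block3_conv3` layer here is an assumption for illustration; the actual implementation may use a different feature extractor and layers.

```python
from keras.applications.vgg16 import VGG16
from keras.models import Model
import keras.backend as K

# Compare intermediate VGG16 features of the real and generated faces.
vgg = VGG16(include_top=False, weights="imagenet", input_shape=(64, 64, 3))
feat_extractor = Model(vgg.input, vgg.get_layer("block3_conv3").output)
feat_extractor.trainable = False

def perceptual_loss(y_true, y_pred):
    # Images are assumed to already be scaled to the range VGG expects.
    f_true = feat_extractor(y_true)
    f_pred = feat_extractor(y_pred)
    return K.mean(K.abs(f_true - f_pred))
```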
However, the generated mask cannot deal with hard samples, e.g., faces with motion blur at the bottom left of the figures above. As a result, the output video quality on non-stationary faces is sub-optimal.
We can see there is one brief moment where it fails to transform the face into Downey. On the other hand, the mouth is preserved, so we know that he is shouting "just do it!".
But it is somewhat mitigated after color correction. (Update: it's not eliminated, rather it just becomes less discernible.)
How can I get such masks in faceswap?
Yet another result, trained for 15k iters without perceptual loss. The model is not exactly the same as the v2 model, but the modifications should have little impact on output quality, theoretically.
In light of my recent experiments, I think the sharp masks shown in your post are due to training for too long, so that the model is heavily overfit.
Is perceptual loss implemented in faceswap?
no
Hey shaoanlu, I noticed you mentioned color correction above
But it is somehow eliminated after color correction.
What do you do to produce color correction? Meaning, what software or, perhaps, what code? As I'm getting into this I'm finding my subject faces have such different skin tones, and I'm concerned it'll look like rubbish when I'm done.
Color correction is implemented through histogram matching. You can find the piece of code in the v2_test_video notebooks.
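For reference, per-channel histogram matching can be done with plain numpy along these lines (a sketch, not necessarily the exact code in the notebook):

```python
import numpy as np

# `source` and `template` are HxWx3 images; the source is re-toned to match
# the template. The returned array is float and can be clipped/cast back to
# uint8 by the caller.
def hist_match(source, template):
    matched = np.empty(source.shape, dtype=np.float64)
    for c in range(source.shape[2]):
        src = source[..., c].ravel()
        tmpl = template[..., c].ravel()
        s_values, s_idx, s_counts = np.unique(src, return_inverse=True,
                                              return_counts=True)
        t_values, t_counts = np.unique(tmpl, return_counts=True)
        # Build cumulative distributions and map source quantiles onto the
        # template's intensity values.
        s_quantiles = np.cumsum(s_counts).astype(np.float64) / src.size
        t_quantiles = np.cumsum(t_counts).astype(np.float64) / tmpl.size
        interp_values = np.interp(s_quantiles, t_quantiles, t_values)
        matched[..., c] = interp_values[s_idx].reshape(source.shape[:2])
    return matched
```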
The reason I did not include perceptual loss in faceswap is that it requires Keras > 2.1.1 and more VRAM. I worried that it would lead to lots of issue posts on the repo, so I decided to leave it out.
is histogram matching implemented in faceswap?
Probably not, I added it recently. However, I have not followed faceswap's updates closely, so maybe some similar work has been done.
Nope. Repo is pretty idle now.
About checkerboard artifacts https://distill.pub/2016/deconv-checkerboard/
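The common workaround described there is to replace transposed convolutions with resize-then-convolve. A rough Keras sketch (filter counts and kernel size are illustrative only):

```python
from keras.layers import Conv2D, UpSampling2D

# Nearest-neighbour upsampling followed by a regular convolution avoids the
# uneven kernel overlap behind the checkerboard pattern.
def upscale_block(x, filters):
    x = UpSampling2D(size=(2, 2))(x)  # resize first...
    x = Conv2D(filters, kernel_size=3, padding="same", activation="relu")(x)  # ...then convolve
    return x
```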
Dots with an over-sharpened image: is this normal, or did the faceswap developers adapt your plugin with errors?