GeneralShan opened this issue 4 days ago
I am experimenting with SimSwap's training techniques. The discriminator in use is the PatchGAN multi-scale discriminator, and the loss forms are the ones mentioned in the SimSwap paper: mainly ID loss and reconstruction loss, plus a weak feature matching loss computed from the discriminator's features.
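For reference, a minimal sketch of that weak feature matching term, assuming the multi-scale discriminator returns its intermediate feature maps as a list (the function name and layer count here are illustrative, not taken from the SimSwap code):

```python
import torch
import torch.nn.functional as F

def weak_feature_matching_loss(fake_feats, real_feats, num_layers=3):
    # SimSwap's "weak" feature matching: compare only the last few
    # discriminator feature maps, so the generator preserves coarse
    # structure without being forced to copy the target pixel-wise.
    loss = torch.tensor(0.0, device=fake_feats[0].device)
    for fake, real in zip(fake_feats[-num_layers:], real_feats[-num_layers:]):
        loss = loss + F.l1_loss(fake, real.detach())
    return loss
```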
One more big issue: InSwapper uses ArcFace alignment, but SimSwap 512 and SimSwap++ use FFHQ alignment. All of the best datasets are also FFHQ-aligned, so for training a fresh model it is best to use FFHQ alignment.
The dataset can be VGGFace2-HQ from the SimSwap authors.
@somanchiu can you tell me how you determined the architecture of the style block: https://github.com/somanchiu/ReSwapper/blob/main/StyleTransferModel_128.py#L87

This appears in the SimSwap paper as the "Identity Injection Module". How did you determine the architecture here, and how closely does it match the baseline InSwapper architecture in Netron?
@GeneralShan Thanks for reviewing my implementation and making some important PRs. I will review SimSwap as I haven't studied it deeply before.
As for the training code at the moment, you can say it is a "distillation" of InSwapper. If you look at it in the way deep learning works, it's about fitting the data. The current training code fits the outputs of InSwapper. It can also fit some deep fake images made by hand. As you can see, the training takes over 400,000 steps to converge. This means that we need a lot of "expected_output" images to train the model for higher resolution if we use the current training code. However, collecting that many hand-made deep fake images is very hard.
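To make "fits the outputs of InSwapper" concrete, the core of such a distillation step could look like the sketch below; the L1 objective and the generator's (target, latent) call signature are assumptions for illustration, not necessarily what the actual training code does:

```python
import torch.nn.functional as F

def distillation_loss(student, target, source_latent, inswapper_output):
    # The student is trained to reproduce the "expected_output" that
    # InSwapper produced for the same (target, source latent) pair.
    student_output = student(target, source_latent)
    return F.l1_loss(student_output, inswapper_output)
```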
I also found that if I train the model with images at different resolutions (128, 112, 96) at the same time, the generalization ability is enhanced. We can observe this enhanced generalization at output resolutions from 64 x 64 up to 1xx x 1xx. I'm wondering if it's possible to further enhance the generalization ability to higher resolutions like 256 x 256 or 512 x 512.
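One simple way to get this mixed-resolution training is to resize each batch to a randomly chosen size; a sketch under that assumption:

```python
import random
import torch.nn.functional as F

TRAIN_RESOLUTIONS = [96, 112, 128]  # the resolutions mentioned above

def to_random_resolution(batch):
    # Resize a batch of face crops (N, 3, H, W) to a randomly chosen
    # training resolution so a single model learns several scales.
    size = random.choice(TRAIN_RESOLUTIONS)
    return F.interpolate(batch, size=(size, size), mode="bilinear",
                         align_corners=False)
```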
I agree with using a GAN to train the model, because ImageA + ImageA latent = ImageA, so "expected_output" is not necessary.
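That "ImageA + ImageA latent = ImageA" constraint can be written directly as a loss term. A minimal sketch, assuming a generator with the (target, id_latent) interface of StyleTransferModel_128 and an ArcFace embedder:

```python
import torch.nn.functional as F

def self_reconstruction_loss(generator, arcface, image_a):
    # When the identity latent comes from the target itself, the
    # expected output is the input image, so no hand-made
    # "expected_output" is needed for this term.
    latent_a = arcface(image_a)            # identity embedding of ImageA
    output = generator(image_a, latent_a)  # swap ImageA onto itself
    return F.l1_loss(output, image_a)
```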
As for the architecture of the style block and the overall model architecture, I created it through trial and error. If you take a close look at the InSwapper graph, you will see the style block appears multiple times; it is obviously generated by a loop. Exporting the model with opset_version=10 makes it easier to compare the graph in Netron. However, it causes issue #8, which I will note in README.md later. My process:

1. Study the graph of InSwapper in Netron and read the details from onnx.helper.printable_graph(model.graph).
2. Modify the model architecture in StyleTransferModel_128.py.
3. Save the model in ONNX format so that I can compare it with the original InSwapper.
4. If the model does not match the original architecture, go back to step 1.
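The export-and-compare part of steps 1 and 3 might look roughly like this; the class name and the (target, latent) forward signature are assumptions based on StyleTransferModel_128.py:

```python
import torch
import onnx
from StyleTransferModel_128 import StyleTransferModel  # class name assumed

model = StyleTransferModel().eval()
target = torch.randn(1, 3, 128, 128)  # dummy target image
latent = torch.randn(1, 512)          # dummy ArcFace identity latent

# opset_version=10 keeps the exported graph closer to the original
# InSwapper ONNX file, which makes side-by-side comparison in Netron
# easier (but triggers issue #8, as noted above).
torch.onnx.export(model, (target, latent), "reswapper_128.onnx",
                  opset_version=10)

exported = onnx.load("reswapper_128.onnx")
print(onnx.helper.printable_graph(exported.graph))  # textual dump for diffing
```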
@GeneralShan I have compared the model architectures of InSwapper and SimSwap. They are extremely similar. I’m going to start experimenting with changing the model architecture from SimSwap to InSwapper in the SimSwap repository.
Good, I am performing a similar experiment: I will train the SimSwap discriminator against the ReSwapper architecture. I have read the papers for both SimSwap and SimSwap++.

Let us share results in this thread.
@somanchiu Thank you for the detailed explanation of your process!
I agree that GAN training is the necessary direction.

One difficulty I have already seen: I observe that ID retrieval is lower when the alignment differs.
I think it would be good to train on VGGFace2-HQ, which is a high-quality dataset. There is also the EFHQ dataset, which focuses on pose variability: it includes side profiles and other face poses that are not covered in FFHQ.

I think the best dataset would be VGGFace2-HQ combined with EFHQ, augmented with similar methods.
One more thing I see in Netron: we all know InSwapper works much better than SimSwap, most likely because it has about 2x the model parameters and a larger feature size in the generator bottleneck. Also, SimSwap uses a larger image size, which hurts quality; InSwapper performs very well at 128.
Also, if you want, we can discuss the SimSwap paper and architecture; I would be happy to. It is very interesting and relevant.

Please also look into SimSwap++, and note that it has a new lower-GFLOP operator, but also some changes in layer count and model architecture.

We should select the best architecture.

I think a good direction would be to train a small SimSwap-sized model at 128px (with about 0.5x the parameter count of InSwapper).

If that training is stable and promising, we can train an InSwapper-sized model and larger.
@somanchiu

> This means that we need a lot of "expected_output" images to train the model for higher resolution if we use the current training code. However, collecting that many hand-made deep fake images is very hard.

If you mean a dataset of hand-made deepfakes or SD-generated images, then this might help: https://github.com/SelfishGene/SFHQ-dataset.

From the Privacy section of its README: since all images in this dataset are synthetically generated, there are no privacy or license issues surrounding them.
> If you mean a dataset of hand-made deepfakes or SD-generated images, then this might help: https://github.com/SelfishGene/SFHQ-dataset.
In the current training code, not only is "expected_output" needed; we also need the target and the source images. "expected_output" is created from target + source.
I looked at SFHQ; the quality seems very poor compared to VGGFace2-HQ and FFHQ. If you read the methodology of SimSwap, there is no need for a synthetic ground-truth image: SimSwap can express the identity transformation in the loss function.
> In the current training code, not only is "expected_output" needed; we also need the target and the source images. "expected_output" is created from target + source.

Yes, that is true in the current code, but we should switch to GAN training instead. The SimSwap technique allows learning the identity transform without an expected output.

The reconstruction loss in SimSwap, together with the discriminator loss, helps with the crucial fidelity of the generated output.
@GeneralShan I have created the GAN branch. Do you have any updates on your experiment?
I work locally and have limited time this week, but I will share progress. I am setting up a training run with the SimSwap ArcFace checkpoint as the ID model on FFHQ-aligned data, combined with the SimSwap multi-scale discriminator.

I will train a smaller model variant of the ReSwapper architecture with your style block design, and I will share the results when they are ready.

I will also follow your work closely and contribute.

Please share an update with me once you have a stable training run.

I also suggest you use the VGGFace2-HQ dataset: https://github.com/NNNNAI/VGGFace2-HQ

Please ensure FFHQ alignment is used, because VGGFace2-HQ is FFHQ-aligned: when passing the target image, align it to FFHQ.

Remember that the SimSwap ArcFace is not aligned to FFHQ, which reduces ID retrieval performance; we may need to fine-tune the ArcFace checkpoint on FFHQ-aligned images.
VGGFace2-HQ is not a good dataset; it is artificially high quality. Nothing will substitute for an unmodified high-quality dataset. I am currently building one much better than FFHQ by hand. Do you guys have a Discord or Telegram where we can collaborate?
In my opinion, findings and knowledge should be shared in a way that everyone can easily access without needing an account. I don't see any problems with connecting everyone via GitHub issues or discussions. If you all think we should communicate outside of GitHub, then we'll move. What do you all think?
I agree in principle. However, real-time chat is easier for collaborating on details that don't need to flood this thread.
Here is a repo that implements all of the versions of the SimSwap paper:
> VGGFace2-HQ is not a good dataset; it is artificially high quality. Nothing will substitute for an unmodified high-quality dataset. I am currently building one much better than FFHQ by hand. Do you guys have a Discord or Telegram where we can collaborate?

I am looking through VGGFace2-HQ today and I agree: there are many bad labels, corrupt images, and generally bad data. I do not like it. I am very interested in your better dataset and would like to collaborate.
> I agree in principle. However, real-time chat is easier for collaborating on details that don't need to flood this thread.

I am also interested in real-time chat for collaboration. May I suggest Matrix or Discord?
When I look through VGGFace2-HQ, it becomes clearer to me why SimSwap's quality is so poor compared to InSwapper, even setting aside the smaller layers we identified earlier.
@GeneralShan, @somanchiu and @aesanchezgh
There would be three sets of images: target, source, and "expected_output". The training resolution could vary (128, 256, 512), or we could simply use target, source, and "expected_output" at 512.

- Upscale images with a multi-upscaler like GFPGAN+GPEN2048; others such as CodeFormer or RestoreFormer could be added.
- Enhance micro-skin details using SD+Flux.

Yes, I know this process is very computationally expensive and time-consuming. But if we are training a model from scratch, why not train it perfectly, so it can compete with the original inswapper128?

I saw the results of the existing ReSwapper model that you trained; I integrated it into Rope, and the results are tremendously good while keeping expressions intact.
@aesanchezgh The issue is that we lack ground truth. The swap task requires GAN training, as described in detail in the SimSwap paper. The expected output is ground truth that does not exist; the only way to get it is to use an existing model, which limits us to that model's quality. It is possible to get a minor gain by selecting only the best outputs from InSwapper, but that is still very limited.

If we follow the SimSwap training method, which is what I am trying, all that is required is a dataset of many faces, with at least multiple images of each person. There is no requirement for any expected output, because we express the objective with the loss functions.

This means: if we have a dataset with great diversity of faces, we can train with a GAN and the SimSwap loss functions.
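For example, the SimSwap-style ID loss needs only the swapped result and the source face, never a ground-truth swap. A minimal sketch, assuming an ArcFace embedder that maps aligned crops to identity vectors:

```python
import torch.nn.functional as F

def identity_loss(arcface, swapped, source):
    # The swapped face should carry the source identity: penalize the
    # cosine distance between the two ArcFace embeddings.
    emb_swap = F.normalize(arcface(swapped), dim=1)
    emb_src = F.normalize(arcface(source), dim=1)
    return (1.0 - (emb_swap * emb_src).sum(dim=1)).mean()
```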
I looked at your pipeline, and the problem is the same as with VGGFace2: you run automatic restoration, which introduces many artifacts and failure cases. If you have time to create a dataset, please collect face images using publicly available and properly licensed data.

I see big problems with VGGFace2: many images are corrupt, poorly cropped, full of artifacts, distorted, showing the wrong face, or mislabeled. If we had a fixed version, it would improve the model a large amount.

I recommend modifying your training code to use PyTorch AMP (automatic mixed precision); it speeds up training significantly. When I have time later, I can contribute the modification; a sketch of the idea is below.
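A minimal sketch of the AMP change; the model, loss, and data here are hypothetical stand-ins for the objects in the actual training script:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1).cuda()  # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    target = torch.randn(4, 3, 128, 128, device="cuda")  # dummy batch
    expected = torch.randn(4, 3, 128, 128, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():    # run the forward pass in mixed precision
        loss = criterion(model(target), expected)
    scaler.scale(loss).backward()      # scale loss to avoid fp16 underflow
    scaler.step(optimizer)             # unscale gradients, then step
    scaler.update()                    # adapt the loss scale for the next step
```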
> When I look through VGGFace2-HQ, it becomes clearer to me why SimSwap's quality is so poor compared to InSwapper, even setting aside the smaller layers we identified earlier.

Yes, exactly.
@aesanchezgh @GeneralShan Let's chat in real time on the Discord server: https://discord.com/invite/wVVPSjmj. Please don't forget to share your findings on GitHub. I spend most of my time on GitHub, since my replies are slow, more like comments, and don't need real-time chat. I'm not participating in the dataset discussions at the moment, as I am currently focusing on making the GAN training work.
> - Upscale images with a multi-upscaler like GFPGAN+GPEN2048; others such as CodeFormer or RestoreFormer could be added.
A few days ago I tested using CodeFormer to create higher-resolution images. I trained a model based on the latest ReSwapper checkpoint for tens of thousands of steps. The model output looks like plastic, with no details, no pores, and a very smooth face. I'm not sure if this is due to the super-resolution images. Intuitively, the quality should not be that low.
> The model output looks like plastic, with no details, no pores, and a very smooth face. I'm not sure if this is due to the super-resolution images. Intuitively, the quality should not be that low.

That's why there's a step 3: using SD+Flux.
> I'm not participating in the dataset discussions at the moment, as I am currently focusing on making the GAN training work.

OK, I think it's better to stop the dataset discussion and focus on the training part.
@somanchiu I attempted GAN training, but I have issues: the discriminator quickly becomes perfect and its loss goes to zero, so the generator does not train (its loss fluctuates). Did you observe better results?
@astalavistababe @aesanchezgh Please move the dataset topic to a separate discussion: https://github.com/somanchiu/ReSwapper/discussions

We would like to use this thread for discussing GAN training.
So I have studied your code. It is very interesting as a replication of InSwapper, but right now it is only a distillation of InSwapper, which will limit the model. We can use this replication as the initialization point for the generator.

We need to use a GAN architecture, because InSwapper derives from SimSwap. We can use the new SimSwap++ architecture: https://www.scribd.com/document/784359864/SimSwap-Towards-Faster-and-High-Quality-Identity-Swapping

The original InSwapper is based on the SimSwap GAN, so we can build the replication with the SimSwap++ GAN.