aLohrer opened this issue 1 year ago
Ok, I went through the training code once more and realized that the facial perception loss is not implemented.
Might this cause the above-mentioned issues? Will you release the source code for the facial perception loss?
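In case it helps others, here is a minimal sketch of how such a loss could look, assuming a frozen VGG-16 as the perception network and a face box from an external detector; the layer cut-off and the weight `lambda_fp` are my own guesses, not the paper's values:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG-16 features as a stand-in for the paper's perception network
# (ImageNet normalization omitted for brevity).
vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval().cuda()
for p in vgg.parameters():
    p.requires_grad_(False)

def facial_perception_loss(fake, real, face_box):
    """L1 distance between VGG features of the face regions.

    fake/real: NCHW images in [0, 1]; face_box: (x1, y1, x2, y2) from a
    face detector run on the source image (assumption on my side).
    """
    x1, y1, x2, y2 = face_box
    fake_face = F.interpolate(fake[:, :, y1:y2, x1:x2], size=(224, 224))
    real_face = F.interpolate(real[:, :, y1:y2, x1:x2], size=(224, 224))
    return F.l1_loss(vgg(fake_face), vgg(real_face))

# The total loss would then be something like (lambda_fp is a guess):
# loss = loss_adv + lambda_fp * facial_perception_loss(fake, src, box)
```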
@aLohrer May I ask if you are using single-card or multi-card training? I currently use single-card training, and the results are very poor.
Single card.
Can you share your results for comparison?
I think it's mostly a problem caused by the missing facial perception loss in my case.
I use a single Nvidia 4090 and trained for 71 h (300k steps); the results are very poor.
Hi, @aLohrer. I ran into the same issue, and I also found that there is no facial perception loss in the code. Did you solve this problem? If so, could you share your findings? Thanks, hope you have a good day.
Hi, congrats on the great paper!
I want to try to port this nice work to mobile, but before getting into performance I tried to reproduce the results.
As suggested, I went with one of the SD styles, as it seemed easy to generate data; I tried the clipart style.
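In case it is useful for comparison, the generation loop I used is essentially the following minimal sketch with the `diffusers` library; the model id and prompt are placeholders for whatever style checkpoint you pick:

```python
import torch
from diffusers import StableDiffusionPipeline

# Model id and prompt are placeholders; swap in your own style checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for i in range(100):
    image = pipe(
        "clipart style portrait of a person, flat colors, simple shapes",
        num_inference_steps=30,
        guidance_scale=7.5,
    ).images[0]
    image.save(f"clipart/{i:04d}.png")
```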
Here is an example of a generated clipart image![image](https://user-images.githubusercontent.com/7009820/231542802-0cb507d7-c2e5-4b8d-a8b5-eb5c56052c4a.png)
They all look pretty good. Afterwards I went on to generate samples via StyleGAN2.![image](https://user-images.githubusercontent.com/7009820/231543185-c07e6840-b92f-466e-9bc0-d2a7a708f60d.png)
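For anyone reproducing this step, sampling from the finetuned generator looks roughly like this, assuming the stylegan2-ada-pytorch codebase is importable so the pickle can resolve its classes (the checkpoint name is a placeholder):

```python
import pickle
import torch

# "stylized.pkl" is a placeholder for the finetuned StyleGAN2 checkpoint.
with open("stylized.pkl", "rb") as f:
    G = pickle.load(f)["G_ema"].cuda().eval()

z = torch.randn([8, G.z_dim], device="cuda")
imgs = G(z, None, truncation_psi=0.7)   # NCHW float in [-1, 1]
imgs = (imgs.clamp(-1, 1) + 1) * 127.5  # map to [0, 255] for saving
```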
I realized that the generated cartoon samples from StyleGAN2 are only 256×256. Is that an issue?
Anyway, the next step is training the texture translator, starting from the anime model as initial weights.
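The warm start itself is just the usual checkpoint loading; a minimal sketch, where `TextureTranslator` and the checkpoint path are placeholders for the repo's actual class and the released anime weights:

```python
import torch

# Placeholders: TextureTranslator and the path stand in for the repo's
# actual translator class and the released anime checkpoint.
unet = TextureTranslator()
state = torch.load("anime_texture_translator.pt", map_location="cpu")
missing, unexpected = unet.load_state_dict(state, strict=False)
# strict=False tolerates mismatched heads; check what did not load:
print("missing:", missing, "unexpected:", unexpected)
```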
Iteration 0 (basically anime style)![image](https://user-images.githubusercontent.com/7009820/231543812-b68a9ea6-cd4e-48a9-b13b-d9c0087ca715.png)
Iteration 1000![image](https://user-images.githubusercontent.com/7009820/231544033-49ffeb7a-9bf1-442c-93c2-9f6007704dc0.png)
Iteration 10000![image](https://user-images.githubusercontent.com/7009820/231544267-96465de5-9b18-4b83-9f7f-4a2cd2796485.png)
Iteration 30000![image](https://user-images.githubusercontent.com/7009820/231544512-3dc8e48c-da78-4899-9089-58807dfff96a.png)
Iteration 100000![image](https://user-images.githubusercontent.com/7009820/231545276-78761881-69cb-47fd-9430-ceaaacbb0ef9.png)
Here are the loss curves![image](https://user-images.githubusercontent.com/7009820/231545462-88847989-fa9e-4a33-9348-23d3b4f60a65.png)
From the images I have seen so far, it really is catching the style nicely, but it has a major problem with teeth. Unfortunately, that's quite an important facial region.
My questions are:

1. Did I mess something up in the training procedure? I saw a similar effect in your paper under Fig. 11, which is countered by the facial perception loss. Is changing the weight of the facial perception loss a good idea to get better teeth (less content-faithful, but better looking)?
2. Is the style just not usable with the framework, and should I go for some other style instead?
3. Or is it just an issue with the SD-generated data?
I am happy to test any other style to validate the training process, if you can point me to a dataset I should use and to some intermediate results that I should expect.
Bonus question (this is just thinking out loud):
As my final goal is to get something really performant, I would like to replace the U-Net with a MobileNetV3. I am currently not sure whether a MobileNet can pick up the unsupervised training signal, or whether it would be better to train the U-Net first and use a teacher/student approach to transfer the training results to a MobileNet in a supervised fashion (see the sketch below). Did you test different architectures for the texture translation block?
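To make the teacher/student option concrete, here is a minimal sketch of the distillation idea; the frozen `teacher` (the trained U-Net) and the `loader` of real face photos are placeholders, and `MobileStudent` is a hypothetical MobileNetV3-backed encoder-decoder, not anything from the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v3_small

class MobileStudent(nn.Module):
    """Hypothetical image-to-image student: MobileNetV3 encoder + naive decoder."""
    def __init__(self):
        super().__init__()
        self.encoder = mobilenet_v3_small(weights=None).features  # downsamples 32x
        self.decoder = nn.Sequential(
            nn.Conv2d(576, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

teacher.eval()  # placeholder: the trained U-Net texture translator, frozen
student = MobileStudent().cuda()
opt = torch.optim.Adam(student.parameters(), lr=2e-4)

for imgs in loader:  # placeholder: real face photos, NCHW in [-1, 1]
    imgs = imgs.cuda()
    with torch.no_grad():
        target = teacher(imgs)  # teacher outputs become pseudo ground truth
    loss = F.l1_loss(student(imgs), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A real student would of course need skip connections or progressive upsampling instead of the naive ×32 upsample; this only shows where the supervision signal would come from.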
Sorry for the many questions, but it's such an interesting work that I could ask 100 more (but I won't, promised :crossed_fingers:).