In the paper they describe it as a "Projection Matrix" named P, so I interpreted it as a simple trainable Parameter tensor.
I don't know off the top of my head what you mean by the Mapping Network, but if that's the linear NN that tries to transform a latent vector into a matrix, I guess it would also work. I'm not sure it would necessarily be better, though, and it seems a bit too complex.
In any case, I'm really not confident about my approach when it comes to using the embedded vector in the generator, so you can try that and see if it works.
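To be concrete, what I had in mind is something like this minimal sketch (the class and argument names are just placeholders, not the actual code in this repo):

```python
import torch
import torch.nn as nn

class Projection(nn.Module):
    """Sketch only: maps the embedding vector e_hat to the psi_hat parameter
    vector with a single trainable matrix P (no hidden layers)."""
    def __init__(self, embedding_size, psi_size):
        super().__init__()
        # P is just a plain trainable parameter tensor.
        self.P = nn.Parameter(torch.randn(psi_size, embedding_size) * 0.02)

    def forward(self, e_hat):
        # e_hat: (batch, embedding_size) -> psi_hat: (batch, psi_size)
        return e_hat @ self.P.t()
```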
Actually, I was confused by this section because the paper doesn't say much about the MLP, so I did some research on it. I found the following picture, which depicts a Multi-Layer Perceptron (MLP), which we are now somehow calling a Projection Matrix.
This is just my personal guess about the MLP; I'm still not sure about its exact meaning.
In my view, the role of the MLP in the paper is to provide a learnable way to predict the AdaIN affine coefficients (scale and translation) from the embedding vector. Using a projection matrix is a simpler way to achieve the same goal and also reduces the compute cost.
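Just to make my guess concrete, the MLP variant I'm imagining would look roughly like this (purely a sketch; the hidden width and depth are made up, not from the paper):

```python
import torch.nn as nn

class AdaINParamMLP(nn.Module):
    """Sketch: predicts the AdaIN affine coefficients (scale and translation)
    from the embedding vector with a small MLP instead of a single matrix."""
    def __init__(self, embedding_size, psi_size, hidden_size=512):
        super().__init__()
        self.net = nn.Sequential(
            # hidden_size is an arbitrary choice, not from the paper
            nn.Linear(embedding_size, hidden_size),
            nn.ReLU(inplace=True),
            # psi_size would be 2 * (total number of normalized channels):
            # one half for the scales, one half for the translations.
            nn.Linear(hidden_size, psi_size),
        )

    def forward(self, e_hat):
        return self.net(e_hat)
```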
It is not clear to me where they put the AdaIN layers. I would guess they should be placed in the residual blocks rather than in the upsampling layers of the generator network.
Hi @MrCaracara, I have rechecked the paper (Section 3.4, Implementation Details) and found that the person-specific parameters psi_i are meant for regular instance normalization (not for AdaIN). However, in the current implementation these factors are used in AdaIN layers. Could you clarify this? Thank you.
I am very interested to hear whether using an MLP instead of a simple matrix improves the results. @JeonMinkyu, @keishatsai, please let me know if you try that out.
@davidtranno1: The paper says exactly "adaptive instance normalization", so that's why I used it. They only use regular (non-adaptive) instance normalization in the encoder part of the generator. As for the placement of the AdaIN layers, nothing is specified. The only thing that's clear is that they don't use them in the encoder part.
I am still confused; the authors wrote: "The person-specific parameters psi_i_hat serve as the affine co-efficients of instance normalization layers"
If the psi parameter is used in AdaIN then the sentence should be: "The person-specific parameters psi_i_hat serve as the affine co-efficients of Adaptive instance normalization layers"
Here are some results obtained using the MLP. I use only one hidden layer with 4096 units. Training is still in progress.
From left to right: landmarks -> ground truth -> generated image.
@davidtranno1 Nice to see that it works so well with MLP! Did you get these results without using AdaIN then? I see you haven't updated your own fork.
Hi, I tried both approaches, with AdaIN and without it, and found that AdaIN plays a highly important role in reducing the identity gap between real and fake images. I re-implemented the main idea of the paper on my own, with some reasonable changes so the network can run on a modest GPU. On my low-power NVIDIA GTX 1060 (6 GB), I achieved nice results after two epochs.
I will publish my work when it is done.
Hi, can I ask for some details of your implementation? Did you replace the e_psi in this code with an MLP, using a different MLP for each AdaIN layer's parameters? Or did you use one MLP and share the same parameters across all AdaIN layers?
@davidtranno1
Hi,
I used only one MLP for only one AdaIN layer. I found that increasing the number of AdaIN layers leads to weird results. The paper also didn't specify the placement of AdaIN; after some trial and error, I found that it should be placed before the up-sampling layers (after the residual blocks); see the sketch at the end of this comment.
In addition, I found that the feature matching term does not play as important a role as the perceptual loss. I tried removing this loss (FM), and afterwards there was no sharp distinction between the two versions. In my view, FM is also a kind of perceptual loss, and it cannot dominate the one calculated with VGG19 and VGGFace.
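Roughly, the placement looks like this (a simplified sketch, not my exact code; res_blocks, upsample_blocks and mlp are placeholder names):

```python
import torch.nn.functional as F

def generator_forward(self, landmark_feats, e_hat):
    # Sketch only: attribute names are placeholders.
    # Residual blocks use regular instance normalization, no AdaIN inside.
    x = self.res_blocks(landmark_feats)

    # One MLP predicts scale and translation for the single AdaIN layer,
    # which sits after the residual blocks and before the up-sampling layers.
    scale, shift = self.mlp(e_hat).chunk(2, dim=1)
    x = F.instance_norm(x)
    x = x * scale.unsqueeze(-1).unsqueeze(-1) + shift.unsqueeze(-1).unsqueeze(-1)

    # Up-sampling layers produce the output image.
    return self.upsample_blocks(x)
```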
Hi @davidtranno1, did you open-source any code for the changes to the network that you describe?
@davidtranno1 This is very interesting! This structure deviates a lot from the one described in the paper. Does that mean that you are still using Batch Normalization inside the (upscale) residual layers? Do you perform AdaIN directly in the generator, outside of any residual block? And how do you map the outputs of the MLP to the inputs of the AdaIN layer?
I found the author's explanation of the AdaIN layers online; someone asked by e-mail, but I don't know how to upload a picture, so I'll quote it: "AdaIN can be viewed as an instance normalization with trainable affine parameters (in the adaptive variant, what you have called the target's mean and variance). We predict both of these vectors for each normalization layer using the output of the embedder network and an MLP." He also talked about the architecture: "Basically we took a standard Johnson et al. architecture, replaced downsampling and upsampling layers with residual blocks (BigGAN style) and replaced all normalization layers in the residual blocks, operating at the same resolution, with adaptive instance normalization."
Hope this provides some information. @MrCaracara
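My reading of that description, as a rough sketch (all names here are my own, not from the author's code):

```python
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    """Rough sketch based on the quoted e-mail: instance normalization whose
    affine parameters are not learned directly but predicted per layer
    (e.g. by an MLP on the embedder output)."""
    def forward(self, x, scale, shift):
        x = F.instance_norm(x)
        return x * scale.view(x.size(0), -1, 1, 1) + shift.view(x.size(0), -1, 1, 1)

class AdaptiveResBlock(nn.Module):
    """Same-resolution residual block with AdaIN in place of the usual normalization."""
    def __init__(self, channels):
        super().__init__()
        self.norm1, self.norm2 = AdaIN(), AdaIN()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, scale1, shift1, scale2, shift2):
        h = self.conv1(F.relu(self.norm1(x, scale1, shift1)))
        h = self.conv2(F.relu(self.norm2(h, scale2, shift2)))
        return x + h
```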
Hey, @hanxuanhuo
Yes, this much I understand, and it is the same as what I have done in the code here. That's why I'm interested in how you would use only one AdaIN layer, since there is a lot more than one residual layer. I don't think these layers would work properly without any kind of normalization; the original layers had Batch Norm instead of AdaIN.
I only use instance normalization (not AdaIN) in the residual blocks of the generator. The output of my MLP can be divided into two parts: the first is the scale and the second is the translation, and these factors are learned from the embedding vector.
As I mentioned before, I only use a very low-power NVIDIA GTX 1060 (6 GB) to train the model, so I made reasonable changes to fit my situation. In some parts my network may not be exactly identical to the original paper, but it still yields very good results.
@davidtranno1
Can I ask how fast you converged to reasonable results? And about the "scale" and "translation" parts, did you mean the AdaIN layers' "mean" and "std"? Did you use torchvision.transforms to normalize the data? I found another repo that did not normalize the data and got reasonable results.
Yes, I mean the scale and translation are the mean and std, respectively. I did not normalize the data; it was not necessary for me. Once you have a fake image, you need to multiply it by 255 (rescale the intensity) to display it nicely (see the sketch below).
On my GTX 1060, I achieved good results after 2 epochs (approximately one day of training). In addition, I found that the VoxCeleb2 data is too big, so I used an alternative here: http://www.robots.ox.ac.uk/~vgg/research/CMBiometrics/data/dense-face-frames.tar.gz If I had a more powerful GPU (e.g. 2080 Ti, Titan RTX) I would try other repos.
Hope that helps.
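For reference, the display step I described amounts to something like this (just a sketch, assuming the generator output is roughly in the [0, 1] range):

```python
import numpy as np

def to_display(fake):
    """Sketch: fake is a (3, H, W) float tensor from the generator,
    assumed to be roughly in [0, 1]."""
    img = fake.detach().cpu().numpy().transpose(1, 2, 0)   # CHW -> HWC
    img = np.clip(img * 255.0, 0, 255).astype(np.uint8)    # rescale the intensity
    return img
```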
I tried to modify the code: after transforms.ToTensor(), I multiply by 255, and I deleted the generator's last sigmoid layer and instance normalization layer, just like this repo does. I also deleted the last 2 downsampling resblocks and the first 2 upsampling resblocks. Now it converges to a human-like head very fast, within a few iterations. So I think the problem may be that the gradient is too small to update the network, so you have to train for a very long time to converge. @MrCaracara Result after 5000 iterations, batch size 1:
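The input side of that change is roughly this (a sketch; the resolution is just a placeholder):

```python
from torchvision import transforms

# Sketch: keep pixel values in [0, 255] instead of [0, 1]; with the generator's
# final sigmoid and instance normalization removed, its raw output can match that range.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),            # placeholder resolution
    transforms.ToTensor(),                    # floats in [0, 1]
    transforms.Lambda(lambda x: x * 255.0),   # multiply by 255 after ToTensor()
])
```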
If that's the case, then that's great news! I guess the problem then has mainly been lack of patience. I wonder though what the effect is of removing the IN layer and activation function of the Generator. Have you compared the outputs with and without?
For the IN layer, another repo I tried can converge without it, but I haven't gotten good results yet for lack of patience. I tried adding a ReLU after the upsample resblocks, but it seems much harder to converge compared with no ReLU.
According to Figure 2 of the paper, the output of the Embedder passes through the MLP (maybe the Mapping Network from StyleGAN) and enters the Generator, but in this code there is no MLP (mapping network). Am I missing something in the paper or the code?