grey-eye / talking-heads

Our implementation of "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models" (Egor Zakharov et al.)
GNU General Public License v3.0

Results are very bad with the latest multi-GPU training update #24

Closed akshay951228 closed 5 years ago

akshay951228 commented 5 years ago

Hi, after going through your code I have a few questions:

1) There is no activation function in the embedder's final layer (https://github.com/grey-eye/talking-heads/blob/master/network/network.py#L228), and the same goes for the generator (https://github.com/grey-eye/talking-heads/blob/master/network/network.py#L158). Can I know the reason behind this?

2) You modify [B, K, 6, 256, 256] to [BxK, 6, 256, 256] in the Embedder and finally get (B, E_VECTOR_LENGTH, 1). If the batch size is two, the embedder is supposed to find pose-independent information for each person, but here we are giving it two people's information stacked together, so I think the embedder may be suffering and not learning any information.

castelo-software commented 5 years ago

Hi @akshay951228,

The last commit was still not producing good results; I had pushed it to record a major change in how the residual layers work, in order to integrate the AdaIN layers properly. I just made a new commit with all the changes since then, and now the network seems to produce reasonable results.

As for your questions:

1a. The embedder has always had an activation layer at the end. If what you're asking is why the ReLU comes after the pooling and not before, that's because it doesn't matter: ReLU clamps negative values to zero, whereas max pooling selects the maximum value for each channel, so capping negatives first or selecting the maxima first gives the same result.
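A minimal sketch (not the repo's code) of that argument, assuming a global max pooling over the spatial dimensions:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 512, 8, 8)  # hypothetical feature map

# ReLU applied after global max pooling...
a = F.relu(F.adaptive_max_pool2d(x, output_size=1))
# ...equals ReLU applied before it, because clamping at zero
# and taking a maximum commute (ReLU is monotone).
b = F.adaptive_max_pool2d(F.relu(x), output_size=1)

assert torch.equal(a, b)
```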

1b. The generator had no activation layer in the previous commit because I realized that since I'm using ImageNet normalization on the frames, the generated images should end up with the same range of values, which is different for every channel. So I decided I would for now let the Generator learn it on its own. In this version, I am not performing ImageNet normalization when extracting the data, so the generated images should have values in the range [0, 1], and that allows me to do a sigmoid at the end of the Generator. I am still doing ImageNet normalization inside the loss function in order to be able to use the VGG model.

2. Every item in a batch B will be a different person. I'm compacting the B and K dimensions into one because the Embedder layers can only accept a single batch dimension. At the end, the result is one vector per batch item, i.e. B vectors in total. That means that if you give frames for 3 different people to the embedder, you'll get 3 different embedding vectors, one for each person, all independent from each other.
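A rough sketch of that collapsing and un-collapsing (shapes follow the discussion above; the final averaging over the K frames is my assumption about how the per-frame vectors are combined, with a random tensor standing in for the embedder output):

```python
import torch

B, K, C, H, W = 3, 8, 6, 256, 256   # 3 people, 8 frames each, RGB + landmark channels
E = 512                              # hypothetical embedding size

frames = torch.randn(B, K, C, H, W)

# Collapse B and K so the convolutional layers see a single batch dimension.
flat = frames.view(B * K, C, H, W)

# embedder(flat) would produce one vector per frame: (B*K, E).
per_frame = torch.randn(B * K, E)    # stand-in for the embedder output

# Restore the person dimension and average over the K frames,
# giving one pose-independent embedding per person: (B, E).
e_hat = per_frame.view(B, K, E).mean(dim=1)
assert e_hat.shape == (B, E)
```
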
akshay951228 commented 5 years ago

Firstly, thanks for the fast response.

1a) My bad, it's not at line 228, it's at https://github.com/grey-eye/talking-heads/blob/master/network/network.py#L243: after the bmm there is no activation function. In an older commit there was a tanh activation, but now you are not using any activation.

1b) You can normalize to [0, 1], scale it to 255, and do the ImageNet normalization before passing it to the VGG model. Have a look at the link below, maybe it will help: https://github.com/xthan/VITON/blob/e6b560225975ddd40d96359cc13f8d66b975aa20/utils.py#L488

2) Yep, I did some experiments. What I found is: if I take a batch size of 2 and load it on a single GPU, the model doesn't learn anything; but if I give the embedder a batch of 1, it gives some results, and they also improve as training progresses.

Did a batch size of more than 1 work for you?

castelo-software commented 5 years ago

1b) You can normalize to [0, 1], scale it to 255, and do the ImageNet normalization before passing it to the VGG model. Have a look at the link below, maybe it will help.

This is exactly what is happening right now. The ImageNet normalization is done inside the loss function, before passing the images to VGG. And I'm not doing any normalization of the images myself; when converting them to a tensor, PyTorch automatically scales them to [0, 1].
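For illustration, a minimal sketch of doing the ImageNet normalization inside a VGG-based perceptual loss while the images themselves stay in [0, 1] (the chosen VGG layers and the L1 distance are assumptions here, not necessarily what this repo uses):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# ImageNet statistics, applied only inside the loss so the generator
# can keep producing values in [0, 1] (e.g. through a final sigmoid).
_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

vgg = models.vgg19(pretrained=True).features[:26].eval()  # layer cut-off is arbitrary here

def perceptual_loss(x_hat, x):
    """L1 distance between VGG features of generated and ground-truth frames,
    both assumed to be RGB tensors in [0, 1] with shape (B, 3, H, W)."""
    feats_hat = vgg((x_hat - _MEAN) / _STD)
    feats = vgg((x - _MEAN) / _STD)
    return F.l1_loss(feats_hat, feats)
```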

Did a batch size of more than 1 work for you?

I haven't had the time to let it train long enough to get really good results, but with a batch size of 2 I have managed to get silhouettes of faces with a color pattern similar to the source image, after some 10 hours of training on the smaller test set.

akshay951228 commented 5 years ago

These are the results I got after 50 epochs with the latest commit and the small dataset: [images: 20190803_115719217532_x, 20190803_115719239546_x_hat]

Do I still need to train more? Any suggestion from your side would be very helpful.

busning commented 5 years ago

Similar results to yours, @akshay951228, after 1000 epochs using 200 videos.

castelo-software commented 5 years ago

200 videos is probably far too few. The full dataset used in the paper has ~140000 videos. I am training with a subset of 120000 due to memory issues.

I still haven't managed to get real results either yet, though. So far the network produces things like this:

[images: 20190805_090851516670_x_hat, 20190805_084324402346_x_hat]

akshay951228 commented 5 years ago

@MrCaracara at which epoch did you get the above results?

castelo-software commented 5 years ago

That is after 2 epochs.

busning commented 5 years ago

I also tried using the whole dataset to train, and set different batch sizes to compare.

Batch_size = 1, after 3 epochs:

[image: 20190805_161254199641_x_hat]

Batch_size = 32, after 5 epochs:

[image: 20190805_043112075014_x_hat]

It seems that larger batch size leads to worse results.

In addition, I also want to show the results below without using the MLP. After 1.5 epochs, the results seem better than with the MLP.

[image: 20190801_010858523193_x_hat]

@MrCaracara

akshay951228 commented 5 years ago

Yes @busning, there is a problem with the batch size. After some debugging I found that the results are weird because of the embedder network; there was no problem with the generator and discriminator as far as I know.

Maybe we should focus on the embedder, @MrCaracara, to solve this batch size issue.

busning commented 5 years ago

Thank you for pointing out the right direction, I will focus on debugging the embedder today. @akshay951228

castelo-software commented 5 years ago

If that's the result you get when using batches, then I guess the problem must be with the collapsing of the B and K dimensions of the training frames, as discussed in the first posts of this thread. It could be that data is leaking from one batch item to another when passing through the Embedder.

As for the use of the MLP, I added it to try it out, since someone claimed in a different thread that it helped their results, so I'm surprised to see that you get decent results in just 1.5 epochs without it. I guess I will restart my training with just a projection matrix.

After 3 epochs, this is the result that I get with the code as it is right now: [images: 20190806_091525321137_x_hat, last_result_x_hat]

castelo-software commented 5 years ago

@woshixylg: Indeed, that's what they call the feed forward version of the algorithm. The switch to turn L_mch on or off is already present in LossD, and these results are without L_mch.

@MrCaracara Similar results to yours after 3 epochs using batch_size = 1. However, I notice that the losses do not decrease when training for more epochs. The paper mentions that they first trained the network for 150 epochs without using L_mch; I tried that today and am waiting for the results. I guess the reason the losses don't decrease further is that more weight is assigned to L_mch.

akshay951228 commented 5 years ago

@MrCaracara In ResidualBlockDown you pass the input through the ReLU activation (https://github.com/grey-eye/talking-heads/blob/master/network/components.py#L96) before the conv. Is there a reason behind that? The input is normalized to [0, 1], and applying ReLU to it gives the same output as the input.

castelo-software commented 5 years ago

@MrCaracara In ResidualBlockDown you pass the input through the ReLU activation (https://github.com/grey-eye/talking-heads/blob/master/network/components.py#L96) before the conv. Is there a reason behind that? The input is normalized to [0, 1], and applying ReLU to it gives the same output as the input.

That is simply the implementation of the original Residual Block Down from BigGAN, as referenced in the paper (see Figure 15): [image]

I assume the reason they added a ReLU there is to keep the data non-negative throughout the entire network, since the final output has to be in [0, 1] anyway (anything outside that range would be an invalid RGB value).
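For context, a rough sketch of the BigGAN-style down-sampling residual block being discussed, with the activation placed before the first convolution (layer sizes and the skip-connection details are assumptions, not the repo's exact code):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlockDownSketch(nn.Module):
    """ReLU -> conv -> ReLU -> conv -> avg-pool on the residual branch,
    1x1 conv + avg-pool on the skip branch (as in BigGAN's Figure 15)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.conv_skip = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # Activation comes *before* the first convolution, as discussed above.
        out = self.conv1(F.relu(x))
        out = self.conv2(F.relu(out))
        out = F.avg_pool2d(out, 2)
        skip = F.avg_pool2d(self.conv_skip(x), 2)
        return out + skip
```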

hanxuanhuo commented 5 years ago

Hi @MrCaracara, I tried this repo and it converges very fast (about 4000 iterations) to results like the ones you got after 3 epochs. His code does not scale the data, it just keeps it in [0, 255]. (I modified his perceptual loss.)

akshay951228 commented 5 years ago

Hi, there is no activation function at the end of the discriminator (https://github.com/grey-eye/talking-heads/blob/master/network/network.py#L243), and I'm a little confused about W_i and how it relates to e_hat during fine-tuning. Can you throw some light on it?

castelo-software commented 5 years ago

Hi, there is no activation function at the end of the discriminator (https://github.com/grey-eye/talking-heads/blob/master/network/network.py#L243), and I'm a little confused about W_i and how it relates to e_hat during fine-tuning. Can you throw some light on it?

I tried adding a sigmoid and a tanh at the end of the discriminator, but then the loss of D would stall, so that's why I ended up removing it. The output doesn't really mean anything beyond "the higher the value, the more likely the image is to be real", since there are no labels, so that's why I also considered it unnecessary.

About W: when training for fine-tuning, every column of that matrix is supposed to be similar to the e vector of the corresponding video. So far I have only trained the feed-forward model, which doesn't do this.
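That correspondence is what the paper's matching loss enforces; a minimal sketch of how it could look (shapes and the loss weight are my assumptions, not this repo's code):

```python
import torch
import torch.nn.functional as F

NUM_VIDEOS, E = 140000, 512                          # hypothetical sizes
W = torch.randn(E, NUM_VIDEOS, requires_grad=True)   # one column per training video

def loss_mch(e_hat, video_idx, weight=80.0):
    """L1 distance between the column of W for this video and the embedding
    vector produced by the Embedder, pushing the two towards each other."""
    return weight * F.l1_loss(W[:, video_idx], e_hat)
```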

castelo-software commented 5 years ago

In addition, I also want to show the results below without using MLP, after 1.5 epochs, the results seem better than using MLP.

@busning, did you change anything else besides replacing the MLP back with a matrix to get the results in the last image?

akshay951228 commented 5 years ago

Tried with 3000 videos. After 1 epoch: [image: 2_499]

After 12 epochs: [image: 13_599]

I did some reordering in the res blocks and added a sigmoid activation function at the end of the discriminator. I'm not using the MLP, just the projection matrix, and it's working with batch size > 1.

castelo-software commented 5 years ago

Those are the best results I have seen so far! Great job! What kind of reordering did you do? If you make a pull request, I'll accept it and try to train it with the entire dataset.

akshay951228 commented 5 years ago

Instead of a pull request, I'll describe exactly what I did. Two modifications:

1) Changed the two res blocks:

```python
import torch.nn as nn
import torch.nn.functional as F
# ConvLayer and AdaIn are the repo's existing components (network/components.py)


class AdaptiveResidualBlock(nn.Module):
    def __init__(self, channels):
        super(AdaptiveResidualBlock, self).__init__()

        self.in1 = AdaIn()
        self.in2 = AdaIn()

        self.conv1 = ConvLayer(channels, channels, kernel_size=3, stride=1)
        self.conv2 = ConvLayer(channels, channels, kernel_size=3, stride=1)

    def forward(self, x, mean1, std1, mean2, std2):
        residual = x
        out = F.relu(self.in1(self.conv1(x), mean1, std1))
        out = self.in2(self.conv2(out), mean2, std2)
        out = out + residual
        return out


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        self.conv1 = ConvLayer(channels, channels, kernel_size=3, stride=1)
        self.in1 = nn.InstanceNorm2d(channels, affine=True)
        self.conv2 = ConvLayer(channels, channels, kernel_size=3, stride=1)
        self.in2 = nn.InstanceNorm2d(channels, affine=True)

    def forward(self, x):
        residual = x
        out = F.relu(self.in1(self.conv1(x)))
        out = self.in2(self.conv2(out))
        out = out + residual
        return out
```

2) I just added a sigmoid activation function after this line: https://github.com/grey-eye/talking-heads/blob/master/network/network.py#L243. I think this activation function is what helped me converge faster.

hanxuanhuo commented 5 years ago

Tried with 3000 videos. After 1 epoch: [image: 2_499]

After 12 epochs: [image: 13_599]

I did some reordering in the res blocks and added a sigmoid activation function at the end of the discriminator. I'm not using the MLP, just the projection matrix, and it's working with batch size > 1.

Did you change the discriminator's loss? I don't think a sigmoid, whose output is in [0, 1], fits the hinge loss, which works with scores in [-1, 1].

akshay951228 commented 5 years ago

@hanxuanhuo I don't think the hinge loss is restricted to [-1, 1]: [image: hinge_loss]
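For reference, a sketch of the standard GAN hinge loss being discussed (not the repo's exact code): the discriminator scores are unbounded, and the margins at ±1 only mark where the penalty stops.

```python
import torch.nn.functional as F

def hinge_loss_d(d_real, d_fake):
    """Discriminator term: push real scores above +1 and fake scores below -1."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def hinge_loss_g(d_fake):
    """Generator term: raise the discriminator's score on generated images."""
    return -d_fake.mean()
```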

castelo-software commented 5 years ago

Using a sigmoid on the output of D does seem to help the model converge much quicker! However, both losses stagnate very quickly too. Do yours also remain around these values?

```
Loss_E_G = 0.0985  Loss_D = 1.0001
Loss_E_G = 0.1116  Loss_D = 1.0018
Loss_E_G = 0.1088  Loss_D = 1.0092
Loss_E_G = 0.1202  Loss_D = 1.0012
Loss_E_G = 0.1073  Loss_D = 1.0000
Loss_E_G = 0.1028  Loss_D = 1.0007
Loss_E_G = 0.0949  Loss_D = 1.0016
Loss_E_G = 0.1048  Loss_D = 1.0010
Loss_E_G = 0.0941  Loss_D = 1.0014
Loss_E_G = 0.1087  Loss_D = 1.0004
Loss_E_G = 0.1067  Loss_D = 1.0000
Loss_E_G = 0.1116  Loss_D = 1.0003
Loss_E_G = 0.1265  Loss_D = 1.0021
Loss_E_G = 0.1203  Loss_D = 1.0003
Loss_E_G = 0.1136  Loss_D = 1.0001
Loss_E_G = 0.0956  Loss_D = 1.0003
Loss_E_G = 0.1091  Loss_D = 1.0001
Loss_E_G = 0.1127  Loss_D = 1.0009
Loss_E_G = 0.1095  Loss_D = 1.0031
```

akshay951228 commented 5 years ago

Yep @MrCaracara, these are the values right now for my training:

```
epoch: 16, step: 662  E_D: 0.0879344791173935    D: 1.000479817390442
epoch: 16, step: 663  E_D: 0.087563157081604     D: 1.166745662689209
epoch: 16, step: 664  E_D: 0.08357848972082138   D: 1.166815996170044
epoch: 16, step: 665  E_D: 0.026573803275823593  D: 1.0364704132080078
epoch: 16, step: 666  E_D: 0.043518807739019394  D: 1.0275287628173828
epoch: 16, step: 667  E_D: 0.06920713186264038   D: 1.0043773651123047
epoch: 16, step: 668  E_D: 0.07651181519031525   D: 1.0000147819519043
epoch: 16, step: 669  E_D: 0.08825021982192993   D: 1.0003926753997803
epoch: 16, step: 670  E_D: 0.08158326894044876   D: 1.1674630641937256
epoch: 16, step: 671  E_D: 0.08684965968132019   D: 1.028291940689087
epoch: 16, step: 672  E_D: 0.07762998342514038   D: 1.0001451969146729
epoch: 16, step: 673  E_D: 0.09002198278903961   D: 1.0000580549240112
epoch: 16, step: 674  E_D: 0.09793408960103989   D: 1.0000355243682861
epoch: 16, step: 675  E_D: -0.19194158911705017  D: 1.3097941875457764
epoch: 16, step: 676  E_D: -0.25006210803985596  D: 1.3337972164154053
epoch: 16, step: 677  E_D: 0.07950375974178314   D: 1.00016450881958
epoch: 16, step: 678  E_D: 0.09060948342084885   D: 1.0008139610290527
epoch: 16, step: 679  E_D: 0.09333132207393646   D: 1.0000040531158447
epoch: 16, step: 680  E_D: 0.07605766505002975   D: 1.0002230405807495
epoch: 16, step: 681  E_D: 0.07999680936336517   D: 1.0001215934753418
epoch: 16, step: 682  E_D: -0.0832897275686264   D: 1.164804220199585
epoch: 16, step: 683  E_D: 0.07924594730138779   D: 1.000558853149414
epoch: 16, step: 684  E_D: -0.04413623735308647  D: 1.1611897945404053
epoch: 16, step: 685  E_D: 0.056591056287288666  D: 1.0390490293502808
epoch: 16, step: 686  E_D: 0.08398791402578354   D: 1.0063879489898682
epoch: 16, step: 687  E_D: 0.08140194416046143   D: 1.0005378723144531
epoch: 16, step: 688  E_D: -0.09771829843521118  D: 1.1667736768722534
epoch: 16, step: 689  E_D: -0.15706413984298706  D: 1.2116470336914062
epoch: 16, step: 690  E_D: 0.08421777933835983   D: 1.0044937133789062
epoch: 16, step: 691  E_D: 0.07601674646139145   D: 1.2363193035125732
epoch: 16, step: 692  E_D: 0.09327101707458496   D: 1.0000172853469849
epoch: 16, step: 693  E_D: 0.08491037786006927   D: 1.0000053644180298
epoch: 16, step: 694  E_D: 0.07348034530878067   D: 1.013258457183838
epoch: 16, step: 695  E_D: 0.08878153562545776   D: 1.002532720565796
epoch: 16, step: 696  E_D: -0.2234032154083252   D: 1.3043357133865356
epoch: 16, step: 697  E_D: 0.07888448238372803   D: 1.1667258739471436
epoch: 16, step: 698  E_D: -0.06074731796979904  D: 1.1514358520507812
epoch: 16, step: 699  E_D: 0.0816541537642479    D: 1.00107741355896
epoch: 16, step: 700  E_D: 0.08438102900981903   D: 1.0000966787338257
epoch: 16, step: 701  E_D: 0.08242268860340118   D: 1.0000488758087158
epoch: 16, step: 702  E_D: 0.08105947822332382   D: 1.0000821352005005
epoch: 16, step: 703  E_D: 0.08781936764717102   D: 1.0000382661819458
epoch: 16, step: 704  E_D: 0.08520457148551941   D: 1.1619987487792969
epoch: 16, step: 705  E_D: -0.030515065416693687 D: 1.113174319267273
```

At the start, the results looked like this: [image: initial]

These are the present results: [image: current]

busning commented 5 years ago

@MrCaracara I didn't change anything except for not using MLP.

busning commented 5 years ago

@akshay951228 I still cannot find the errors in multi-GPU training with a larger batch size for the embedder. Can you give me some hints on where I should focus? I only checked the embedder and the projection layer.

akshay951228 commented 5 years ago

@busning, I applied DataParallel to the embedder only, not to the generator and discriminator. With a batch size of 4 on four GPUs, that means the generator and discriminator run on a single GPU with batch size 4, while the embedder gets a batch of 1 on each GPU, and that works. But when the embedder takes a batch size of more than one, it doesn't work; that's how I found out the issue is with the embedder. However, I then made some changes to the generator's res block and added an activation function to the discriminator, which surprisingly solves the batch size problem: I'm now able to run batch size 4 on 2 GPUs.
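A small sketch of the device layout described above, using a toy stand-in for the Embedder (the real classes live in network/network.py; four GPUs are assumed): only the embedder is wrapped in nn.DataParallel, so a global batch of 4 is scattered as 1 item per GPU, while the generator and discriminator would stay on a single device and see the full batch.

```python
import torch
import torch.nn as nn

# Toy stand-in for the Embedder, just to show the wiring.
toy_embedder = nn.Sequential(
    nn.Conv2d(6, 16, kernel_size=3, padding=1),
    nn.AdaptiveMaxPool2d(1),
    nn.Flatten(),
)

embedder = nn.DataParallel(toy_embedder, device_ids=[0, 1, 2, 3]).to("cuda:0")

frames = torch.randn(4, 6, 256, 256, device="cuda:0")  # B*K already collapsed
e_hat = embedder(frames)  # scattered 1-per-GPU, gathered back on cuda:0 as (4, 16)
```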

castelo-software commented 5 years ago

So in short: the meta-training process is solved! Here are the things that finally fixed it:

akshay951228 commented 5 years ago

Hi @MrCaracara, how are the training results?

castelo-software commented 5 years ago

These are some results in the middle of epoch 43, using almost the full dataset (exactly 140000 videos) and batch size of 3.

[image]

akshay951228 commented 5 years ago

Did you try fine-tuning with the latest checkpoint? If not, please try, so that we can compare with the authors' results.

castelo-software commented 5 years ago

@akshay951228 This was done using the feed-forward model, which is not compatible with fine-tuning. Before I can try fine-tuning, I will have to fix LossMCH, but I haven't had the time to look into it. When I do, I can start training with it.

For now I'm just letting my server train on this model while I focus on other matters.

ak9250 commented 5 years ago

@MrCaracara what are your thoughts on this new one-shot face reenactment paper with pretrained models: https://github.com/bj80heyue/One_Shot_Face_Reenactment

castelo-software commented 5 years ago

It looks interesting. I will read it when I have the time to see how they approached the problem, but it does seem like their results are not as good as Samsung AI's.

ak9250 commented 5 years ago

@MrCaracara yeah, the results aren't as good, and they haven't released the landmark extraction model, but they have released the pretrained models.