vincent-thevenin / Realistic-Neural-Talking-Head-Models

My implementation of Few-Shot Adversarial Learning of Realistic Neural Talking Head Models (Egor Zakharov et al.).
GNU General Public License v3.0

Cannot get results as good as the README #54

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi. I really appreciate your implementation.

I tried to run your program, but the result is quite poor. Here is my result.

[Screenshots of my results, 2020-07-27]

I ran train.py for 28 epochs, then embedder_inference.py, and finally finetuning_training.py for 150 epochs. Did I skip some important training phase?

If you have any suggestions about this result, please give me some advice.

(I'm not a native English speaker, so this issue's grammar may be wrong. I apologize for my poor English.)

Thanks, Nekomo

Jarvisss commented 4 years ago

Did you train on the full dataset?

I suggest posting more results here and giving more information about your training; this may be useful for others trying to help you.

I also ran into the problem of not being able to get good results.

My model was trained on the full dev set for ~4 epochs with 3 RTX 2080 Ti GPUs, with K=8. The result is blurry, and I haven't run inference yet.

[image of the blurry result]

ghost commented 4 years ago

@Jarvisss Thanks for your comment.

Did you train on the full dataset?

Yes, I downloaded the full VoxCeleb2 dataset, which contains 18588 items, and I trained my model on the full dataset with 1 RTX 2080 Ti, with K=8.

It seems your batch size was much larger than mine (my batch size was only 2...). I didn't pay attention to that, and it may be the cause. I'm going to raise batch_size and train again.

Thank you.

Nekomo

Jarvisss commented 4 years ago

Yes, I downloaded the full VoxCeleb2 dataset, which contains 18588 items,

For me, I actually got 5994 speakers and 145,569 videos in VoxCeleb2 dev.

It seems your batch size was much larger than mine (my batch size was only 2...)

I use batch size = 6, as I have 3 GPUs with 2 samples per GPU.

The results I showed may be misleading: 'batch' should read 'step'.
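
For illustration, a minimal sketch (not the repo's actual training loop) of how a batch of 6 gets split into 2 samples per GPU across 3 cards with torch.nn.DataParallel:

import torch
import torch.nn as nn

# toy model standing in for the generator; DataParallel splits the batch
# along dim 0, so 6 samples become 2 per GPU on a 3-GPU machine
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
batch = torch.randn(6, 3, 224, 224)
if torch.cuda.device_count() >= 3:
    model = nn.DataParallel(model.cuda(), device_ids=[0, 1, 2])
    batch = batch.cuda()
out = model(batch)
print(out.shape)  # torch.Size([6, 8, 224, 224])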

ghost commented 4 years ago

@Jarvisss Thank you for all your replies. Your suggestions are really helpful.

For me, I actually got 5994 speakers and 145,569 videos in VoxCeleb2 dev.

Sorry, I was mistaken about the dataset size.

18588 is the length of the dataLoader over the already-preprocessed data.

print(len(dataset))#37176
print(len(dataLoader))#18588
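
For context, a quick sketch (with dummy data, not the repo's dataset class) of why len(dataLoader) is exactly half of len(dataset) when batch_size=2:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(37176, 1))   # dummy stand-in for the preprocessed samples
dataLoader = DataLoader(dataset, batch_size=2)

print(len(dataset))     # 37176 samples
print(len(dataLoader))  # 18588 batches (37176 / 2), i.e. steps per epoch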

Actually, my dataset contains

and dev_mp4.zip's md5 checksum doesn't match the official dev md5.

Maybe my VoxCeleb2 dataset is broken... I will also try to build the VoxCeleb2 dataset all over again.
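
If it helps, a small sketch for checking the archive with Python's hashlib (the expected hash below is a placeholder; use the md5 published on the VoxCeleb2 download page):

import hashlib

EXPECTED_MD5 = "<official dev_mp4.zip md5>"  # placeholder, copy from the VoxCeleb2 site

md5 = hashlib.md5()
with open("dev_mp4.zip", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        md5.update(chunk)

print(md5.hexdigest())
print("match" if md5.hexdigest() == EXPECTED_MD5 else "mismatch: re-download or re-concatenate the parts")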

The results I showed may be misleading: 'batch' should read 'step'.

Does 'step' mean an iteration within an epoch? (I'm sorry to be so inquisitive.)

And, if you don't mind, could you give me your pretrained model? Of course, I will not use it in my research without your permission.

Thanks, Nekomo

Jarvisss commented 4 years ago

@Nekomo

Does 'step' mean an iteration within an epoch?

Yes, it does.

And, if you don't mind, could you give me your pretrained model?

Of course. How can I give you my model?

ghost commented 4 years ago

@Jarvisss

Of course. How can I give you my model?

Thanks a lot. Please upload your model to cloud storage and share its URL, like this implementation's pretrained model. That method works best for me.

Nekomo

Jarvisss commented 4 years ago

@Nekomo Hi, sorry for the late reply. I trained my model for another 10 epochs and got results comparable to @vincent-thevenin's.

The result can be seen here.

But when I ran embedder_inference.py and finetuning_training.py, I also got ugly results.

And if I do not finetune the model but just feed forward directly, I get a result like this: [image]

There must be something wrong; I'm still debugging.

I also wonder what your result looks like without finetuning; could you please share it?

Jarvisss

Jarvisss commented 4 years ago

I figured it out: the network is trained on 224 × 224, but the code in embedder_inference.py and video_inference.py crops the input to 256 × 256, which in my case causes the ugly result.

And if I do not finetune the model but just feed forward directly, I get a result like this: [image]

The issue above was caused by my own mistake: I forgot to set finetuning=False for feed-forward prediction.

Here's my feed-forward result: [image]. I will try to finetune the model and comment later.

Update: finetuned the model for 40 epochs.

mingkaihu commented 4 years ago

Hi Jarvisss, thanks for sharing. Your result looks stunning. I was wondering if you could share your steps for reference? Regards, Mingkai

Jarvisss commented 4 years ago

@mingkaihu
Hi Mingkai, the steps for me are as follows:

  1. run preprocess.py to get a lighter dataset
  2. run train.py
  3. run embedder_inference.py to get the embedding vector e_hat
  4. run finetuning_training.py to finetune the model, using the e_hat obtained in step 3
  5. run video_inference.py to get the result

You may first skip step 4 to see if the result is reasonable. If it is, then do the finetuning and step 5 again to see if the result improves.
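
For convenience, a minimal runner sketch of the steps above (script names are from this repo; command-line arguments and paths are omitted and will need to be adapted to your setup):

import subprocess

steps = [
    "preprocess.py",           # 1. build a lighter dataset
    "train.py",                # 2. meta-train on VoxCeleb2
    "embedder_inference.py",   # 3. compute the embedding vector e_hat
    "finetuning_training.py",  # 4. finetune with e_hat (optional on a first pass)
    "video_inference.py",      # 5. generate the output video
]
for script in steps:
    subprocess.run(["python", script], check=True)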

Good luck, Jarvisss

ghost commented 4 years ago

@Jarvisss I'm sorry for the late reply, and thank you for your suggestion.

I also wonder what your result looks like without finetuning; could you please share it?

I ran embedder_inference.py and finetuning_training.py, so I never tried running the model without finetuning. I will try that as well.

I figured it out: the network is trained on 224 × 224, but the code in embedder_inference.py and video_inference.py crops the input to 256 × 256, which in my case causes the ugly result.

Got it. I'll modify my local code too.

Thank you, Nekomo

mingkaihu commented 4 years ago

@mingkaihu Hi Mingkai, the steps for me are as follows:

  1. run preprocess.py to get a lighter dataset
  2. run train.py
  3. run embedder_inference.py to get the embedding vector e_hat
  4. run finetuning_training.py to finetune the model, using the e_hat obtained in step 3
  5. run video_inference.py to get the result

You may first skip step 4 to see if the result is reasonable. If it is, then do the finetuning and step 5 again to see if the result improves.

Good luck, Jarvisss

Thanks a lot for your feedback, Jarvisss. Regards, Mingkai

tengshaofeng commented 4 years ago

@Jarvisss Hi, I really appreciate your great work. Where is your newest code? It is "https://github.com/Jarvisss/Realistic-Neural-Talking-Head-Models", right? I know you changed embedder_inference.py from 256 to 224:

I figured it out: the network is trained on 224 × 224, but the code in embedder_inference.py and video_inference.py crops the input to 256 × 256, which in my case causes the ugly result.

But when I read the code in "https://github.com/Jarvisss/Realistic-Neural-Talking-Head-Models", it is still 256. Could you share your code with me? Thanks so much.

Jarvisss commented 4 years ago

@tengshaofeng Yes, you are right. https://github.com/Jarvisss/Realistic-Neural-Talking-Head-Models is my implementation with a few changes to the original code.

the network is trained on 224 × 224

The network is actually trained on 256 × 256, where the real input is 224 × 224 and zero-padded to 256 × 256.

So the network still takes 256 × 256 as input, but I crop the image to 224 × 224 and pad it to 256 × 256 just like in training, instead of cropping images to 256 × 256 without padding during testing.
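
For anyone unsure what that means in code, a minimal sketch (illustrative only, not the repo's exact preprocessing): crop the face to 224 × 224, then zero-pad it to 256 × 256 so test-time inputs match what the generator saw during training:

import torch
import torch.nn.functional as F

face = torch.rand(1, 3, 224, 224)                 # cropped face, 224 x 224
padded = F.pad(face, (16, 16, 16, 16), value=0)   # 16 px of zeros on each side -> 256 x 256
print(padded.shape)                               # torch.Size([1, 3, 256, 256])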

tengshaofeng commented 4 years ago

@Jarvisss, sorry, I am confused now. Should I change it from 256 to 224 in embedder_inference.py, finetuning_training.py, and webcam_inference.py? What is the difference between the code in the master branch and yours?

ghost commented 4 years ago

@Jarvisss Hello, thanks for your advice and your forked branch. I reproduced a result like yours.

[Screenshot of the reproduced result, 2020-09-23]

Honestly, I still haven't understood why I got the ugly result, so I will diff your repo against this one and try to understand why. Since other developers are still discussing, I will keep this issue open.

I really appreciate your support.

Thanks, Nenoko (Nekomo)

Jarvisss commented 4 years ago

@Jarvisss, sorry, I am confused now. Should I change it from 256 to 224 in embedder_inference.py, finetuning_training.py, and webcam_inference.py? What is the difference between the code in the master branch and yours?

@tengshaofeng Sorry for the late reply. The code in my forked version (https://github.com/Jarvisss/Realistic-Neural-Talking-Head-Models/commit/da309304f47254917c60cb2b6932429eb12f7ec4) was created for the purpose of a PR, and the crop code was not added to that commit.

By the way, what you should do is crop the images to (224, 224) in webcam_demo/webcam_extraction_conversion.py, in the function generate_landmarks, like this:

# inside generate_landmarks(): `input` is the current frame and `preds` are
# the facial landmarks predicted for it
if input.shape[0] == input.shape[1] and input.shape[0] == 224:
    pass  # already a square 224 x 224 crop, nothing to do
else:
    # crop the frame around the face and resize it to 224 x 224 ...
    input = crop_and_reshape_img(input, preds, pad=pad, out_shape=224)
    # ... and map the landmark coordinates into the same 224 x 224 frame
    preds = crop_and_reshape_preds(preds, pad=pad, out_shape=224)

to make it consistent with the training data.

yours, jarvisss

Jarvisss commented 4 years ago

@Nenoko You may have a look at another issue; some ugly results may come from a very different landmark shape between the driving and source faces: https://github.com/vincent-thevenin/Realistic-Neural-Talking-Head-Models/issues/12#issuecomment-685219937
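
As a rough illustration (not code from the repo or the linked issue), one hypothetical way to quantify how different the driving landmarks are from the source landmarks:

import numpy as np

def landmark_mismatch(src, drv):
    """src, drv: (68, 2) arrays of facial landmark coordinates."""
    def normalize(p):
        p = p - p.mean(axis=0)          # remove translation
        return p / np.linalg.norm(p)    # remove scale
    return float(np.linalg.norm(normalize(src) - normalize(drv)))

# larger values mean the driving face shape is further from the source identity,
# which tends to give more distorted results without finetuning
src = np.random.rand(68, 2)
drv = np.random.rand(68, 2)
print(landmark_mismatch(src, drv))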

lastapple commented 4 years ago

@Jarvisss Hi, can you upload your trained checkpoints to Google Drive to share with us? Thanks!

tengshaofeng commented 3 years ago

@Jarvisss thanks for your reply.

  1. I put a new image into examples/fine_tuning/test_images and ran embedder_inference.py to get e_hat_images.tar.
  2. Then I ran finetuning_training.py with the new image and e_hat_images.tar to get finetuned_model.tar. The total number of epochs is 40.
  3. Finally, I extracted the landmark images from examples/fine_tuning/test_video.mp4 to generate the fake face, given e_hat_images.tar and finetuned_model.tar.

Is that right?

I got a result like the following: [result image], and the given new image is the following: [input image].

I do not think the result is good. Can you give me some advice?

Jarvisss commented 3 years ago

@tengshaofeng

What's the problem with your result?

tengshaofeng commented 3 years ago

@Jarvisss Can you see my shared images? Do you think there are mistakes in my steps?

Jarvisss commented 3 years ago

@Jarvisss Can you see my shared images? Do you think there are mistakes in my steps?

I see your result, but I don't understand what the problem is from the images you provided.

The steps are: first embed the image to a code, then finetune, and then run inference, as the author suggests in the README. Can you share the landmarks used for inference?

tengshaofeng commented 3 years ago

@Jarvisss Can you see my shared images? Do you think there are mistakes in my steps?

I see your result, but I don't understand what the problem is from the images you provided.

The steps are: first embed the image to a code, then finetune, and then run inference, as the author suggests in the README. Can you share the landmarks used for inference?

The landmarks are in the middle of the image: [image]

ganqi91 commented 3 years ago

@Jarvisss Hi, your result is so cool, but I want to know: if you do not finetune, what does the result look like? Could you share some of your results?

And could you share your pre-trained model weights?