sstzal / DFRF

[ECCV2022] The implementation for "Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis".
MIT License

There is something wrong when I rendering a new video with a new audio. #25

Closed: boyaom closed this issue 1 year ago

boyaom commented 1 year ago

Hi, this is amazing work, but I didn't get a good result when using the code. I'd like to render a new video with cnn_50000_head.tar and an audio clip of about 6 seconds cut from cnn.mp4. I preprocessed cnn.mp4 to get torso_bcs, poses, and so on, and then used extract_ds_features.py to extract features from the new audio. After that, I ran render.sh, having changed some values such as near, far, names, iters, and datasets. My first problem was the total number of frames: it appears to depend on the pose shape rather than on the input audio. Once I got the right number of frames, the head and torso were out of sync, and the lips and voice were out of sync too. In addition, the head shakes from frame to frame. Is there something wrong with my procedure, or is there another cause?

https://user-images.githubusercontent.com/90528002/218987203-0620a32c-5a33-41d6-a2a2-90e0e7177539.mp4
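To illustrate the frame-count mismatch described above, here is a minimal sketch rather than anything taken from the DFRF code. The helpers `expected_frame_count` and `align_frames` are hypothetical, and the 25 fps rate with one audio feature per video frame is an assumption. The idea is to compare how many frames the audio clip should produce with how many poses are available, and to truncate both to the shorter length before rendering.

```python
def expected_frame_count(duration_s, fps=25):
    """Number of video frames for a clip of the given length (fps is an assumption)."""
    return int(round(duration_s * fps))


def align_frames(poses, aud_feats):
    """Truncate poses and audio features to their common length.

    Both arguments can be lists or NumPy arrays indexed by frame.
    """
    n = min(len(poses), len(aud_feats))
    if n == 0:
        raise ValueError("no frames to render: poses or audio features are empty")
    return poses[:n], aud_feats[:n]


if __name__ == "__main__":
    # Hypothetical numbers: 200 poses from the source video, 6 s of new audio.
    poses = list(range(200))
    aud_feats = list(range(expected_frame_count(6.0)))
    poses, aud_feats = align_frames(poses, aud_feats)
    print(len(poses), len(aud_feats))
```

If the two lengths differ a lot, that is a hint that the poses and the audio features come from different clips, which would also explain sync problems.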

cucdengjunli commented 1 year ago

+1

cucdengjunli commented 1 year ago

Same question. It seems to be a bug in this code.

silvia2021 commented 1 year ago

Same here. How can this be fixed?

sstzal commented 1 year ago

It seems that you are using an audio feature whose value is empty. Is your "aud_file == 'aud.npy'"? For cross-audio driving, please don't use an audio file named 'aud.npy'. Please check line 57 in 'DFRF/NeRFs/load_audface_multiid.py' to find the reason.
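For anyone hitting the same symptom, a quick sanity check on the feature file before rendering can help. The sketch below is not taken from the DFRF code: `check_audio_features` is a hypothetical helper, and it only assumes that the features are stored as a NumPy `.npy` array. It flags the file name mentioned above and an empty or all-zero array.

```python
import os

import numpy as np


def check_audio_features(path, reserved_name="aud.npy"):
    """Return a list of problems found with an audio feature file.

    Hypothetical helper for illustration; an empty list means no problem
    was detected by these simple checks.
    """
    problems = []
    # The maintainer advises against this name for cross-audio driving.
    if os.path.basename(path) == reserved_name:
        problems.append("file is named '%s'; rename it for cross-audio driving" % reserved_name)
    feats = np.load(path)
    if feats.size == 0:
        problems.append("feature array is empty")
    elif not np.any(feats):
        problems.append("feature array contains only zeros")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_aud.py path/to/features.npy
    for problem in check_audio_features(sys.argv[1]):
        print("warning:", problem)
```

An empty result does not guarantee that the features are correct; it only rules out the two cases discussed in this thread.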

boyaom commented 1 year ago

Thanks, you are right. I tried it again and got a better result.

Suvi-dha commented 10 months ago

Hi @boyaom, could you please help me understand why my render outputs look like this image?

They look like 3DMM renders rather than realistic videos with the changed audio. Could you please share your inference method and let me know where I am going wrong? I am using cnn_500000_head.tar as the pretrained model and changed the audio path to the new audio.

Thanks.

boyaom commented 10 months ago

Hi. Based on your description, it seems that you are using the pretrained cnn model to learn the cnn2_25fps video, which may be the cause of the problem. Perhaps you should train for longer or not use this pretrained model.

Suvi-dha commented 10 months ago

Thank you! @boyaom