Hi. Your result is not incorrect.
We did not use the renderer of SadTalker but instead used the audio2exp module. Therefore, the image quality is not bound by SadTalker but rather by StyleNeRF.
Additionally, while SadTalker demonstrates good performance as you mentioned, some of its improvement comes from the use of super-resolution in post-processing.
Our work demonstrates superiority in 3D spatiality; however, it shows some limitations as it was not originally designed as a pipeline for animation. Our contribution is that we demonstrated the audio2face task in a NeRF-based 3D-aware feature space. For more details, please refer to our paper.
[Additional movements] In this work, we did not focus on eye blinking. By modifying the pose (camera) parameters, you can vary the movements, as in the samples on our project page.
Thanks.
@rlgnswk
Thanks for your response.
Could you please provide a sample inference command and sample data for the following:
https://rlgnswk.github.io/NeRFFaceSpeech_ProjectPage/
I have identified a few issues in the code; I will raise a PR to fix them and will create a video guide on how to install on Windows (it took me several hours). I tested the two inference commands on Windows 11 with an RTX 4060 Laptop GPU (8 GB VRAM) / 16 GB RAM (8 GB shared).
The VRAM requirement for this tool is quite low, and the inference speed is fast.
We’ve updated the pose-conditioned code you mentioned for 2 and 3 (1 is a variation of 2).
We uploaded it as quickly as possible, but keep in mind there may be some errors in the code.
By the way, are you planning to create a web UI? Also, what did you mean by your last comment? What is it for?
Thanks
Thank you.
I would request some sample data for motion_guide_img_folder that I can refer to and use to create my own, if required.
By the way, are you planning to create a web UI?
I am not a developer, just a YouTuber who creates videos on new AI tools, covering installation on Windows. I have, however, created Gradio-based UIs for various tools using ChatGPT and Claude. I will try to build one for this tool too once I am familiar with all the commands; a rough sketch of what I have in mind is below.
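Something like this minimal Gradio wrapper, assuming the pose-conditioned script and flags posted later in this thread; the output-file lookup at the end is my own guess, since I have not yet checked what the script actually writes into --outdir:

```python
# Minimal sketch of a Gradio UI wrapping the inference script from this thread.
# Assumptions: script path and flags as posted in this issue; the script is
# assumed to write an .mp4 into --outdir (not verified against the code).
import glob
import subprocess
import gradio as gr

def run_inference(audio_path, image_path):
    outdir = "out_gradio"  # hypothetical output directory
    subprocess.run(
        [
            "python", "StyleNeRF/main_NeRFFaceSpeech_audio_driven_w_given_poses.py",
            f"--outdir={outdir}", "--trunc=0.7",
            "--network=pretrained_networks/ffhq_1024.pkl",
            f"--test_data={audio_path}",
            f"--test_img={image_path}",
            "--motion_guide_img_folder=driving_frames",
        ],
        check=True,
    )
    # Return the newest video found in the output directory, if any.
    videos = sorted(glob.glob(f"{outdir}/*.mp4"))
    return videos[-1] if videos else None

demo = gr.Interface(
    fn=run_inference,
    inputs=[gr.Audio(type="filepath"), gr.Image(type="filepath")],
    outputs=gr.Video(),
    title="NeRFFaceSpeech",
)
demo.launch()
```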
What did you mean by your last comment? What is it for?
I tested the two inference commands on Windows 11 with an RTX 4060 Laptop GPU (8 GB VRAM) / 16 GB RAM (8 GB shared).
The VRAM requirement for this tool is quite low, and the inference speed is fast.
If you are asking about this comment, I just meant that this tool works on low-VRAM systems and inference is very fast. I also described the test environment where I ran this tool successfully.
Thanks for the reply. You're a tech YouTuber, that's awesome!!
I've completed the update related to this, please check it.
I believe the direction of this research is promising, but it is not yet at a stage where it can be applied practically. The SadTalker series you mentioned, trained on existing video datasets, is more practical for now. Our work has limitations due to the image dataset. It has its strengths, but there are certainly areas that need improvement, especially from the perspective of the general user.
Thank you for showing interest in my research.
Thank you. I became a little more tech-aware while making these AI tools work on Windows. It's very, very difficult for a non-technical person. :)
I am unsure how to use this command; what should be specified for motion_guide_img_folder? Do you have some motion frames that you used (for the project page) that I could try? I checked the test_data folder, but there is nothing there for motion.
python StyleNeRF/main_NeRFFaceSpeech_audio_driven_w_given_poses.py \
--outdir=out_test_given_pose --trunc=0.7 \
--network=pretrained_networks/ffhq_1024.pkl \
--test_data="test_data/test_audio/AdamSchiff_0.wav" \
--test_img="test_data/test_img/AustinScott0_0_cropped.jpg" \
--motion_guide_img_folder="driving_frames"
You can find the "driving_frames" folder in the working directory, not inside test_data.
Those frames are used by the face detector to extract poses. If you want to build your own folder, the sketch below may help.
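A minimal sketch for preparing a driving_frames folder by dumping frames from any talking-head video with OpenCV; the input video filename is just a placeholder:

```python
# Dump every frame of a driving video into driving_frames/ as numbered PNGs.
# Assumption: the face detector only needs plain image frames in this folder.
import os
import cv2

video_path = "my_driving_video.mp4"  # placeholder: any talking-head clip
out_dir = "driving_frames"
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.png"), frame)
    idx += 1
cap.release()
print(f"Wrote {idx} frames to {out_dir}/")
```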
Oh ok. Thank you, I somehow missed the folder.
Will test now.
I installed it and generated output using the commands mentioned in the README. I have not modified any options. Why is the output so bad? I see that you use SadTalker, but SadTalker produces very good output.
No eye blinks, no head movement, etc.
Command (Generated from Latent Space)
https://github.com/user-attachments/assets/331b547a-4d5a-4076-984b-73d1325ee49b
Command (Generated from Real Image)
https://github.com/user-attachments/assets/ada75bd7-edbf-42a1-9725-01b4e679ee0d