tencent-ailab / V-Express

V-Express aims to generate a talking head video under the control of a reference image, an audio, and a sequence of V-Kps images.

how to keep source and export video the same (rather than using image and crop) #9

Open cantonalex opened 6 months ago

cantonalex commented 6 months ago

on a 4 second video?

tiankuan93 commented 6 months ago

on a 4 second video?

It is normal!

Inference currently does take a long time. You can try using fewer sampling steps via num_inference_steps, which reduces the time linearly. We recommend 20-30 steps.
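For illustration, reducing the sampling steps on the standard inference command (mirroring the example command given later in this thread) would look roughly like this; the paths are placeholders, and the exact flag spelling should be verified against the repo:

```shell
# Sketch: reduce sampling steps to speed up inference.
# Paths are placeholders; only num_inference_steps is named in the comment
# above, so check `python inference.py --help` for the exact flag spelling.
python inference.py \
    --reference_image_path "./test_samples/A.jpg" \
    --audio_path "./test_samples/aud.mp3" \
    --output_path "./output/A_fast.mp4" \
    --retarget_strategy "fix_face" \
    --num_inference_steps 25
```

Since time scales linearly with steps, 25 steps should take roughly half the time of the default 50 used by many diffusion pipelines.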

cantonalex commented 6 months ago

Thanks. One other question, @tiankuan93: to add new audio/lips to an existing video, are these the correct args?

python scripts/extract_kps_sequence_and_audio.py \
    --video_path "./destinationVideo.mp4" \
    --kps_sequence_save_path "./kpsOfDestinationVideo.pth" \
    --audio_save_path "./audioToReplaceLips.mp3"

like the first video on this comment https://github.com/tencent-ailab/V-Express/issues/6#issue-2320395941

tiankuan93 commented 6 months ago

Thanks. One other question, @tiankuan93: to add new audio/lips to an existing video, are these the correct args?

python scripts/extract_kps_sequence_and_audio.py --video_path "./destinationVideo.mp4" --kps_sequence_save_path "./kpsOfDestinationVideo.pth" --audio_save_path "./audioToReplaceLips.mp3"

like the first video on this comment #6 (comment)

You're right. You can then use the .mp3 and .pth files as inputs for video generation. Note that you need to pass the --retarget_strategy "naive_retarget" parameter when generating. If the result is not satisfactory, consider choosing a target video whose pose is closer to that of the reference image.
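Putting the extracted files together, the generation step would look roughly like the sketch below. Only --retarget_strategy, --reference_image_path, --audio_path, and --output_path are confirmed elsewhere in this thread; the --kps_path flag name for passing the .pth sequence is an assumption:

```shell
# Sketch of the generation step using the extracted .pth and .mp3 files.
# --kps_path is an ASSUMED flag name for the keypoint sequence;
# check `python inference.py --help` for the real spelling.
python inference.py \
    --reference_image_path "./referenceImage.jpg" \
    --audio_path "./audioToReplaceLips.mp3" \
    --kps_path "./kpsOfDestinationVideo.pth" \
    --output_path "./output/retargeted.mp4" \
    --retarget_strategy "naive_retarget"
```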

cantonalex commented 6 months ago

Is a reference image always required? Can't you just take a video and apply new audio to it?

tiankuan93 commented 6 months ago

Is a reference image always required? Can't you just take a video and apply new audio to it?

  1. Currently we need a more frontal image as a reference for generating a video.
  2. The scenario you describe is also possible with a combination of audio, reference image, and target pose. Maybe we can implement some scripts to make the required parameters simpler.

cantonalex commented 6 months ago

That would be great. The README is great, it's just sometimes a bit confusing.

I'm just a bit confused how @zhanghongyong123456 did this first video: https://github.com/tencent-ailab/V-Express/issues/6#issue-2320395941

I'm assuming he used a reference image from the same target video, but he mentions no retarget strategy.

EDIT: I think I confused that with his source video :( I thought he got an amazing result lol

tiankuan93 commented 6 months ago

I didn't quite get what you meant. If you want to make an image talk but only have an audio file (.mp3), you can use the following script.

python inference.py \
    --reference_image_path "./test_samples/A.jpg" \
    --audio_path "./test_samples/aud.mp3" \
    --output_path "./output/short_case/A_fix_face_with_aud.mp4" \
    --retarget_strategy "fix_face" \
    --reference_attention_weight 0.95 \
    --audio_attention_weight 3.0

cantonalex commented 6 months ago

No, I purely want to make an existing video talk with new audio, without the crop in the export.

There doesn't seem to be a natural lip-replacement strategy for an existing video.

jasonisme123 commented 6 months ago

Is a reference image always required? Can't you just take a video and apply new audio to it?

maybe you need wav2lip?

cantonalex commented 6 months ago

Wav2Lip's quality is lesser than that of this project.

faraday commented 5 months ago

@cantonalex I couldn't get what you mean in this issue either. You want to provide the video (your source video). Normally, the extract kps sequence script uses the source video only to extract the kps sequence and the audio (for practical reasons). You then give a reference image and it talks as in the video (hopefully).

What you want is for the output to be the source video itself, not a generated video driven by it.

Otherwise, the extract kps sequence script can be modified to select the best representing reference frame (according to your criteria for quality).
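The frame-selection idea in that last sentence can be sketched in plain Python: score each frame's keypoints for left-right symmetry and pick the most frontal one. This is an illustrative heuristic, not code from the repo, and it assumes keypoints arrive as (x, y) pairs per frame:

```python
def most_frontal_frame(kps_sequence):
    """Return the index of the frame whose keypoints look most frontal.

    kps_sequence: list of frames, each a list of (x, y) keypoints.
    Heuristic (not V-Express code): a frontal face is roughly left-right
    symmetric, so mirror each frame's x-coordinates around their mean
    and measure how far the mirrored set drifts from the original.
    """
    def asymmetry(frame):
        xs = sorted(x for x, _ in frame)
        center = sum(xs) / len(xs)
        mirrored = sorted(2 * center - x for x in xs)
        return sum(abs(a - b) for a, b in zip(xs, mirrored))

    scores = [asymmetry(frame) for frame in kps_sequence]
    return scores.index(min(scores))


# A symmetric (frontal-ish) frame scores 0; a skewed one scores higher.
frontal = [(-2, 0), (-1, 0), (0, 0), (1, 0), (2, 0)]
profile = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(most_frontal_frame([profile, frontal]))  # prints 1 (the frontal frame)
```

Real V-Kps data would need loading from the saved .pth file first, and a production version might weight nose/eye points more heavily than jawline points.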

cantonalex commented 5 months ago

@faraday Simply put: I only want to change the lips on a video. So the input and output video are the same; the only difference is the lip movements.