tencent-ailab / V-Express

V-Express aims to generate a talking head video under the control of a reference image, an audio, and a sequence of V-Kps images.

fix: can use other size #26

Open steven850 opened 1 month ago

steven850 commented 1 month ago

You added this fix to be able to use other sizes. What sizes will the system accept? I can't seem to go above 640. Anything above 640 (for example 768) gives this error: "AssertionError: There are 0 faces in the 0-th frame. Only one face is supported."

tiankuan93 commented 1 month ago

You can try the following script. This is only a simple example; you will get better results with a clearer picture and video of your own.

python scripts/extract_kps_sequence_and_audio.py \
    --video_path "./test_samples/short_case/AOC/gt.mp4" \
    --kps_sequence_save_path "./test_samples/short_case/AOC/kps_768.pth" \
    --audio_save_path "./test_samples/short_case/AOC/aud.mp3" \
    --height 768 \
    --width 768

python inference.py \
    --reference_image_path "./test_samples/short_case/AOC/ref_768.png" \
    --audio_path "./test_samples/short_case/AOC/aud.mp3" \
    --kps_path "./test_samples/short_case/AOC/kps_768.pth" \
    --output_path "./output/short_case/talk_AOC_no_retarget_768.mp4" \
    --retarget_strategy "no_retarget" \
    --num_inference_steps 20 \
    --image_width 768 \
    --image_height 768

ref_768.png


steven850 commented 1 month ago

That is what I ran, and that is what gives me the error if I go past 640 on the resolution.

Traceback (most recent call last):
  File "Z:\vex\scripts\extract_kps_sequence_and_audio.py", line 38, in
    assert len(faces) == 1, f'There are {len(faces)} faces in the {frame_idx}-th frame. Only one face is supported.'
AssertionError: There are 0 faces in the 0-th frame. Only one face is supported.

FurkanGozukara commented 1 month ago

I generated a video at 768x768 and it worked for me.

But my video has only one face.

https://github.com/tencent-ailab/V-Express/issues/27

steven850 commented 4 weeks ago

I have tried hundreds of combinations now. 640 is the maximum it will do; above that it no longer detects faces. I'm using the included video, short_case/TYS, which has a single face. I made some upscaled versions of the video: 1024, 1000, 768, 640, etc.

python scripts/extract_kps_sequence_and_audio.py \
    --video_path "./test_samples/short_case/tys/gt768.mp4" \
    --kps_sequence_save_path "./test_samples/short_case/tys/kps768.pth" \
    --audio_save_path "./test_samples/short_case/tys/gt.mp3" \
    --height 768 \
    --width 768

I also tried selecting some of the other video sizes, such as 512, while specifying 768 to see what would happen. Same error.

If I select the 1024 video but don't ask for a higher resolution, it works. So the problem is tied to the requested resolution, not to the size of the input video.

FurkanGozukara commented 4 weeks ago

Here is a 768x768 video I made, not upscaled:

https://github.com/tencent-ailab/V-Express/assets/19240467/8128c68c-88af-4f81-b10b-847333faaa30

KMiNT21 commented 3 weeks ago

The same issue happened to me. I just debugged this for hours. :) What I found is that when we use a 1024x1024 image inside retinaface.py, the face detection model gets BAD RESULTS, with confidence less than the threshold of 0.5. So, some inputs can work, but some do not.

Why? I haven't had time to figure it out yet. But it looks like the problem is with incorrectly using the model_ckpts\insightface_models\models\buffalo_l\det_10g.onnx model, which is 640x640. And if we pass an image size larger than that, some incorrect resizing happens.

FurkanGozukara commented 3 weeks ago

Ah, that makes sense.

My input video was also 512x512.

KMiNT21 commented 3 weeks ago

So....

If we don't want to face "zero face" problems, there's an easy fix:

Change

    app.prepare(ctx_id=0, det_size=(args.image_height, args.image_width))

to

    app.prepare(ctx_id=0, det_size=(min(args.image_height, 640), min(args.image_width, 640)))

(and make the same change inside extract_kps...py).
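For anyone applying this in more than one script, the clamp can be pulled into a small helper. This is only a sketch: `clamp_det_size` and its `max_side` parameter are hypothetical names, not part of V-Express or insightface, and the 640 limit is the value observed in this thread rather than a documented constant.

```python
def clamp_det_size(height, width, max_side=640):
    """Cap each side of the detector input size at max_side.

    The (height, width) order mirrors the det_size tuple used in the
    app.prepare(...) call quoted in this thread. max_side=640 is an
    assumption based on the behaviour reported here.
    """
    return (min(height, max_side), min(width, max_side))


# Intended use (assumption, not run here because it needs the model files):
#   app.prepare(ctx_id=0, det_size=clamp_det_size(args.image_height, args.image_width))
print(clamp_det_size(768, 768))
print(clamp_det_size(512, 512))
```

Only the detector input is capped; the `--image_height` and `--image_width` values passed to inference stay as requested.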

However, there's not much use for it. Image sizes larger than 768 just produce garbage results. As I can see from the V-Express paper at https://arxiv.org/pdf/2406.02511, the model was trained on 512x512 resolution. And, as far as I can understand, it uses the Stable Diffusion 1.5 model, which is also 512x512 (or 768?).

BUT! We have an extremely time-consuming way to increase quality by processing Video to Video with SUPIR-V0Q. It looks interesting.

And... there might be another way to test – use SUPIR for every Nth frame (dropping some frames) and perform frame interpolation. This way, the animation could be more accurate. But it's just an idea.
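To make the every-Nth-frame idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: `enhance` stands in for an expensive upscaler such as SUPIR, frames are plain lists of floats instead of real images, and the in-between frames are filled with a naive linear cross-fade, which is not a real frame interpolation model.

```python
def keyframe_indices(num_frames, step):
    """Indices of frames to enhance: every step-th frame plus the last one."""
    if num_frames <= 0:
        return []
    indices = list(range(0, num_frames, step))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def blend(frame_a, frame_b, t):
    """Linear cross-fade between two frames, with t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)]


def enhance_and_interpolate(frames, step, enhance):
    """Enhance only the keyframes and rebuild the dropped frames by blending."""
    keys = keyframe_indices(len(frames), step)
    enhanced = {i: enhance(frames[i]) for i in keys}
    output = []
    for left, right in zip(keys, keys[1:]):
        for i in range(left, right):
            t = (i - left) / (right - left)
            output.append(blend(enhanced[left], enhanced[right], t))
    if keys:
        output.append(enhanced[keys[-1]])  # the final keyframe closes the clip
    return output


# Toy run: five one-pixel "frames", every 2nd frame enhanced (identity here).
toy_frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(enhance_and_interpolate(toy_frames, 2, lambda frame: frame))
```

With step N, the expensive `enhance` call runs on roughly one frame in N. A plain cross-fade discards the motion in the dropped frames, so a real attempt would need a proper interpolation model in place of `blend`.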