soumik-kanad / diff2lip


Low res result for single video inference (especially for the teeth area). #17

Open loic-combis opened 9 months ago

loic-combis commented 9 months ago

Hello!

I'm having an issue with the single video inference.

Here is the input video (With the target audio):

https://github.com/soumik-kanad/diff2lip/assets/22350513/613697cb-dd06-4bc6-9c34-0b3c127df58a

Here is the output video:

https://github.com/soumik-kanad/diff2lip/assets/22350513/645e0c5d-20c1-43be-8d01-7e0f4a5dfb84

I believe the low resolution could be fixed with a face enhancer; however, the area around the teeth is noticeably worse than in the demo in the README.

Any idea what would cause this?

soumik-kanad commented 9 months ago

I'm not sure what the issue is in this specific case. If I had to guess, it could be either that the input resolution is higher than what the model was trained on, or the odd rasterization in the video.

loic-combis commented 9 months ago

What input resolution gives the best results with the base model?

Can we fine-tune the model on a specific face for better results? I'm wondering if we could smooth the result one way or another, because the lip movement is otherwise pretty good.

soumik-kanad commented 9 months ago

So, we trained on the VoxCeleb2 dataset, which is one of the standard datasets but is not HD quality. The faces are cropped (forehead to chin, cheek to cheek) and used at 128x128 resolution, so videos that are sharper than that might suffer.
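To make the resolution point concrete, here is a rough sketch (not code from the Diff2Lip repository) that estimates how far a 128x128 generated face would have to be stretched to cover the face region in a given video. The helper name `upscale_factor` and the example crop sizes are assumptions for illustration only; the 128x128 figure is the one mentioned above.

```python
def upscale_factor(face_w: int, face_h: int, model_res: int = 128) -> float:
    """Return how much a model_res x model_res output must be stretched
    to cover a face crop of face_w x face_h pixels (hypothetical helper)."""
    if face_w <= 0 or face_h <= 0 or model_res <= 0:
        raise ValueError("sizes must be positive")
    # The longer side of the crop decides how much stretching is needed.
    return max(face_w, face_h) / model_res


if __name__ == "__main__":
    # Hypothetical face-crop sizes in pixels, not measured from any real video.
    examples = {"low-res clip": (120, 125), "HD close-up": (512, 540)}
    for name, (w, h) in examples.items():
        factor = upscale_factor(w, h)
        note = "upsampled, likely soft" if factor > 1 else "no upsampling needed"
        print(f"{name}: x{factor:.2f} ({note})")
```

A factor well above 1 means the generated mouth region is interpolated up to fit the frame, which would be consistent with fine details such as teeth looking blurry. It does not explain cases where a lower-resolution input gives a similar result.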

I guess one could easily fine-tune the model for a specific video (or even use LoRA-style fine-tuning). We have not tried it, but I would guess it could produce sharper features than before. (It would probably be even better to train or fine-tune the model on sharper datasets like AVSpeech or something better.)
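For context on why LoRA-style fine-tuning is attractive for a single video: LoRA keeps a weight matrix W frozen and learns a low-rank update B·A, so only the two small matrices are trained. The sketch below just counts parameters for one linear layer. The layer sizes and rank are hypothetical and are not taken from the Diff2Lip architecture.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when fine-tuning a d_out x d_in matrix directly."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA update B @ A,
    where A is rank x d_in and B is d_out x rank."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return rank * d_in + d_out * rank


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    d_in, d_out, rank = 512, 512, 4
    full = full_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full fine-tune: {full} weights")
    print(f"LoRA (rank {rank}): {lora} weights ({100 * lora / full:.2f}% of full)")
```

With far fewer trainable weights per layer, adapting to one speaker's footage needs less memory and is less likely to overwrite what the base model already learned, although whether it sharpens the teeth region here is untested.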

loic-combis commented 9 months ago

We tried with a lower resolution and the result is similar. How should we proceed to train the model? Can we do it with images of the face only, or should we also include the original audio to map the lip movement?

Deltaidiots commented 2 months ago

Hi, please let me know if you find a fix for this. I'm having a similar issue.

Utk-bot commented 2 months ago

Same issue