liutaocode / DiffDub

DiffDub: Person-generic visual dubbing using inpainting renderer with diffusion auto-encoder
https://liutaocode.github.io/DiffDub/
Apache License 2.0

Reference Image Concatenation #3

Open kradkfl opened 3 weeks ago

kradkfl commented 3 weeks ago

Hi!

Thanks for open-sourcing your code, it's helpful to see a reference implementation. I did have a question though:

In your paper, you don't mention concatenating a "reference image" as an additional input to the stage 1 model, but the code seems to include this. Is it required to achieve results similar to the demos? If so, did you find any benefit to using more than one reference image?

liutaocode commented 3 weeks ago

Yes, your observation is correct. In fact, we also tested the number of reference frames concatenated in this first stage, ranging from 0 to 10. We found that the difference was not significant, for a few key reasons:

  1. Most of the HDTF dataset consists of frontal faces, with few multi-angle shots, so the dataset is not especially challenging.
  2. The 512-dimensional motion latent already includes mouth-related information, so providing more reference frames is not very meaningful.

Since any talking-head model that supports arbitrary speakers will have at least one frame available as a reference, all of our results use a first-stage model with a reference-image count of one (N=1) at inference, to make the most of the available information.
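For concreteness, here is a minimal sketch of how this kind of channel-wise concatenation is typically wired (placeholder module names, not the actual code in this repo), with N=1 as used for our reported results:

```python
# Minimal sketch, NOT the repo's actual implementation: the masked target
# frame and N reference frames are stacked along the channel dimension and
# fed to the first-stage (inpainting) renderer.
import torch
import torch.nn as nn

class Stage1RendererSketch(nn.Module):  # hypothetical name
    def __init__(self, num_refs: int = 1):
        super().__init__()
        in_ch = 3 * (1 + num_refs)             # masked frame + N references
        # A single conv stands in for the real renderer backbone.
        self.net = nn.Conv2d(in_ch, 3, kernel_size=3, padding=1)

    def forward(self, masked_frame, ref_frames):
        # masked_frame: (B, 3, H, W); ref_frames: (B, N, 3, H, W)
        refs = ref_frames.flatten(1, 2)         # -> (B, 3*N, H, W)
        x = torch.cat([masked_frame, refs], dim=1)
        return self.net(x)

# N=1, matching the setting used for all reported inference results.
model = Stage1RendererSketch(num_refs=1)
out = model(torch.randn(2, 3, 256, 256), torch.randn(2, 1, 3, 256, 256))
print(out.shape)  # torch.Size([2, 3, 256, 256])
```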

My suggestions are as follows:

kradkfl commented 3 weeks ago

Thanks! Did you find that, for in-the-wild predictions, the additional reference frames helped even with a model trained solely on HDTF? Or did training on HDTF encourage the model to ignore the reference frames?

liutaocode commented 2 weeks ago

Hello. We haven't conducted experiments beyond HDTF, but I can try to analyze the situation logically.

Intuitively, adding references during the diffusion rendering stage should be effective. The motion latent space we use (512 dimensions) has to encode both motion and color information, so it is hard for the latent alone to reconstruct the masked area perfectly in both respects. In theory, providing a few reference frames should give better reconstruction, since the color information can be taken from the references, allowing the latent space to focus more on motion.
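To make that intuition concrete, here is a purely schematic sketch (hypothetical names, not the DiffDub architecture) of the two conditioning paths: pixel-space references for appearance/color, and the 512-dimensional latent for motion:

```python
# Schematic only (assumed/hypothetical names): two conditioning paths into a
# diffusion renderer's noise-prediction network -- pixel-space references for
# appearance/color, and a 512-d latent intended to carry motion.
import torch
import torch.nn as nn

class DenoiserSketch(nn.Module):
    def __init__(self, latent_dim: int = 512, num_refs: int = 1):
        super().__init__()
        in_ch = 3 * (1 + num_refs)                       # noisy frame + N references
        self.conv_in = nn.Conv2d(in_ch, 64, 3, padding=1)
        self.cond_proj = nn.Linear(latent_dim + 1, 64)   # latent + timestep
        self.conv_out = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, noisy_frame, ref_frames, motion_latent, t):
        # Appearance path: concatenate references with the noisy frame.
        x = torch.cat([noisy_frame, ref_frames.flatten(1, 2)], dim=1)
        h = self.conv_in(x)
        # Motion path: inject the 512-d latent (plus timestep) as a bias,
        # broadcast over spatial positions (FiLM-like conditioning).
        cond = self.cond_proj(torch.cat([motion_latent, t[:, None].float()], dim=1))
        return self.conv_out(h + cond[:, :, None, None])

net = DenoiserSketch()
eps = net(torch.randn(2, 3, 256, 256), torch.randn(2, 1, 3, 256, 256),
          torch.randn(2, 512), torch.randint(0, 1000, (2,)))
```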

The analysis above is supported by the 7th row of Table 1 in reference [1], which shows that a 512-dimensional latent alone is not sufficient to reconstruct an arbitrary facial image. However, when modeling a smaller region, such as the mouth area, the situation may be different:

(1) For in-the-wild datasets, the latent space may struggle to accurately reproduce the area within the mask, since it needs to accommodate facial imagery from any individual. I suggest adding reference-frame concatenation so that the latent space can focus more on motion.

(2) For HDTF, based on our tests with N ranging from 0 to 10 showing no difference, it is likely unnecessary. Given that HDTF features only about 300 people, the diffusion model can probably learn the distribution of these individuals' mouth regions easily, leading to overfitting. This may be exactly what you described: "training with HDTF encourages the model to ignore the reference frames."
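If you want to check whether a trained model is in fact ignoring the references, one simple diagnostic (hypothetical helper, assuming a renderer with a `model(masked_frame, ref_frames)` interface like the sketch above) is to swap the reference for another identity at inference and measure how much the output changes:

```python
# Hypothetical diagnostic, not part of this repo: render the same masked frame
# with the intended reference and with a swapped (different-identity) reference;
# a near-zero difference suggests the model has learned to ignore the references.
import torch

@torch.no_grad()
def reference_sensitivity(model, masked_frame, ref_frame, swapped_ref):
    # masked_frame: (B, 3, H, W); ref_frame / swapped_ref: (B, 3, H, W)
    out_ref = model(masked_frame, ref_frame.unsqueeze(1))     # N=1 reference
    out_swap = model(masked_frame, swapped_ref.unsqueeze(1))  # different identity
    return (out_ref - out_swap).abs().mean().item()           # mean L1 difference
```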

Reference: [1] Preechakul, K., Chatthee, N., Wizadwongsa, S., et al. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022: 10619-10629.