Thank you for your question.
We also encountered similar issues during testing, especially on out-of-distribution data, where the appearance of individuals sometimes changed. This is likely due to insufficient training data. Our model was trained on HDTF, whose training set contains just over 200 identities, which is quite small, especially for diffusion models that benefit from scaling (200+ identities vs. a 1B-parameter model).
I suggest retraining the first stage on a dataset with more identities, such as VoxCeleb2. The goal of the first stage should be to reconstruct the conditioned mouth region as faithfully as possible. Then retrain the second stage on top of it.
Additionally, another possible solution occurred to me: instead of predicting the noise during diffusion training, predict the original image (x0-prediction). On top of this, add a face-recognition loss term, so that the predicted image and the original image stay as identity-consistent as possible. That said, I still believe the dataset size is the more significant issue.
Thanks
Thanks for the wonderful work. I tried both the one-shot and few-shot approaches, but the likeness of the face is a bit off. Any tips to improve likeness?
https://github.com/liutaocode/DiffDub/assets/46858047/60ff1146-e433-44a6-8a87-4a08af0015c5