Hi, thank you very much for sharing your work! I have two questions.
Question 1:
In the paper, you use very little data to build the phoneme-pose dictionary — for example, 8 minutes for Mandarin. For common methods, i.e., typical neural networks, a larger training dataset usually improves performance. But since you use a dictionary here, will quality improve much as the dataset grows? Have you run any tests and reached a conclusion, or can you offer a prediction?
Question 2: I am new to audio-visual problems. In the "Text-Driven Video Generation" part of the "Related Works" section of your paper, it seems few works directly use text as the driving signal. Could you recommend other papers or methods for text-driven talking-head synthesis?