sibozhang / Text2Video

ICASSP 2022: "Text2Video: text-driven talking-head video synthesis with phonetic dictionary".
https://sites.google.com/view/sibozhang/text2video

Will quality improve much when the dataset used to build the dict becomes larger? #4

Closed Liujingxiu23 closed 2 years ago

Liujingxiu23 commented 3 years ago

Hi, thank you very much for your sharing!

I have two questions. Question 1: In the paper, you use very little data to build the phoneme-pose dict, e.g. "8 min" for Mandarin. For common methods, I mean common neural networks, a larger training dataset may improve performance. But since you use a "dict" here, will the quality improve much as the dataset gets larger? Have you run any tests and reached any conclusion, or can you offer a prediction?

Question 2: I am new to the audio-visual field. In the "Text-Driven Video Generation" part of the "Related Works" section of your paper, it seems few works directly use text as the driving signal. Could you recommend any other papers or methods that do text-driven talking-head synthesis?

sibozhang commented 2 years ago
  1. I think a larger dataset will help.
  2. There is not much text-driven work yet.
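
For readers new to the dictionary-based approach discussed in Question 1, a minimal sketch is below. The dictionary contents, phoneme labels, and function names are illustrative assumptions, not the paper's actual code; the real system maps phonemes to mouth/key-pose sequences extracted from recorded video and interpolates between them.

```python
# Hypothetical sketch of a phoneme-to-pose dictionary lookup
# (illustrative data and names, not the Text2Video implementation).

phoneme_pose_dict = {
    # phoneme -> sequence of key poses; each pose is reduced here
    # to a single 2D landmark coordinate for brevity
    "B":  [(0.05, 0.20)],
    "IY": [(0.08, 0.30), (0.09, 0.32), (0.10, 0.33)],
    "AA": [(0.10, 0.35), (0.12, 0.40)],
}

def poses_for_phonemes(phonemes, pose_dict):
    """Concatenate key-pose sequences for a phoneme string.

    Unknown phonemes fall back to an empty sequence here; a real
    system would interpolate between neighboring poses instead.
    """
    sequence = []
    for p in phonemes:
        sequence.extend(pose_dict.get(p, []))
    return sequence

frames = poses_for_phonemes(["B", "IY"], phoneme_pose_dict)
print(len(frames))  # 4 key poses before interpolation
```

This also suggests why more data helps with diminishing returns: once every phoneme (or phoneme-in-context) has a clean entry, extra recordings mainly improve pose variety rather than coverage.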