naver-ai / facetts

Apache License 2.0
45 stars 6 forks source link

The inference result generates only a female voice regardless of the input image. #2

Closed tskim9439 closed 2 months ago

tskim9439 commented 10 months ago

First of all, thank you for conducting the excellent research and for sharing the code.

I followed the instructions in the README, and successfully loaded the pretrained model. To test the model, I have downloaded images of other people like Will Smith, changed the image file name to face2.png, but it seems to keep producing the same voice of a female.

Is there any additional work required to perform TTS conditioned on human face images?

lee-jiyoung commented 10 months ago

Hi, sorry for the late response.

Did you use a face detector? Since we use face images in LRS3 for training, the face in the image should be aligned like LRS3.