met4citizen / TalkingHead

Talking Head (3D): A JavaScript class for real-time lip-sync using Ready Player Me full-body 3D avatars.
MIT License

Using ElevenLabs even though the ElevenLabs API doesn't provide per-frame visemes? #59

Closed · chillbert closed this issue 3 months ago

chillbert commented 3 months ago

How does this work behind the scenes? In another project I used Azure Speech, which provides visemes, but since ElevenLabs doesn't, how do you move the blendshapes for each phoneme? What I could imagine is that ElevenLabs also provides timestamps for each word (or phoneme?), and that you have some mapping from phonemes to mouth positions (shape key values)?

met4citizen commented 3 months ago

The Ready Player Me avatars come with Oculus viseme blendshapes, and the TalkingHead project includes language-specific lip-sync modules that can convert words into sequences of Oculus visemes. For more information about the lip-sync modules, refer to Appendix C in the README. You can also check out ./modules/lipsync-en.mjs as an example.
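Roughly, the English module can be used on its own like this. The class and method names shown here (`LipsyncEn`, `wordsToVisemes`) and the return shape are indicative only; see `./modules/lipsync-en.mjs` for the exact interface:

```javascript
// Illustrative sketch only. Assumes the English lip-sync module exposes a
// class with a wordsToVisemes(text) method; check ./modules/lipsync-en.mjs
// for the exact class name and return shape.
import { LipsyncEn } from "./modules/lipsync-en.mjs";

const lipsync = new LipsyncEn();

// Convert a word into a sequence of Oculus visemes with relative timings.
const result = lipsync.wordsToVisemes("hello");

// Assumed shape: parallel arrays of viseme names, relative start times,
// and durations, e.g. { visemes: [...], times: [...], durations: [...] }
console.log(result.visemes);
```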

And yes, you are right: ElevenLabs provides timestamps, which are used to synchronize these viseme sequences with the audio.
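As a rough sketch of the idea (not the project's actual code), once you know a word's start time and duration in the audio, you can stretch the module's relative viseme timings over that span. The helper name `scheduleVisemes` and the input/output shapes here are hypothetical:

```javascript
// Hedged sketch: map a word's relative viseme timings onto absolute audio
// time, given the word's start time and duration from the TTS timestamps.
function scheduleVisemes(lipsync, word, wordStartMs, wordDurationMs) {
  const v = lipsync.wordsToVisemes(word); // relative times/durations (assumed shape)
  const total = v.times[v.times.length - 1] + v.durations[v.durations.length - 1];
  const scale = wordDurationMs / total;
  return v.visemes.map((viseme, i) => ({
    viseme,                                  // Oculus viseme name, e.g. "PP"
    start: wordStartMs + v.times[i] * scale, // absolute start in ms
    duration: v.durations[i] * scale,        // stretched duration in ms
  }));
}
```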

Note that the TalkingHead class also accepts visemes and viseme timestamps. So, when using the Microsoft Azure Speech SDK, you don't need to rely on the built-in lip-sync modules. Although Azure uses a slightly different viseme standard, mapping its viseme IDs to the Oculus standard, as shown in the project's test app, is straightforward.
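For illustration, such a mapping can be a simple lookup table over Azure's viseme IDs (0–21). The specific assignments below are a commonly used correspondence, not necessarily the exact table in the test app, so treat the individual entries as approximate and consult the test app for the table actually used:

```javascript
// Illustrative Azure viseme ID -> Oculus viseme mapping (IDs 0..21).
// The exact assignments used by TalkingHead are in the project's test app;
// the vowel entries in particular are approximate here.
const azureToOculus = [
  "sil", // 0  silence
  "aa", "aa", "O", "E",  // 1-4  open/mid vowels
  "E", "I", "U", "O",    // 5-8  vowels and glides
  "O", "O", "aa", "kk",  // 9-12 diphthongs, /h/
  "RR", "nn", "SS", "CH",// 13-16 consonants
  "TH", "FF", "DD", "kk",// 17-20 consonants
  "PP",                  // 21  bilabials (p, b, m)
];

// Convert an Azure viseme event's ID to an Oculus viseme name before
// passing it (with its audio offset) to the TalkingHead class.
function azureVisemeToOculus(id) {
  return azureToOculus[id] ?? "sil";
}
```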