The Ready Player Me avatars come with Oculus viseme blendshapes, and the TalkingHead project includes language-specific lip-sync modules that can convert words into sequences of Oculus visemes. For more information about the lip-sync modules, refer to Appendix C in the README, or take a look at ./modules/lipsync-en.mjs as an example.
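To give a rough idea of what those modules do, here is a deliberately simplified sketch — not the real lipsync-en.mjs, which uses a full set of English pronunciation rules — that just maps individual letters to Oculus viseme names:

```js
// Simplified, illustrative word-to-viseme conversion (NOT the real module).
// Oculus viseme names: sil, aa, E, I, O, U, PP, FF, TH, DD, kk, CH, SS, nn, RR.
const letterToViseme = {
  a: "aa", e: "E", i: "I", o: "O", u: "U", y: "I", w: "U",
  b: "PP", p: "PP", m: "PP",
  f: "FF", v: "FF",
  d: "DD", t: "DD", n: "nn", l: "nn",
  c: "kk", k: "kk", g: "kk", q: "kk", h: "kk",
  s: "SS", z: "SS", x: "SS",
  j: "CH", r: "RR"
};

// Convert one word into a sequence of Oculus visemes.
export function wordToVisemes(word) {
  const visemes = [];
  for (const ch of word.toLowerCase()) {
    const v = letterToViseme[ch];
    if (v) visemes.push(v);
  }
  return visemes;
}

// wordToVisemes("hello") -> ["kk", "E", "nn", "nn", "O"]
```

The real modules are far more careful (digraphs, silent letters, language-specific rules), but the end result is the same kind of viseme sequence.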
And yes, you are right: ElevenLabs provides timestamps, which are used to synchronize these viseme sequences with the audio.
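To make that synchronization step concrete, here is a hedged sketch. It assumes you have already grouped the timestamps into words with start/end times in milliseconds (ElevenLabs returns character-level alignment that can be grouped into words), and it simply spreads each word's viseme sequence evenly over the word's duration. The wordToVisemes helper and the module path come from the simplified sketch above, not from the actual project code:

```js
// Hypothetical import of the simplified helper sketched above.
import { wordToVisemes } from "./word-to-visemes.mjs";

// timedWords: [{ word: "hello", start: 0, end: 400 }, ...] (times in ms).
export function alignVisemes(timedWords) {
  const visemes = [];    // Oculus viseme names
  const vtimes = [];     // viseme start times (ms)
  const vdurations = []; // viseme durations (ms)
  for (const { word, start, end } of timedWords) {
    const seq = wordToVisemes(word);
    if (!seq.length) continue;
    const step = (end - start) / seq.length; // split the word evenly
    seq.forEach((v, i) => {
      visemes.push(v);
      vtimes.push(start + i * step);
      vdurations.push(step);
    });
  }
  return { visemes, vtimes, vdurations };
}

// alignVisemes([{ word: "hello", start: 0, end: 400 }])
// -> 5 visemes, 80 ms each, starting at 0, 80, 160, 240 and 320 ms.
```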
Note that the TalkingHead class also accepts visemes and viseme timestamps directly, so when using the Microsoft Azure Speech SDK you don't need to rely on the built-in lip-sync modules at all. Azure uses a slightly different viseme standard, but mapping its viseme IDs to the Oculus set is straightforward, as shown in the project's test app.
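For instance, a minimal sketch of that Azure route could look like the one below. The ID-to-Oculus table is approximate — check the test app for the authoritative mapping — and the final speakAudio call assumes the audio-object format described in the README:

```js
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

// Approximate Azure viseme ID (0–21) -> Oculus viseme mapping; verify it
// against the project's test app before relying on it.
const azureToOculus = [
  "sil", "aa", "aa", "O", "E", "E", "I", "U", "O", "aa", "O",
  "I", "kk", "RR", "nn", "SS", "CH", "TH", "FF", "DD", "kk", "PP"
];

const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_KEY", "YOUR_REGION");
const synthesizer = new sdk.SpeechSynthesizer(speechConfig);

const visemes = [];    // Oculus viseme names
const vtimes = [];     // start times (ms)
const vdurations = []; // durations (ms), closed when the next viseme arrives

synthesizer.visemeReceived = (s, e) => {
  const t = e.audioOffset / 10000; // Azure reports 100-ns ticks -> ms
  if (visemes.length) vdurations.push(t - vtimes[vtimes.length - 1]);
  visemes.push(azureToOculus[e.visemeId]);
  vtimes.push(t);
};

// After synthesis finishes, give the last viseme a nominal duration and pass
// the audio plus viseme timings to TalkingHead, e.g. (per the README):
//   vdurations.push(100);
//   head.speakAudio({ audio: audioBuffer, visemes, vtimes, vdurations });
```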
How does this work behind the scenes? I was using Azure Speech with visemes for another project, but since ElevenLabs doesn't provide those, how do you make the blendshapes move for each phoneme? What I could imagine is that ElevenLabs also provides timestamps for each word (or phoneme?), and that you somehow have a mapping from phonemes to mouth positions (shape key positions)?