met4citizen / TalkingHead

Talking Head (3D): A JavaScript class for real-time lip-sync using Ready Player Me full-body 3D avatars.

Azure Speech API can return VISEME information directly. Will you consider accessing Azure Speech? #2

Closed PatrickkZhao closed 9 months ago

PatrickkZhao commented 10 months ago

That way you wouldn't need to convert English words to visemes, and you could support any language.

met4citizen commented 10 months ago

Thanks for the tip.

I just checked the Azure TTS support for visemes and according to their documentation here they only support viseme output for en-US.

My own use case for the class requires Finnish lip-sync. Furthermore, Google TTS gives enough free characters per month (4 million) so that for me the use is practically free. The same amount of characters would cost me 57€ per month using Azure.

That said, for Azure TTS users this would most likely improve the English lip-sync accuracy. Maybe the best approach here would be to extend the speakAudio functionality so that the calling app can include an optional viseme sequence. In this way the class would offer support not only to Azure but to any other TTS vendor able to provide viseme IDs.

I will look into this.

met4citizen commented 10 months ago

I tried the Microsoft Speech Services REST API today and quickly realized that it cannot provide word boundaries or viseme IDs in real time. Both of these features are only available through the asynchronous Batch Synthesis API. The calling app would have to poll for the results, and according to the statistics on Microsoft's pages, the latency ranges from 10-20 seconds up to 2 minutes. While there are some use cases where this is acceptable, for real-time apps it is not. As a comparison, ElevenLabs' WebSocket API provides word/character boundaries in real time with latency under 1 second.

Please let me know if I have misunderstood something.

Nevertheless, I still believe that extending the speakAudio method to accept optional viseme information is a good idea, and I plan to implement it in case one of the TTS vendors releases real-time viseme support or someone wants to experiment with Microsoft's Batch Synthesis API (not open to free-tier users).

met4citizen commented 9 months ago

The class method speakAudio now accepts audio and word information as well as visemes. Audio chunks and word information are mandatory, whereas the viseme information is optional. If the visemes are not provided, the method uses the built-in lip-sync algorithms (Finnish/English) to generate them as before.
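
For example, a minimal call could look like this (illustrative values; the field names shown for the audio object are indicative, so check the class documentation for the exact format):

```js
// Illustrative speakAudio call with pre-synthesized audio plus word and
// viseme timings. Field names (words/wtimes/wdurations, visemes/vtimes/
// vdurations) are indicative; see the class documentation for the exact
// audio-object format.
head.speakAudio({
  audio: audioBuffer,          // AudioBuffer decoded from the TTS response
  words: ["Hello", "world"],   // word strings (mandatory)
  wtimes: [0, 550],            // word start times in milliseconds
  wdurations: [500, 450],      // word durations in milliseconds
  visemes: ["aa", "O"],        // Oculus viseme names (optional)
  vtimes: [50, 600],           // viseme start times in milliseconds
  vdurations: [250, 300]       // viseme durations in milliseconds
});
// If the viseme arrays are omitted, the built-in Finnish/English lip-sync
// algorithm generates visemes from the words, as before.
```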

I also added an example of how to use the Microsoft Azure Speech API to the example app. The neural voices sound nice, but since the REST API currently can't provide word boundaries or visemes in real time, the lip-sync accuracy is no better than with Google TTS, and in long sentences it is typically much worse than with ElevenLabs. The lack of word timings in the Microsoft API also means that the subtitles appear one sentence at a time instead of one word at a time.

met4citizen commented 9 months ago

Update:

I was informed that internally the Microsoft Azure Speech SDK uses a WebSocket API similar to ElevenLabs', and that through this WebSocket API it is possible to get viseme IDs in real time. There seems to be no documentation for the WebSocket API itself, so the safest bet would be to use their new Speech SDK for JavaScript.

While I haven't personally tested it, here are some general tips for those interested in integrating Azure TTS visemes with TalkingHead:
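
As a rough, untested sketch, the integration could look something like the code below. The event and property names (visemeReceived, wordBoundary, audioOffset in 100-nanosecond ticks, result.audioData) follow my reading of the microsoft-cognitiveservices-speech-sdk documentation, so verify them against the current SDK version. Here, head is the TalkingHead instance, and azureToOculus is a placeholder for your own mapping from Azure's numeric viseme IDs to Oculus viseme names:

```js
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_KEY", "YOUR_REGION");
speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural";
speechConfig.speechSynthesisOutputFormat = sdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm;

// Passing null as the audio config stops the SDK from playing the audio
// itself; the synthesized audio is returned in result.audioData instead.
const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);

const visemes = [], vtimes = [];
synthesizer.visemeReceived = (s, e) => {
  visemes.push(azureToOculus(e.visemeId)); // placeholder: map Azure viseme ID (0-21) to an Oculus viseme name
  vtimes.push(e.audioOffset / 10000);      // 100-ns ticks -> milliseconds
};

const words = [], wtimes = [], wdurations = [];
synthesizer.wordBoundary = (s, e) => {
  words.push(e.text);
  wtimes.push(e.audioOffset / 10000);
  wdurations.push(e.duration / 10000);
};

synthesizer.speakTextAsync("Hello world",
  async (result) => {
    synthesizer.close();
    const audioCtx = new AudioContext();
    const audioBuffer = await audioCtx.decodeAudioData(result.audioData);
    head.speakAudio({ audio: audioBuffer, words, wtimes, wdurations,
                      visemes, vtimes, vdurations });
  },
  (error) => { console.error(error); synthesizer.close(); }
);
```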

Feel free to reach out if you have any further questions.

met4citizen commented 9 months ago

I did a quick proof of concept using the Microsoft Azure Speech SDK. Even without fine-tuning, the lip-sync accuracy is much better, and the viseme output is supported for several different languages.

Here is a short demo video:

https://youtu.be/OA6LBZjkzJI

PatrickkZhao commented 9 months ago

Thank you very much for your response, and I apologize for only seeing it now. I should have mentioned earlier that I was using the Azure Speech SDK for TTS.

Initially, I was considering converting Azure Speech's viseme IDs into Oculus visemes, but later I noticed that Azure Speech can directly return 3D blend shape frames.


This might allow direct control of the model animation without the need to convert to Oculus visemes.
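
Something like the following is what I had in mind. This is untested on my side; the mstts:viseme element and the e.animation payload are as I understand them from the Azure documentation, and synthesizer is an sdk.SpeechSynthesizer set up as in the earlier sketch:

```js
// Untested sketch: request blend-shape frames instead of plain viseme IDs
// by adding <mstts:viseme type="FacialExpression"/> to the SSML.
// "synthesizer" is an sdk.SpeechSynthesizer instance as in the earlier sketch.
const ssml = `
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:viseme type="FacialExpression"/>
    Hello world
  </voice>
</speak>`;

synthesizer.visemeReceived = (s, e) => {
  // With type="FacialExpression", e.animation is a JSON string whose
  // BlendShapes array holds frames of facial blend-shape weights that
  // could drive the avatar's morph targets directly, instead of going
  // through an Oculus viseme mapping.
  if (e.animation) {
    const frames = JSON.parse(e.animation).BlendShapes;
    // ... accumulate frames and apply them to the avatar's morph targets ...
  }
};

synthesizer.speakSsmlAsync(ssml,
  (result) => { /* result.audioData contains the synthesized audio */ },
  (error) => console.error(error)
);
```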

Of course, the demo you provided is also very good. Thank you very much for your work. This is crucial for me.