Closed: PatrickkZhao closed this issue 9 months ago
Thanks for the tip.
I just checked the Azure TTS support for visemes and according to their documentation here they only support viseme output for en-US.
My own use case for the class requires Finnish lip-sync. Furthermore, Google TTS gives enough free characters per month (4 million) so that for me the use is practically free. The same amount of characters would cost me 57€ per month using Azure.
That said, for Azure TTS users this would most likely improve English lip-sync accuracy. Perhaps the best approach would be to extend the speakAudio functionality so that the calling app can include an optional viseme sequence. That way the class would offer support not only to Azure but to all other TTS vendors able to provide viseme IDs.
I will look into this.
I tried the Microsoft Speech Services REST API today and quickly realized that it cannot provide word boundaries or viseme IDs in real time. Both features are only available through the asynchronous Batch Synthesis API. The calling app would have to poll for the results, and according to the statistics on Microsoft's pages, the latency ranges from 10-20 seconds up to 2 minutes. While there are some use cases where this is acceptable, for real-time apps it is not. As a comparison, ElevenLabs' WebSocket API provides word/character boundaries in real time with latency under 1 second.
Please let me know if I have misunderstood something.
Nevertheless, I still believe that extending the speakAudio method to accept optional viseme information is a good idea, and I plan to implement it in case one of the TTS vendors releases real-time viseme support or someone wants to experiment with Microsoft's Batch Synthesis API (not open to free-tier users).
The class method speakAudio now accepts audio, word information, and visemes. Audio chunks and word information are mandatory, whereas the viseme information is optional. If visemes are not provided, the method uses the built-in lip-sync algorithms (Finnish/English) to generate them as before.
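As a minimal sketch of what such an audio object might look like: the array names (`audio`, `words`, `wtimes`, `wdurations`, `visemes`, `vtimes`, `vdurations`) follow the format described in this thread, but the `buildAudioObject` helper, the event-object shapes, and all sample values are hypothetical.

```javascript
// Build an audio object for speakAudio, with optional viseme data.
// The array names follow the format discussed above; this helper and
// the event-object shapes (text/start/duration, id/start/duration)
// are illustrative assumptions, not part of the class API.
function buildAudioObject(audioChunks, wordEvents, visemeEvents = []) {
  const obj = {
    audio: audioChunks,                        // mandatory: audio chunks
    words: wordEvents.map(w => w.text),        // mandatory: word info
    wtimes: wordEvents.map(w => w.start),      // word start times [ms]
    wdurations: wordEvents.map(w => w.duration)
  };
  if (visemeEvents.length) {                   // optional: viseme info
    obj.visemes = visemeEvents.map(v => v.id);
    obj.vtimes = visemeEvents.map(v => v.start);
    obj.vdurations = visemeEvents.map(v => v.duration);
  }
  return obj;
}

// Example: if no viseme events are given, the visemes/vtimes/vdurations
// arrays are simply omitted and the class falls back to its built-in
// Finnish/English lip-sync algorithms.
const audioObj = buildAudioObject(
  [new ArrayBuffer(0)],
  [{ text: "hello", start: 0, duration: 300 }],
  [{ id: "PP", start: 0, duration: 100 }]
);
// head.speakAudio(audioObj); // hypothetical call into the class
```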
I also added an example of how to use the Microsoft Azure speech API to the example app. The neural voices sound nice, but since the REST API currently can't provide word boundaries or visemes in real time, the lip-sync accuracy is no better than with Google TTS, and in long sentences it is typically much worse than with ElevenLabs. The lack of word timings in the Microsoft API also means that the subtitles appear one sentence at a time instead of one word at a time.
Update:
I was informed that internally the Microsoft Azure Speech SDK uses a WebSocket API similar to ElevenLabs', and through this WebSocket API it is possible to get viseme IDs in real time. There seems to be no documentation for this WebSocket API, so the safest bet is to use their Speech SDK for JavaScript.
While I haven't personally tested it, here are some general tips for those interested in integrating Azure TTS with visemes into TalkingHead:

- Use the Microsoft Azure Speech SDK for JavaScript.
- Request viseme output by adding `<mstts:viseme type='redlips_front'/>` inside the `voice` tag.
- Use the `WordBoundary` and `VisemeReceived` event handlers to collect word boundary information and viseme information.
- Put the audio chunks into the `audio` array of the audio object, the word boundaries into the `words`, `wtimes`, and `wdurations` arrays, and the viseme information into the `visemes`, `vtimes`, and `vdurations` arrays.
- Call the `speakAudio` method.

Feel free to reach out if you have any further questions.
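The event-collection steps above can be sketched as follows. The handler logic below assumes the Speech SDK's event shapes (`wordBoundary` events with `text`, `audioOffset`, and `duration`; `visemeReceived` events with `visemeId` and `audioOffset`, with offsets in 100-nanosecond ticks); the synthesizer wiring itself is left as comments because it requires the `microsoft-cognitiveservices-speech-sdk` package, and this is an untested sketch, not the class's own code.

```javascript
// Collect Azure Speech SDK events into the arrays expected by speakAudio.
// Offsets arrive in 100-nanosecond ticks, so divide to get milliseconds.
const TICKS_PER_MS = 10000;

function makeCollector() {
  const obj = { audio: [], words: [], wtimes: [], wdurations: [],
                visemes: [], vtimes: [], vdurations: [] };
  return {
    obj,
    onWordBoundary(e) {            // assumed shape: e.text, e.audioOffset, e.duration
      obj.words.push(e.text);
      obj.wtimes.push(e.audioOffset / TICKS_PER_MS);
      obj.wdurations.push(e.duration / TICKS_PER_MS);
    },
    onViseme(e) {                  // assumed shape: e.visemeId, e.audioOffset
      // The SDK reports only start offsets, so close the previous
      // viseme when the next one begins (the last one stays open).
      const t = e.audioOffset / TICKS_PER_MS;
      if (obj.vtimes.length) {
        obj.vdurations.push(t - obj.vtimes[obj.vtimes.length - 1]);
      }
      obj.visemes.push(e.visemeId);
      obj.vtimes.push(t);
    }
  };
}

// Hypothetical wiring (requires the Speech SDK):
// const synthesizer = new SpeechSDK.SpeechSynthesizer(config);
// const c = makeCollector();
// synthesizer.wordBoundary = (s, e) => c.onWordBoundary(e);
// synthesizer.visemeReceived = (s, e) => c.onViseme(e);
// ...add the audio chunks to c.obj.audio, then pass c.obj to speakAudio.
```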
I did a quick proof-of-concept using the Microsoft Azure Speech SDK. Even without fine-tuning, the lip-sync accuracy is much better, and viseme output is supported for several different languages.
Here is a short demo video:
Thank you very much for your response, and I apologize for only seeing it now. I should have mentioned using the Azure Speech SDK for TTS earlier.
Initially, I was considering converting Azure Speech's viseme IDs into Oculus visemes, but later I noticed that Azure Speech can directly return 3D blend shapes, like this.
This might allow direct control of the model animation without the need to convert to Oculus visemes.
Of course, the demo you provided is also very good. Thank you very much for your work. This is crucial for me.
So you don't need to convert English words to visemes, and any language can be supported.
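For anyone who still wants to go the viseme-ID route, the conversion mentioned above can be sketched as a simple lookup. The mapping below from Azure's viseme IDs (0-21) to Oculus viseme names is illustrative only, an assumption to be tuned per model and voice, not an official table.

```javascript
// Approximate mapping from Azure Speech viseme IDs (0-21) to Oculus
// viseme names. Illustrative only -- adjust per model/voice.
const AZURE_TO_OCULUS = [
  "sil", "aa", "aa", "O", "E", "E", "I", "U", "O", "aa", "O",
  "aa", "E", "RR", "nn", "SS", "CH", "TH", "FF", "DD", "kk", "PP"
];

function azureVisemeToOculus(id) {
  // Fall back to silence for unknown IDs.
  return AZURE_TO_OCULUS[id] ?? "sil";
}
```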