rhasspy / larynx

End to end text to speech system using gruut and onnx
MIT License

Ideas for lipsync and visemes? #47

Open kinoc opened 2 years ago

kinoc commented 2 years ago

First, love the project!

I have a robotic and virtual agent project that I'm trying to get as close to real-time response as possible. I use the following to generate speech:

```
python3 fastVoice.py | larynx -v ek --interactive --ssml --raw-stream --cuda --half --max-thread-workers 8 --stdin-format lines --process-on-blank-line | aplay -r 22050 -c 1 -f S16_LE
```

where fastVoice.py just dumps the SSML from a socket onto stdin (remember to flush properly...): fastVoice.txt
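For reference, here is a rough sketch of the shape of fastVoice.py (the attached fastVoice.txt is the real version; the host/port and function names here are just illustrative): accept SSML over a TCP socket and copy it to stdout, ending each request with a blank line so larynx's `--process-on-blank-line` kicks in, and flushing so latency stays low.

```python
import socket
import sys


def forward(reader, out) -> None:
    """Copy SSML lines from `reader` to `out`, then write a blank line and
    flush, so larynx (--process-on-blank-line) synthesizes immediately."""
    for line in reader:
        out.write(line)
    out.write("\n")   # blank line triggers synthesis of the buffered SSML
    out.flush()       # flushing promptly is what keeps end-to-end latency low


def serve(host: str = "127.0.0.1", port: int = 5555) -> None:
    """Accept one connection at a time and forward its SSML to stdout."""
    with socket.create_server((host, port)) as srv:
        while True:
            conn, _addr = srv.accept()
            with conn, conn.makefile("r", encoding="utf-8") as rf:
                forward(rf, sys.stdout)


# Wired into the pipeline as:
#   python3 fastVoice.py | larynx ... --raw-stream ... | aplay -r 22050 -c 1 -f S16_LE
```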

All works very well; audio generally starts <1 s after receiving the message. The question is how to get a phoneme-viseme sequence synced with the audio output. I can manage level-0-ish lipsync by looking at the amplitude of the audio output, but that only gives enough info for the jaw, not the visemes of the lips.
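The amplitude-only "level 0" approach can be sketched roughly like this, assuming larynx's `--raw-stream` format (22050 Hz, mono, S16_LE). The function names and the floor/ceiling thresholds are my own choices, not anything from larynx:

```python
import struct
import sys

SAMPLE_RATE = 22050
FRAME_MS = 20  # one jaw value per 20 ms analysis window
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000


def chunk_rms(pcm: bytes) -> float:
    """Root-mean-square of an S16_LE mono PCM chunk, normalized to 0..1."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    mean_sq = sum(s * s for s in samples) / n
    return (mean_sq ** 0.5) / 32768.0


def jaw_openness(rms: float, floor: float = 0.01, ceil: float = 0.3) -> float:
    """Map RMS to a 0..1 jaw parameter; floor gates out noise,
    ceil keeps normal speech from pinning the jaw wide open."""
    if rms <= floor:
        return 0.0
    return min(1.0, (rms - floor) / (ceil - floor))


def stream_jaw() -> None:
    """Tee raw audio through unchanged while emitting jaw values on stderr,
    e.g.:  larynx ... --raw-stream | python3 jaw.py | aplay -r 22050 -c 1 -f S16_LE"""
    while True:
        pcm = sys.stdin.buffer.read(SAMPLES_PER_FRAME * 2)
        if not pcm:
            break
        sys.stdout.buffer.write(pcm)  # pass audio through to aplay
        sys.stdout.buffer.flush()
        print(f"jaw={jaw_openness(chunk_rms(pcm)):.2f}", file=sys.stderr)
```

This gives jaw motion with essentially zero added latency, but since it never sees phonemes it can't distinguish, say, an "oo" from an "ee" — which is exactly the gap I'm asking about.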

Do you have any ideas/pointers on how to maintain the responsiveness of "--raw-stream" while getting real-time matching info to generate the matching visemes?