rhasspy / larynx

End to end text to speech system using gruut and onnx
MIT License

Dot (.) stops synthesis #27

Closed (chainria closed 2 years ago)

chainria commented 2 years ago

I am new to Larynx, so maybe my question can be answered easily and quickly, but I couldn't find anything to fix it.

Whenever a dot character is encountered, synthesis ends. I don't even need multiple sentences, but if it encounters something like "X (feat. Y)", it just says "X feat". I am using Larynx via OpenTTS in Home Assistant, but this can easily be replicated in the GUI as well. So how exactly can I fix this? And maybe for later, how exactly can I synthesize multiple sentences? Thank you very much in advance, the voices are superb!

follower commented 2 years ago

Hi, to make this issue easier to debug it might be helpful to supply some additional information:

Certainly in my experience Larynx has synthesized multiple sentences without special handling, so there might be something about the setup that's not working properly.

What operating system/version is this occurring on?

(Also, are these song titles? Have you tested with typing sample sentences directly in case there's an issue with possible hidden/special characters in the title?)

chainria commented 2 years ago

Hi!

I tried several voices in the GUI, they all did it. Didn't try all of them, but I can certainly give it a shot. I am using harvard right now. Edit: ALL of the voices I could try in the interface exhibit this problem. "Hello. This is two sentences." simply yields "Hello."

This is Home Assistant OS running on a Raspberry Pi 4B with 4GB of RAM. I don't know how to start the script on the CLI since it is using Docker containers. If there is a way, I can gladly try. Edit: Tried using Rhasspy. This does work like a treat. So it almost looks like an issue with OpenTTS?

I encountered it with song titles as well as a simple "Testing. Testing. Testing." and it stops at the first sentence. I also tried pasting multiple sentences from anywhere and it stopped at the first dot.

follower commented 2 years ago

Thanks for trying those other approaches & reporting back.

Based on your descriptions it does seem likely to be an issue around the OpenTTS integration.

I don't have any experience with that aspect of this project so can't give you any specific help for that, sorry.

In terms of debugging approach I'd look at how the text string gets passed through the different parts of the system to see if part of it is getting dropped along the way--maybe see if Home Assistant/OpenTTS logs the input/output text data during processing to see where/if it changes?
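One way to narrow this down is to take Home Assistant out of the loop: request the same text from the OpenTTS server directly and compare the duration of the returned audio for one sentence against several. The sketch below assumes OpenTTS exposes a `GET /api/tts` endpoint taking `voice` and `text` query parameters and returning a WAV file. The port (5500) and the voice id come from the debug log in this thread. The helper names are made up for illustration.

```python
import io
import urllib.parse
import urllib.request
import wave


def build_tts_url(base_url: str, voice: str, text: str) -> str:
    """Build the request URL (endpoint and parameter names are assumptions)."""
    query = urllib.parse.urlencode({"voice": voice, "text": text})
    return f"{base_url}/api/tts?{query}"


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the playback duration of an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def fetch_duration(base_url: str, voice: str, text: str) -> float:
    """Request speech for `text` and return how long the audio is."""
    with urllib.request.urlopen(build_tts_url(base_url, voice, text)) as response:
        return wav_duration_seconds(response.read())


if __name__ == "__main__":
    base = "http://localhost:5500"  # assumption: server reachable locally
    voice = "larynx:eva_k-glow_tts"
    for text in ("Test.", "Test. Eins. Zwei. Drei."):
        print(repr(text), fetch_duration(base, voice, text), "seconds")
```

If both requests return roughly the same duration, the text is being truncated inside OpenTTS rather than by the caller.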

chainria commented 2 years ago

Thanks! I already assumed that I'll need to report this in OpenTTS itself, just thought I had to start somewhere. And since it doesn't seem to be larynx itself, I'll try that. Also I found how to enable debug and it seems it synthesizes the text in three completely different runs.

--debug --larynx-quality high --larynx-noise-scale 0.333 --larynx-length-scale 1.0
DEBUG:opentts:Namespace(cache=None, debug=True, flite_voices_dir=None, host='0.0.0.0', larynx_denoiser_strength=0.001, larynx_length_scale=1.0, larynx_noise_scale=0.333, larynx_quality='high', marytts_like=None, marytts_url=None, mozillatts_url=None, no_espeak=False, no_festival=False, no_flite=False, no_larynx=False, no_nanotts=False, port=5500)
DEBUG:opentts:Loaded TTS systems: espeak, flite, festival, nanotts, marytts, larynx
Running on 0.0.0.0:5500 over http (CTRL + C to quit)
DEBUG:opentts:['espeak-ng', '--voices']
DEBUG:opentts:Festival voices: {'kal_diphone'}
DEBUG:opentts:Loading voices from voices/marytts
DEBUG:opentts:Voice(id='bits1-hsmm', name='bits1-hsmm', gender='female', language='de', locale='de', tag=None)
DEBUG:opentts:Voice(id='dfki-pavoque-neutral-hsmm', name='dfki-pavoque-neutral-hsmm', gender='male', language='de', locale='de', tag=None)
DEBUG:opentts:Voice(id='bits3-hsmm', name='bits3-hsmm', gender='male', language='de', locale='de', tag=None)
DEBUG:opentts:['espeak-ng', '--voices']
DEBUG:opentts:Festival voices: {'kal_diphone'}
INFO:opentts:Synthesizing with larynx:eva_k-glow_tts (23 char(s))...
DEBUG:opentts:Synthesizing line 1 (23 char(s))
DEBUG:gruut.toksen:Number converter regex: ^-?\d+([,.]\d+)*\w+$
DEBUG:gruut.phonemize:Loading lexicon from voices/larynx/gruut/de-de/lexicon.db
DEBUG:glow_tts:Loading model from voices/larynx/de-de/eva_k-glow_tts/generator.onnx
DEBUG:hifi_gan:Loading HiFi-GAN model from voices/larynx/hifi_gan/vctk_small/generator.onnx
DEBUG:opentts:TTS settings: {'noise_scale': 0.333, 'length_scale': 1.0}
DEBUG:opentts:Vocoder settings: {'denoiser_strength': 0.001}
DEBUG:larynx:{'': 0, '|': 1, '‖': 2, '#': 3, 'a': 4, 'aɪ̯': 5, 'aʊ̯': 6, 'aː': 7, 'b': 8, 'd': 9, 'd͡ʒ': 10, 'eː': 11, 'f': 12, 'g': 13, 'h': 14, 'iː': 15, 'j': 16, 'k': 17, 'l': 18, 'm': 19, 'n': 20, 'oː': 21, 'p': 22, 'p͡f': 23, 's': 24, 't': 25, 't͡s': 26, 't͡ʃ': 27, 'uː': 28, 'v': 29, 'x': 30, 'yː': 31, 'z': 32, 'ãː': 33, 'ç': 34, 'õː': 35, 'øː': 36, 'ŋ': 37, 'œ': 38, 'ɐ': 39, 'ɔ': 40, 'ɔʏ̯': 41, 'ə': 42, 'ɛ': 43, 'ɛː': 44, 'ɛ̃ː': 45, 'ɪ': 46, 'ʁ': 47, 'ʃ': 48, 'ʊ': 49, 'ʏ': 50, 'ʒ': 51, 'ʔ': 52, 'χ': 53}
DEBUG:larynx:Words for 'Test.': ['test', '.']
DEBUG:larynx:Phonemes for 'Test.': ['#', 't', 'ɛ', 's', 't', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Words for 'Eins.': ['eins', '.']
DEBUG:larynx:Phonemes for 'Eins.': ['#', 'a', 'eː', 'n', 's', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Words for 'Zwei.': ['zwei', '.']
DEBUG:larynx:Got mels in 0.19924291200004518 second(s) (shape=(1, 80, 48))
DEBUG:larynx:Phonemes for 'Zwei.': ['#', 't͡s', 'v', 'aɪ̯', '#', '‖', '‖']
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:larynx:Words for 'Drei.': ['drei', '.']
DEBUG:larynx:Phonemes for 'Drei.': ['#', 'd', 'ʁ', 'aɪ̯', '#', '‖', '‖']
DEBUG:larynx:Got mels in 0.29696504096500576 second(s) (shape=(1, 80, 62))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Initializing denoiser
DEBUG:hifi_gan:Initializing denoiser
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 1.1020990899996832 second(s) (shape=(12288,))
DEBUG:larynx:Real-time factor: 0.42 (audio=0.56 sec, infer=1.31 sec)
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:opentts:Got 24620 WAV byte(s) for line 1
DEBUG:opentts:Synthesized 24620 byte(s) in 9.16156530380249 second(s)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 1.1691214450402185 second(s) (shape=(15872,))
DEBUG:larynx:Real-time factor: 0.49 (audio=0.72 sec, infer=1.47 sec)
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Got mels in 0.27170646691229194 second(s) (shape=(1, 80, 46))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:larynx:Got mels in 0.27694296499248594 second(s) (shape=(1, 80, 48))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.377510052989237 second(s) (shape=(11776,))
DEBUG:larynx:Real-time factor: 0.82 (audio=0.53 sec, infer=0.65 sec)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.28667956008575857 second(s) (shape=(12288,))
DEBUG:larynx:Real-time factor: 0.99 (audio=0.56 sec, infer=0.57 sec)
INFO:opentts:Synthesizing with larynx:rebecca_braunert_plunkett-glow_tts (23 char(s))...
DEBUG:opentts:Synthesizing line 1 (23 char(s))
DEBUG:glow_tts:Loading model from voices/larynx/de-de/rebecca_braunert_plunkett-glow_tts/generator.onnx
DEBUG:opentts:TTS settings: {'noise_scale': 0.333, 'length_scale': 1.0}
DEBUG:opentts:Vocoder settings: {'denoiser_strength': 0.001}
DEBUG:larynx:Words for 'Test.': ['test', '.']
DEBUG:larynx:Phonemes for 'Test.': ['#', 't', 'ɛ', 's', 't', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Words for 'Eins.': ['eins', '.']
DEBUG:larynx:Phonemes for 'Eins.': ['#', 'a', 'eː', 'n', 's', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Words for 'Zwei.': ['zwei', '.']
DEBUG:larynx:Phonemes for 'Zwei.': ['#', 't͡s', 'v', 'aɪ̯', '#', '‖', '‖']
DEBUG:larynx:Words for 'Drei.': ['drei', '.']
DEBUG:larynx:Phonemes for 'Drei.': ['#', 'd', 'ʁ', 'aɪ̯', '#', '‖', '‖']
DEBUG:larynx:Got mels in 0.1456335949478671 second(s) (shape=(1, 80, 28))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:larynx:Got mels in 0.17054839001502842 second(s) (shape=(1, 80, 30))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.20584573596715927 second(s) (shape=(7168,))
DEBUG:larynx:Real-time factor: 0.92 (audio=0.33 sec, infer=0.35 sec)
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:opentts:Got 14380 WAV byte(s) for line 1
DEBUG:opentts:Synthesized 14380 byte(s) in 5.937345743179321 second(s)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.2447036859812215 second(s) (shape=(7680,))
DEBUG:larynx:Real-time factor: 0.83 (audio=0.35 sec, infer=0.42 sec)
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Got mels in 0.15959876799024642 second(s) (shape=(1, 80, 28))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:larynx:Got mels in 0.16664397495333105 second(s) (shape=(1, 80, 26))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.22502166801132262 second(s) (shape=(7168,))
DEBUG:larynx:Real-time factor: 0.84 (audio=0.33 sec, infer=0.39 sec)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.1921218209899962 second(s) (shape=(6656,))
DEBUG:larynx:Real-time factor: 0.84 (audio=0.30 sec, infer=0.36 sec)
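The byte counts in this log can be cross-checked against the audio shapes. The sketch below assumes 16-bit mono PCM with a canonical 44-byte WAV header, which the log does not state. Under that assumption, the size of each returned WAV matches exactly one audio chunk, the first one logged in that run, even though four sentences were phonemized and vocoded.

```python
# Assumptions (not stated in the log): 16-bit mono PCM, 44-byte WAV header.
WAV_HEADER_BYTES = 44
BYTES_PER_SAMPLE = 2


def expected_wav_bytes(sample_counts):
    """Size of a WAV file holding all the given chunks back to back."""
    return WAV_HEADER_BYTES + BYTES_PER_SAMPLE * sum(sample_counts)


def chunks_in_response(returned_bytes, sample_counts):
    """Return how many leading chunks account for returned_bytes, or None."""
    for count in range(1, len(sample_counts) + 1):
        if expected_wav_bytes(sample_counts[:count]) == returned_bytes:
            return count
    return None


# (returned WAV bytes, audio shapes in the order they were logged)
RUNS = {
    "eva_k-glow_tts": (24620, [12288, 15872, 11776, 12288]),
    "rebecca_braunert_plunkett-glow_tts": (14380, [7168, 7680, 7168, 6656]),
}

for voice, (returned, shapes) in RUNS.items():
    print(voice, "chunks in response:", chunks_in_response(returned, shapes))
    print(voice, "bytes if all four were joined:", expected_wav_bytes(shapes))
```

For both voices the response corresponds to a single chunk (12288 × 2 + 44 = 24620 and 7168 × 2 + 44 = 14380), which is consistent with only the first sentence being returned.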

synesthesiam commented 2 years ago

Yep, this appears to be a bug in the OpenTTS integration. I messed up and assumed that sentences were split in a different place. I'll get this cleaned up and release a new version.
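As an illustration of this class of bug (not the actual OpenTTS code, which is not shown in this thread), here is a minimal sketch in which a synthesizer yields one audio chunk per sentence and the caller wrongly assumes a single chunk per call. All function names are hypothetical, the sentence splitter is deliberately naive, and the "audio" is just placeholder bytes.

```python
import re
from typing import Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter for illustration only: break after ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def synthesize(text: str) -> Iterator[bytes]:
    """Hypothetical TTS stand-in that yields one chunk per sentence."""
    for sentence in split_sentences(text):
        yield sentence.encode("utf-8")  # placeholder for real audio


def first_chunk_only(text: str) -> bytes:
    """Buggy caller: keeps the first chunk and drops the rest."""
    return next(synthesize(text), b"")


def all_chunks(text: str) -> bytes:
    """Fixed caller: joins every chunk before returning."""
    return b"".join(synthesize(text))


print(first_chunk_only("Test. Eins. Zwei. Drei."))  # b'Test.'
print(all_chunks("Test. Eins. Zwei. Drei."))        # b'Test.Eins.Zwei.Drei.'
print(first_chunk_only("X (feat. Y)"))              # b'X (feat.'
```

The last call mirrors the "X (feat. Y)" report: a splitter that breaks on every dot, combined with a caller that keeps only the first chunk, cuts the text off at "feat.".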

chainria commented 2 years ago

Thank you very much! I am looking forward to it :)

synesthesiam commented 2 years ago

Should be fixed now in OpenTTS 2.1 :+1: