SSML file not processing under --ssml flag

PeterSprague commented 2 years ago

Testing both Larynx and larynx.server, installed via pip3 in a venv. All dependencies are satisfied. Fedora 34, fully up to date.

Using the example SSML saved in a file, TTS-SSML_test.txt:

larynx.server: paste the contents of the file into the input box and run. With the SSML checkbox either checked or unchecked, the voice recognizes the SSML commands and does not read them aloud.

Using larynx from the command line:

$ python3 -m larynx -v southern_english_female-glow_tts < TTS-SSML_test.txt

reads the whole file aloud, including all the SSML statements.

$ python3 -m larynx --ssml -v southern_english_female-glow_tts < TTS-SSML_test.txt

errors:

Traceback (most recent call last):
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/text_processor.py", line 479, in process
    root_element = etree.fromstring(text)
  File "/usr/lib64/python3.9/xml/etree/ElementTree.py", line 1348, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 7

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/larynx/__main__.py", line 720, in <module>
    main()
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/larynx/__main__.py", line 294, in main
    for result_idx, result in enumerate(tts_results):
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/larynx/__init__.py", line 71, in text_to_speech
    for sentence in gruut.sentences(
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/__init__.py", line 79, in sentences
    graph, root = text_processor(text, lang=lang, ssml=ssml, **process_args)
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/text_processor.py", line 432, in __call__
    return self.process(*args, **kwargs)
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/text_processor.py", line 483, in process
    root_element = etree.fromstring(f"<speak>{text}</speak>")
  File "/usr/lib64/python3.9/xml/etree/ElementTree.py", line 1348, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 22

Also tried piping the file in via cat:

$ cat TTS-SSML_test.txt | python3 -m larynx --ssml -v southern_english_female-glow_tts

Same error. Without the --ssml flag it produces an audio file but, as above, reads out all the SSML statements.

I've been through the documentation page and tried the examples to narrow this down. Nothing in it specifically covers producing audio from an SSML file. The non-SSML examples all work on my workstation.

Would like to get this working for a small project that produces training audio files of Shorin-Ryu Karate Yakusokus for my Black belt test practice

Thanks,

synesthesiam commented 2 years ago

Hi @PeterSprague, thanks for trying out Larynx 🙂

Can you post an example of your SSML? I can't seem to reproduce the issue on my machine. Maybe I have something wrong with my SSML parser.

PeterSprague commented 2 years ago

I directly copied your SSML example from the README: TTS-SSML_test.txt

$ python3 -m larynx --ssml -v southern_english_female-glow_tts < TTS-SSML_test.txt

synesthesiam commented 2 years ago

OK, I see what's happening now. The command-line interface for Larynx is line-based -- it assumes each line is an individual utterance. If you remove the newline characters, it should work fine.
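
To illustrate why line-based handling breaks multi-line SSML, here is a minimal sketch. It is not Larynx's or gruut's actual code: the sample SSML and the `parse_status` helper are made up for illustration. It only uses the standard-library XML parser that appears in the traceback above.

```python
import xml.etree.ElementTree as etree

# Hypothetical multi-line SSML input, standing in for the README example.
ssml_text = """<speak>
  <s>Hello world.</s>
</speak>"""


def parse_status(fragment: str) -> str:
    """Return "ok" if the fragment parses as XML, else the parser's error message."""
    try:
        etree.fromstring(fragment)
        return "ok"
    except etree.ParseError as err:
        return str(err)


# Line-based handling: each line is parsed as if it were a complete utterance.
# The first line is just the opening tag, so the parser reports that the
# element never closes, as in the reported traceback.
per_line = [parse_status(line) for line in ssml_text.splitlines()]

# Whole-input handling: the same text parses fine as a single document.
whole = parse_status(ssml_text)

for line, status in zip(ssml_text.splitlines(), per_line):
    print(f"{line!r}: {status}")
print(f"whole input: {whole}")
```

The opening and closing tags fail on their own, while the complete document is well-formed.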

I may need to consider whether --ssml should imply reading the entire input as one utterance, or whether some other flag should indicate this.

PeterSprague commented 2 years ago

"remove the newline characters"

I'm missing something here. Are you saying to create a mixed blob of text and ssml cmds? How is that even decipherable by a human writer once the file gets more than a few "sentences"?

Here is a copy of my espeak-ng SSML file, which is working well. Other than the voice name, Larynx should be able to read it too: Yakusoku-6_attacker_detail_TTS-SSML-Espeak.txt

$ espeak-ng -f Yakusoku-6_attacker_detail_TTS-SSML-Espeak.txt -s 150 -p 50 -l 30 -k20 -m

synesthesiam commented 2 years ago

No, I'm suggesting something like this as a workaround:

tr < Yakusoku-6_attacker_detail_TTS-SSML-Espeak.txt '\n' ' ' | bin/larynx --ssml -v en-us

If the input all goes on one line into Larynx, it will be read correctly. This is intended to allow multiple sets of sentences to come in, like:

<speak>1st set of sentences</speak>
<speak>Next set of sentences</speak>
...

but I think with SSML, people will expect it to read the entire input at once.
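
For anyone who would rather do the same preprocessing in Python, a rough equivalent of the tr workaround might look like this. The `collapse_newlines` name and the sample SSML are illustrative assumptions, not part of Larynx.

```python
import xml.etree.ElementTree as etree


def collapse_newlines(text: str) -> str:
    # Same effect as the tr command in the workaround: each newline becomes a space.
    return text.replace("\n", " ")


# Hypothetical SSML file contents, kept readable with one element per line.
ssml_text = "<speak>\n  <s>First sentence.</s>\n  <s>Second sentence.</s>\n</speak>\n"

one_line = collapse_newlines(ssml_text)

# The collapsed text is still well-formed XML, so it can be handed to a
# line-based reader as a single utterance.
root = etree.fromstring(one_line)
sentences = [s.text for s in root.iter("s")]
print(one_line)
print(sentences)
```

The source file can stay human-readable; only the text sent to the command line is flattened.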

PeterSprague commented 2 years ago

OK, stripping the newlines as it reads the file:

$ tr < Yakusoku-6_attacker_detail_TTS-SSML-Espeak.txt '\n' ' ' | python3 -m larynx --ssml -v en-us

Works well, thanks

When do you think you will be adding to the SSML set to give increased control over the delivery?

synesthesiam commented 2 years ago

What sorts of SSML tags do you think would be most useful?

PeterSprague commented 2 years ago

TTS and SSML are very new to me; my background is more in computer vision and DL for assessing ecological impacts.

I guess it really comes back to interests and/or business case. Do you want to create a self-hosted TTS solution that uses ML techniques to provide an alternative to Azure or Google? Then I would follow their subsets of SSML. Otherwise, if you want to target more specific use cases, honing the subset to whatever enhances that usage might be the preferred development direction.

For my usage, based on https://www.w3.org/TR/speech-synthesis11/#S3.2, I think having control of the voice characteristics via "3.2.4 prosody Element" would be good.

synesthesiam commented 2 years ago

Fixed the --ssml input mode in Larynx 1.1 (it now reads the entire input).
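
As a rough sketch of the difference between the two input modes (not the actual Larynx 1.1 code; `iter_utterances` is a hypothetical helper):

```python
import io
from typing import Iterator, TextIO


def iter_utterances(stream: TextIO, ssml: bool) -> Iterator[str]:
    """Yield utterances from a text stream (hypothetical helper).

    With ssml=True the entire input is one utterance; otherwise each
    non-empty line is its own utterance.
    """
    if ssml:
        text = stream.read().strip()
        if text:
            yield text
    else:
        for line in stream:
            line = line.strip()
            if line:
                yield line


sample = "<speak>\n<s>Hello.</s>\n</speak>\n"

# Whole-input mode keeps the SSML document in one piece.
as_ssml = list(iter_utterances(io.StringIO(sample), ssml=True))

# Line-based mode splits it into fragments that are not valid SSML on their own.
as_lines = list(iter_utterances(io.StringIO(sample), ssml=False))

print(as_ssml)
print(as_lines)
```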

Regarding prosody, I can control the rate and volume with GlowTTS (Larynx's TTS model), but pitch and contour aren't something that can be changed in the model.
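
For reference, markup using only the rate and volume attributes of the W3C prosody element could be generated like this. This is a sketch of the SSML 1.1 syntax only; whether Larynx accepts the element is not confirmed in this thread, and `prosody_ssml` is a hypothetical helper.

```python
import xml.etree.ElementTree as etree


def prosody_ssml(text: str, rate: str = "slow", volume: str = "loud") -> str:
    """Wrap text in <speak><prosody rate=... volume=...> (hypothetical helper).

    "slow" and "loud" are label values defined by SSML 1.1 for these attributes.
    """
    speak = etree.Element("speak")
    prosody = etree.SubElement(speak, "prosody", {"rate": rate, "volume": volume})
    prosody.text = text
    return etree.tostring(speak, encoding="unicode")


markup = prosody_ssml("Step forward and punch.")
print(markup)
```

Pitch and contour are left out on purpose, since those are the attributes described above as not adjustable in the model.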

rhasspy / larynx

SSML file not processing under --ssml flag #36