rumkin / duotone-reader

Screen reading enhancement with duo-voice text reading.
https://rumkin.github.io/duotone-reader

Use and parse SSML to change voices, pitch, rate #3

Open guest271314 opened 4 years ago

guest271314 commented 4 years ago

The goal of this repository described at https://github.com/rumkin/duotone-reader/issues/2#issuecomment-559903302

This project is research started to find answers to the questions you ask. The current API design doesn't support independent research, and merely realizes already well-known solutions. I think it should be enhanced to make speech synthesis itself more researchable, helping independent developers experiment with this technology and find more and more solutions.

is achievable by parsing an HTML or SSML document, see https://github.com/WICG/speech-api/issues/10, https://github.com/mhakkinen/SSML-issues, https://github.com/alia11y/SSMLinHTMLproposal, https://github.com/guest271314/SpeechSynthesisSSMLParser.

Changing voices is possible at any time after the voices are loaded with getVoices(), using the onvoiceschanged event and/or parsing an SSML element where the voice is set, e.g.,

<voice name="english_rp" languages="en-US" required="name">${Math.E}</voice>

https://github.com/guest271314/SpeechSynthesisSSMLParser/blob/master/SpeechSynthesisSSMLParserTest.html#L525.
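A minimal sketch of matching a voice to such an element's attributes (the array shape mirrors speechSynthesis.getVoices() output; the voice names used below are made up):

```javascript
// Pick a voice for an SSML <voice> element, preferring an exact name
// match, then a language prefix match, then the first available voice.
function pickVoice(voices, { name, lang }) {
  return (
    voices.find((v) => v.name === name) ||
    (lang && voices.find((v) => v.lang.startsWith(lang))) ||
    voices[0] ||
    null
  );
}

// In a browser, voices can load asynchronously:
// speechSynthesis.onvoiceschanged = () => {
//   const u = new SpeechSynthesisUtterance(String(Math.E));
//   u.voice = pickVoice(speechSynthesis.getVoices(),
//     { name: "english_rp", lang: "en-US" });
//   speechSynthesis.speak(u);
// };
```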

rumkin commented 4 years ago

Thanks for the links! As I understand it, SSML itself isn't supported by browsers now, and the utterance value is text extracted from the SSML markup. Thus manual SSML parsing breaks the sentence apart, and speech synthesis can't build a correct phrase consisting of several utterances with different voice settings. Am I correct?

Also, as I understand it, SSML solves one part of the problem I referred to: speech-related voice control. I think there should be more control at each level: sound output as an audio stream, filters for making a voice sound softer or more metallic, emotion control (in the future), etc. I think there is some progress on this, but I'm just new to the question and do not know much, or where to look it up.

Also, I think it is very important to separate artistic or research usage (book reading, games, personal communication) from service or everyday usage (work, learning, business communication). In the second case the publisher should use semantic and pronunciation markup without specifying voice characteristics, because users should have full control and be able to choose voices themselves. And these voices should be set up in OS/browser settings.

guest271314 commented 4 years ago

Thanks for the links! As I understand it, SSML itself isn't supported by browsers now, and the utterance value is text extracted from the SSML markup. Thus manual SSML parsing breaks the sentence apart, and speech synthesis can't build a correct phrase consisting of several utterances with different voice settings. Am I correct?

SSML parsing is not supported by browsers right now (https://bugs.chromium.org/p/chromium/issues/detail?id=795371; https://bugzilla.mozilla.org/show_bug.cgi?id=1425523). That is one reason I composed an SSML parser.

Yes, it is possible to construct sentences with breaks, pitch, rate, and voice changes. You can load https://github.com/guest271314/SpeechSynthesisSSMLParser/blob/master/SpeechSynthesisSSMLParserTest.html and observe the output for yourself. So far I have implemented the <p>, <s>, <break>, <prosody>, <say-as>, <sub>, and <voice> elements of https://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/.
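To illustrate the utterance-queue idea, here is a simplified sketch: it is not the parser's actual code, and it uses a naive regex split (a real implementation should use an XML/DOM parser), but it shows how one SSML string becomes several voice-tagged segments:

```javascript
// Split SSML text into segments at <voice> boundaries so each segment
// can become its own SpeechSynthesisUtterance with a different voice.
// Remaining tags inside each segment are stripped for brevity.
function splitByVoice(ssml, defaultVoice = "") {
  const segments = [];
  const re = /<voice\b[^>]*\bname="([^"]*)"[^>]*>([\s\S]*?)<\/voice>/g;
  let last = 0;
  let m;
  while ((m = re.exec(ssml)) !== null) {
    const before = ssml.slice(last, m.index).replace(/<[^>]+>/g, "").trim();
    if (before) segments.push({ voice: defaultVoice, text: before });
    segments.push({ voice: m[1], text: m[2].replace(/<[^>]+>/g, "").trim() });
    last = re.lastIndex;
  }
  const tail = ssml.slice(last).replace(/<[^>]+>/g, "").trim();
  if (tail) segments.push({ voice: defaultVoice, text: tail });
  return segments;
}
```

Each returned segment would then be turned into an utterance whose voice is resolved against getVoices().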

Also, as I understand it, SSML solves one part of the problem I referred to: speech-related voice control. I think there should be more control at each level: sound output as an audio stream, filters for making a voice sound softer or more metallic, emotion control (in the future), etc. I think there is some progress on this, but I'm just new to the question and do not know much, or where to look it up.

That is possible by adjusting the voice, pitch, and rate, and by using the <prosody> SSML element. There is still work to be done on the SSML parser I began. Feel free to fork the repository and add content to the tests once you are up to speed on the specified functionality of SSML.
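As a sketch of how <prosody> values might map onto SpeechSynthesisUtterance.rate (the keyword-to-number table is illustrative; SSML defines the keywords but leaves the actual values to the synthesis processor):

```javascript
// Map an SSML <prosody> rate value to SpeechSynthesisUtterance.rate,
// whose allowed range in the Web Speech API is 0.1-10 (default 1).
// The keyword values below are illustrative, not mandated by SSML.
const RATE_KEYWORDS = {
  "x-slow": 0.5, "slow": 0.75, "medium": 1, "fast": 1.5, "x-fast": 2, "default": 1,
};

function prosodyRate(value) {
  const keyword = RATE_KEYWORDS[value];
  if (typeof keyword === "number") return keyword;
  const n = parseFloat(value);              // handles "0.25" and "150%"
  if (Number.isNaN(n)) return 1;            // fall back to the default rate
  const rate = String(value).trim().endsWith("%") ? n / 100 : n;
  return Math.min(10, Math.max(0.1, rate)); // clamp to the Web Speech range
}
```

An utterance for a segment would then set u.rate = prosodyRate(el.getAttribute("rate")); pitch can be handled analogously within its 0-2 range.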

And these voices should be set up in OS/browser settings.

Depending on the requirement, it is possible to bypass the Web Speech API (including the issues with implementations https://stackoverflow.com/questions/48504228/how-can-i-make-my-web-browser-speak-programmatically/48504311#48504311) altogether and use speech-dispatcher with espeak-ng (or another TTS engine) by way of Native Messaging https://stackoverflow.com/questions/48219981/how-to-programmatically-send-a-unix-socket-command-to-a-system-server-autospawne, or by other means, e.g., WebAssembly. That means clarifying the requirement for this project is essential: what are you trying to achieve?

rumkin commented 4 years ago

I read the source of the SSML parser, and according to it you create a queue of utterances and then play them. But there is an issue with that: my browser pronounces such sentences with pauses between the words. For example, the phrase "Hello, World!" would be pronounced differently when presented as one string or as two.

Here is a JSFiddle example: https://jsfiddle.net/rumkin/o3x5Lf96/. Have you checked this difference with your solution?
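For context, the difference in the fiddle comes down to how the utterance queue is built: one SpeechSynthesisUtterance for the whole phrase versus one per word. A minimal sketch of the two queue shapes (buildQueue is a made-up helper, not from the fiddle):

```javascript
// Build the list of texts to speak: the whole phrase as one utterance,
// or one utterance per word (which introduces audible gaps between words).
function buildQueue(text, perWord) {
  return perWord ? text.split(/\s+/).filter(Boolean) : [text];
}

// In a browser each entry would be queued in order:
// for (const t of buildQueue("Hello, World!", true)) {
//   speechSynthesis.speak(new SpeechSynthesisUtterance(t));
// }
```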

I think native SSML support will fix this issue. There is basic SSML support in Chromium, but I haven't checked yet what it can do.

That is possible by adjusting the voice, pitch, and rate, and by using the <prosody> SSML element. There is still work to be done on the SSML parser I began. Feel free to fork the repository and add content to the tests once you are up to speed on the specified functionality of SSML.

I need to get deeper into SSML to be more confident with the terminology. I saw in the spec such params as age and gender, and it seems pretty interesting to control these on the fly. But it's based on the current generation of TTS technologies, which is still rigid. I think if it were enhanced with neural networks, we could achieve a more flexible solution and control more specific characteristics like emotion, tooth count, and other physical params.

Depending on the requirement, it is possible to bypass the Web Speech API

I'm aiming to enhance the specification and work with the W3C committee to make it a standard, not some kind of hack.

That means clarifying the requirement for this project is essential: what are you trying to achieve?

Currently it's hard to say without going deeper into this question. But the global goal is to make the browser a complete solution for creating, testing, and using new TTS solutions, and to give all engineers equal access to this technology. So the browser should be able to:

  1. use custom software TTS delivered over the network,
  2. record generated speech as a byte array,
  3. support a high-level API to control params with ease,
  4. debug accessibility issues related to the Speech API.

This project's goals are to promote the idea of a two-voice model and to demonstrate how easy it is to work and experiment with the Speech API in the browser. The next step is to create a web site Speech API accessibility debugger, and I think your solution fits well for this. But it should be reworked for better maintainability. I will think about how I can help with that.

guest271314 commented 4 years ago

For example, the phrase "Hello, World!" would be pronounced differently when presented as one string or as two.

What is the expected result?

There is basic SSML support in Chromium

Are you certain? The last time I checked, Chromium had not implemented SSML parsing for the Web Speech API.

I'm aiming to enhance the specification and work with the W3C committee to make it a standard

Web Speech API is currently under WICG umbrella.

I am banned from WICG for 1,000 years, for fraudulent reasons concocted by that body. W3C did not contest that I had signed up correctly; they deleted the account I had created, and the content I have published under their umbrella right now is what they cited as the reason, as an issue with my own account. Thus their conduct is fraudulent in nature and substance as well. I will not be able to contribute to the current specification as it is under WICG control right now.

If I am able to help in any way, give a ping.

rumkin commented 4 years ago

What is the expected result?

Have you listened to the example I attached? It shows pretty clearly how different it sounds when you speak the whole phrase versus speaking it word by word. In the second case there are additional pauses between the words. The phrase doesn't sound like regular speech and falls apart. It could change the sense of the message at sentence borders.

Are you certain? The last time I checked, Chromium had not implemented SSML parsing for the Web Speech API.

Well, I did not check meticulously, but it accepted <?xml version="1.0" ...><speech>... without reading out X M L VERSION....

I am banned from WICG for 1,000 years, for fraudulent reasons concocted by that body. W3C did not contest that I had signed up correctly; they deleted the account I had created, and the content I have published under their umbrella right now is what they cited as the reason, as an issue with my own account. Thus their conduct is fraudulent in nature and substance as well. I will not be able to contribute to the current specification as it is under WICG control right now.

Sad to hear that. I think the situation could be resolved with working code and community support.

If I am able to help in any way, give a ping.

Sure. Thanks!

guest271314 commented 4 years ago

But there is an issue with that: my browser pronounces such sentences with pauses between the words. For example, the phrase "Hello, World!" would be pronounced differently when presented as one string or as two.

That depends on what the input and expected result are.

If the requirement is to input a sentence without distinguishable gaps between the words, that is possible using the linked parser code.

If the requirement is a pause, the <break> SSML element can be used.

The SSML documents within the linked code are tests.

Can you file an issue at the linked repository if the output is not as expected?

Well, I did not check meticulously, but it accepted <?xml version="1.0" ...><speech>... without reading out X M L VERSION....

Chromium appears to at least strip XML from the input text. It does not parse SSML such as

<sub alias="Chromium">Cr</sub> and <sub alias="Firefox">FF</sub>

I am still not sure exactly what output you are expecting that you are not able to achieve now.
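The <sub> substitution quoted above can be sketched as a single text pass (regex-based for brevity; a real implementation would walk the parsed XML tree):

```javascript
// Replace each <sub alias="...">...</sub> with its alias, so the engine
// speaks the alias text instead of the written abbreviation.
function applySub(ssml) {
  return ssml.replace(
    /<sub\b[^>]*\balias="([^"]*)"[^>]*>[\s\S]*?<\/sub>/g,
    (_, alias) => alias
  );
}
```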

guest271314 commented 4 years ago

What local TTS engine is speech-dispatcher connecting to on the OS you are currently using?

guest271314 commented 4 years ago

So that we are prospectively using the same code, you can install espeak-ng (https://github.com/espeak-ng/espeak-ng) and python3-speechd, which contains the executable spd-conf for configuring speech-dispatcher (https://github.com/brailcom/speechd), which Chromium creates a socket connection to at startup when the --enable-speech-dispatcher flag is set. Execute spd-conf to create a configuration file, set espeak-ng as the default speech synthesis engine, and choose the audio output service, e.g., Pulse, ALSA, etc. When speechSynthesis.getVoices() is executed we should then have the same voices loaded, where "English_(America) espeak-ng" or any other voice can be selected, for uniformity of output to test, record, and compare.

guest271314 commented 4 years ago

@rumkin FWIW see https://gist.github.com/guest271314/59406ad47a622d19b26f8a8c1e1bdfd5

guest271314 commented 4 years ago

@rumkin Initial implementation of the proof of concept from the previous post: https://github.com/guest271314/native-messaging-espeak-ng. It provides a means to input SSML as a string or XML Document and use multiple voices, e.g.,

```js
nativeMessagingEspeakNG(`<speak version="1.0" xml:lang="en-US">
    Here are <say-as interpret-as="characters">SSML</say-as> samples.
    Try a date: <say-as interpret-as="date" format="dmy" detail="1">10-9-1960</say-as>
    This is a <break time="2500ms"/> 2.5 second pause.
    This is a <break/> sentence break.<break/>
    <voice name="Storm" rate="x-slow" pitch="0.25">espeak-<say-as interpret-as="characters">ng</say-as> using Native Messaging, Native File System </voice>
    and <voice name="en-afrikaans"> <sub alias="JavaScript">JS</sub></voice>
  </speak>`)
.then(async ({ input, phonemes, result }) => {
  console.log({ input, phonemes, result });
  const audio = new Audio(URL.createObjectURL(new Blob([new Uint8Array(result)])));
  await audio.play();
}, console.error);
```

rumkin commented 4 years ago

Hi, thanks for this. I wish it were a more widely adopted solution that everyone could run in their browser without needing to install anything on their system. But I understand that's not possible at the moment, so I'll be searching for a way to communicate with the WHATWG.

I'll reopen this so others can read this issue. Please don't close it.

guest271314 commented 4 years ago

I wish it were a more widely adopted solution that everyone could run in their browser without needing to install anything on their system.

That is the purpose of the solution in the format provided by Native Messaging.

Either way, code has to be installed on their system.

The code can be shipped in the Chromium source code, as it is with Chromium OS (https://chromium.googlesource.com/chromiumos/third_party/espeak-ng/+/refs/heads/chrome); however, AFAICT SSML parsing is not enabled by default (https://github.com/pettarin/espeakng.js-cdn/issues/1).

Or, the code can be installed by the user and executed utilizing Native Messaging.

The former requires asking the Chromium authors to ship existing code to achieve the requirement by default, which I have already done, more than once.

The latter provides front-end control over the entire process. Native Messaging might appear to require an "installation" initially, though that is not necessarily the case: bash or another shell can be used. Where espeak-ng, festival, text2wav, etc., or indeed any native program, is already installed, that binary can be executed and its stdout sent to the browser, without installing espeak-ng or Opus. I created the application in the form at the previous link to demonstrate the proof of concept, that is, that the requirement is possible right now.

The code can be maintained and implemented by the front end, for the front end, as described in a linked answer above. https://github.com/simov/native-messaging includes a Firefox version as well.

I have not yet thoroughly tested the SSML parsing output of espeak-ng and compared it to the tests run on the elements whose parsing I have completed in JavaScript.

If you find any errors with the code (during testing), do not hesitate to file an issue. Having completed the initial version, I am now exploring creating a virtual device for the audio output. I am able to get a direct MediaStream of the file (when the input is not deleted, which is the default now), though Chromium's implementation of --use-file-for-fake-audio-capture=/path/to/output.wav%noloop (https://chromium.googlesource.com/chromium/src/+/4cdbc38ac425f5f66467c1290f11aa0e7e98c6a3/media/audio/fake_audio_input_stream.cc) does not provide an obvious or non-redundant means to determine when the end of the file is reached.

In any event, do not hesitate to file an issue, PR, or feature request to improve the code.

guest271314 commented 4 years ago

Opus installation (included in the initial code primarily to reduce file size) is not necessary. espeak and espeak-ng each output WAV audio to a file or to stdout.

guest271314 commented 4 years ago

I wish it were a more widely adopted solution that everyone could run in their browser without needing to install anything on their system.

A bash version using AudioWorklet to output audio: https://github.com/guest271314/native-messaging-espeak-ng/tree/bash-audioworklet.

It assumes espeak-ng is installed on the system and available in PATH.

It should be possible to substitute a host written in C, C++, Python, Rust, etc. for "native-messaging-host-bash.sh" (https://github.com/guest271314/native-messaging-espeak-ng/blob/bash-audioworklet/host/native-messaging-host-bash.sh).
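For instance, a substitute host in Node.js would implement the same wire protocol the bash script speaks: each message is a 32-bit little-endian byte length followed by UTF-8 JSON. A sketch (the 1 MB host-to-browser limit comes from Chrome's Native Messaging documentation):

```javascript
// Chrome Native Messaging framing: 4-byte little-endian length header,
// then the UTF-8 JSON payload. Host-to-browser messages must stay
// under 1 MB; browser-to-host messages may be up to 4 GB.
function encodeMessage(obj) {
  const payload = Buffer.from(JSON.stringify(obj), "utf8");
  if (payload.length > 1024 * 1024) {
    throw new Error("native messaging host messages must not exceed 1 MB");
  }
  const header = Buffer.alloc(4);
  header.writeUInt32LE(payload.length, 0);
  return Buffer.concat([header, payload]);
}

function decodeMessage(buf) {
  const length = buf.readUInt32LE(0);
  return JSON.parse(buf.subarray(4, 4 + length).toString("utf8"));
}

// A complete host would loop over process.stdin, decode each frame,
// run espeak-ng, and write an encoded reply to process.stdout, e.g.
// process.stdout.write(encodeMessage({ result: "ok" }));
```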

Note: We do not actually send the input text to the Native Messaging host or send back the audio output as a file using the Native Messaging protocol, due to limitations on message size (https://developer.chrome.com/extensions/nativeMessaging) and on processing the input with bash (https://stackoverflow.com/a/24777120):

The maximum size of a single message from the native messaging host is 1 MB, mainly to protect Chrome from misbehaving native applications. The maximum size of the message sent to the native messaging host is 4 GB.

  • The message length must not exceed 1024*1024.

Instead we send a single character, 0, and execute espeak-ng with the -f and -w options to read the file "input.txt" written by Native File System, which we also use to get the file "output.wav" before removing each file from the local file system.