rhdunn / espeak

eSpeak NG is an open source speech synthesizer that supports 101 languages and accents.
http://reecedunn.co.uk/espeak-for-android
GNU General Public License v3.0

Parsing ssml tags causes problems on Android #47

Closed pvagner closed 11 years ago

pvagner commented 11 years ago

This has been brought up by Tyler Spivey on an eyes-free email list.

Steps to reproduce:

Actual results: eSpeak would not read content between < [less] and > [greater] symbols.

Expected results: It should all be read no matter what is written there. This is how other Android TTS services operate.

I don't know if SSML processing should be disabled altogether or if the characters < and > should just be escaped.

pvagner commented 11 years ago

As I assumed, this appears to be easy. There is a flag which tells eSpeak to parse SSML while speaking. It has been enabled for ages, even in the original eyes-free version. I have modified it at two locations because there are different code paths for Android 2.X and Android 4.X.

diff --git a/android/jni/jni/eSpeakService.cpp b/android/jni/jni/eSpeakService.cpp
index 19ea3cf..145c97c 100644
--- a/android/jni/jni/eSpeakService.cpp
+++ b/android/jni/jni/eSpeakService.cpp
@@ -325,7 +325,7 @@ JNICALL Java_com_reecedunn_espeak_SpeechSynthesis_nativeSynthesize(
   espeak_SetSynthCallback(SynthCallback);
   const espeak_ERROR result = espeak_Synth(c_text, strlen(c_text),
       0, // position
       POS_CHARACTER,
       0, // end position (0 means no end position)

diff --git a/android/jni/jni/espeakengine.cpp b/android/jni/jni/espeakengine.cpp
index 26dbfec..2606489 100644
--- a/android/jni/jni/espeakengine.cpp
+++ b/android/jni/jni/espeakengine.cpp
@@ -583,7 +583,7 @@ tts_result TtsEngine::synthesizeText(const char *text, int8_t *buffer, size_t bu
   espeak_Synth(text, strlen(text),
       0, // position
       POS_CHARACTER,
       0, // end position (0 means no end position)

pvagner commented 11 years ago

Oops, it is not a good idea to paste code like this here... http://pastie.org/6470628

rhdunn commented 11 years ago

Looking at the nativeSynthesize implementation in eSpeakService.cpp, espeak_Synth is being called with the espeakSSML flag, which causes the text to be handled as potentially containing SSML and HTML tags.
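To make the role of the flag concrete, here is a minimal sketch of how the flags value passed to espeak_Synth changes when markup parsing is switched off. It assumes the text is passed as UTF-8. The constants are local stand-ins whose values are assumed to match espeakCHARS_UTF8 and espeakSSML in speak_lib.h, and synthFlags is a hypothetical helper, not code from this repository.

```cpp
// Local stand-ins so the sketch compiles without the eSpeak headers.
// Assumption: these values match espeakCHARS_UTF8 and espeakSSML in speak_lib.h.
constexpr unsigned int kCharsUtf8 = 0x01;
constexpr unsigned int kSsml = 0x10;

// Hypothetical helper: build the flags value handed to espeak_Synth.
// With parseMarkup == false the text is treated as plain UTF-8 text.
constexpr unsigned int synthFlags(bool parseMarkup) {
    return parseMarkup ? (kCharsUtf8 | kSsml) : kCharsUtf8;
}
```

Dropping the flag corresponds to calling synthFlags(false) in this sketch.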

Using the desktop version of eSpeak (tested with 1.46.46), SSML/HTML handling appears to work as expected:

espeak -m "Hello <b>world</b>."
espeak -m "Hello < world."
espeak -m "Hello < world >."

All say world correctly.

Some other things to note:

  1. eSpeak will only recognise an SSML/HTML tag if there is no space between the less-than character and the tag name (whitespace after the name does not matter), so <b> and <b > are recognised, but < b> and < b > are not.
  2. eSpeak will skip all text from the start of a recognised tag up to the next greater-than character, irrespective of the validity of the tag -- that is, "one <two three four> five" is spoken as "one five". While this is logical for XML/HTML content, it is problematic for the mixed text/SSML input that most text-to-speech programs handle (e.g. a simple a<b can cause the rest of the text to be ignored).
  3. eSpeak only recognises a subset of SSML/HTML tags (found in the ssmltags array in src/readclause.cpp). Any unrecognised tags are simply skipped. Again, while logical for XML/HTML content, this does not make sense for mixed text/SSML.
  4. eSpeak treats <! ... > as a comment instead of the correct <!-- ... -->.
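The behaviour in notes 1 and 2 can be modelled with a few lines of code. This is a toy model written for illustration, not eSpeak's implementation (which lives in src/readclause.cpp); the function name skipTagsLikeNotes is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Toy model of notes 1 and 2 (not eSpeak's code):
// a '<' followed directly by a non-space character starts a tag, and
// everything up to the next '>' (or the end of the text) is dropped.
std::string skipTagsLikeNotes(const std::string& text) {
    std::string spoken;
    std::size_t i = 0;
    while (i < text.size()) {
        const bool startsTag =
            text[i] == '<' && i + 1 < text.size() &&
            !std::isspace(static_cast<unsigned char>(text[i + 1]));
        if (!startsTag) {
            spoken += text[i++];
            continue;
        }
        const std::size_t close = text.find('>', i);
        if (close == std::string::npos) {
            break;  // no closing '>': the rest of the text is lost
        }
        i = close + 1;
    }
    return spoken;
}
```

In this model, "a<b and the rest" is reduced to "a", which is the failure mode described in note 2, while "Hello < world >." passes through unchanged because of the space after the less-than character.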

The simplest thing to do to address this would be to remove the espeakSSML flag, but PICO and other text-to-speech applications handle SSML in text.

pvagner commented 11 years ago

The desktop version correctly reads hello when given the command espeak -m "hello ". I have not checked other files, but removing the flag appears to have fixed the issue. I will see whether it brings other side effects. Maybe this is not the correct solution, but I don't know if there are any apps sending out SSML on Android.

rhdunn commented 11 years ago

I have pushed a change that removes espeakSSML, similar to the patch you provided. The correct solution would be for Android to inform the text-to-speech engine that the text contains SSML so that the engine processes it accordingly; otherwise, the engine should process it as normal text.

I am going to investigate this a bit more, including how PICO handles SSML.

rhdunn commented 11 years ago

Android does not support a "this is SSML" option in the API, so text-to-speech voices assume it is SSML (either with or without an explicit <speak> tag at the start). That is, they are processing it as pseudo-XML like the desktop text-to-speech engines do.

I am not sure how good these engines are at handling XML-like content, but I assume they only recognise SSML tags and anything that is not recognised or is not a valid XML tag they treat as text. That would be the best approach in mixed text/SSML content.
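A rough sketch of that "treat anything unrecognised as text" approach is below. The tag list is a deliberately tiny, hypothetical stand-in (it is not the contents of eSpeak's ssmltags array), and the function names are made up for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical, deliberately small tag list used only for this sketch.
static const std::set<std::string> kKnownTags = {"speak", "p", "s", "break", "b"};

// Returns the length of a recognised tag starting at text[pos] (which must
// be '<'), or 0 if this '<' should be spoken as ordinary text.
std::size_t knownTagLength(const std::string& text, std::size_t pos) {
    std::size_t i = pos + 1;
    if (i < text.size() && text[i] == '/') ++i;  // closing tag such as </b>
    const std::size_t nameStart = i;
    while (i < text.size() && std::isalnum(static_cast<unsigned char>(text[i]))) ++i;
    const std::string name = text.substr(nameStart, i - nameStart);
    if (kKnownTags.count(name) == 0) return 0;   // unknown or empty name
    if (i >= text.size()) return 0;              // text ends right after the name
    const char next = text[i];
    if (next != '>' && next != '/' &&
        !std::isspace(static_cast<unsigned char>(next))) {
        return 0;                                // e.g. <b-something>
    }
    const std::size_t close = text.find('>', i);
    if (close == std::string::npos) return 0;    // never closed: keep as text
    return close - pos + 1;
}

// Drops recognised tags and keeps everything else verbatim.
std::string stripKnownTags(const std::string& text) {
    std::string spoken;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t len = (text[i] == '<') ? knownTagLength(text, i) : 0;
        if (len > 0) {
            i += len;
        } else {
            spoken += text[i++];
        }
    }
    return spoken;
}
```

With this sketch, "if (x<y) return;" and "cd <path-to-project>/src" are left intact, while "This is <b>bold</b> text." loses only its b tags. It does not validate attributes, so something like "a<b and c>d" would still be misread as a tag; a real implementation would need a stricter check.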

Doing that would require improving the SSML processing in eSpeak itself and should be fixed in the upstream version (thus fixing the behaviour on the desktop as well).

Also note that according to the XML spec, eSpeak's detection of start tags (note 1) in my comment above is correct -- a tag is only valid if there is no space between the less-than character and the first letter of the tag name.

pvagner commented 11 years ago

I must admit that for me personally this has never been an issue. I am unable to find a real-world use case where this breaks things right now. I only reported this and suggested removing the SSML flag because another user reported it to the eyes-free list. Should we ask him for more input, or are we sticking with the SSML flag removed? Should we bring this to the espeak-general list in order to get some reasoning regarding the current implementation, and try to politely request the proposed enhancement to the SSML recognizer?

rhdunn commented 11 years ago

I have reported the issue to espeak-general. It would also be useful to gain more specific examples of what is broken in these cases.

I can imagine email containing code (e.g. if (x<y)) being broken when the SSML flag is used. ASCII-based math (such as a<b) would be broken as well. I am not sure what else would occur in the real world.

rhdunn commented 11 years ago

Note that Tyler's email mentions the "This is an example" text originating from a web page that is being read via Chrome. Also, documents (e.g. the README for the eyes-free android version of eSpeak) can use things like cd <path-to-project>/src. So this does occur in real-world situations.

The problem gets more interesting when the text being passed is an example of SSML or HTML tags that eSpeak recognises. For example, a website could have "This is <b>bold</b> text.", which eSpeak will handle, but it won't speak the b tags. The browser that passes this to eSpeak gets the less-than and greater-than characters escaped as &lt; and &gt;, so it does not treat them as bold tags, but it passes them unescaped to the text-to-speech engine, so eSpeak will treat them as bold tags. Ideally, the web browser should pass these in their escaped form.
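A minimal sketch of the escaping a caller could apply before handing text to an engine that parses markup. The function name escapeForSsml is hypothetical, and the sketch assumes the engine decodes the standard XML entities when it parses the text.

```cpp
#include <string>

// Escape the characters that would otherwise be read as markup.
// '&' is escaped as well, so that literal text such as "&lt;" in the
// input is not decoded into a '<' by the engine.
std::string escapeForSsml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (const char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            default:  out += c; break;
        }
    }
    return out;
}
```

For example, escapeForSsml("This is <b>bold</b> text.") yields "This is &lt;b&gt;bold&lt;/b&gt; text.", so the tags would be read out rather than interpreted.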

rhdunn commented 11 years ago

Aside from disabling SSML support for now and re-enabling it when upstream adds "text+SSML soup" support, there are several improvements that can be made to provide a better user experience.

  1. Add content sniffing for SSML and HTML content -- that is, enable espeakSSML if the text starts with <?xml, <speak, <html or <HTML.
  2. Add a configuration option to switch SSML behaviour -- include:

    a. "Text only."

    b. "Mixed text, SSML and HTML content."

The behaviour should then be:

  1. If the content sniffing matches an SSML or HTML document, use the espeakSSML flag;
  2. If the configuration option is mixed-text, use the espeakSSML flag;
  3. Otherwise, don't pass the espeakSSML flag (i.e. text-only).
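The three rules above could be sketched as follows. MarkupMode, looksLikeMarkupDocument and shouldPassSsmlFlag are hypothetical names chosen for this example; the prefixes are the ones listed in the first improvement.

```cpp
#include <string>

// Hypothetical names for the two values of the proposed configuration option.
enum class MarkupMode { TextOnly, Mixed };

// Content sniffing: true if the text starts with one of the document markers.
bool looksLikeMarkupDocument(const std::string& text) {
    static const std::string kPrefixes[] = {"<?xml", "<speak", "<html", "<HTML"};
    for (const std::string& prefix : kPrefixes) {
        if (text.compare(0, prefix.size(), prefix) == 0) {
            return true;
        }
    }
    return false;
}

// Decide whether the espeakSSML flag should be passed for this text.
bool shouldPassSsmlFlag(const std::string& text, MarkupMode mode) {
    if (looksLikeMarkupDocument(text)) return true;  // rule 1: sniffed document
    if (mode == MarkupMode::Mixed) return true;      // rule 2: user opted in
    return false;                                    // rule 3: text only
}
```

This sketch only matches markers at the very start of the text, so a document with leading whitespace would be treated as plain text; whether to skip leading whitespace first is a design choice left open here.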

rhdunn commented 11 years ago

I have now added content detection for SSML documents. If it is an SSML document, it will be processed as such, otherwise it will be processed as text. This is sufficient (no need for a user option or upstream enhancement).

NOTE: I have also added a simple test in the main activity so you can enter text to be spoken.

This will be included in the next update.