johnfactotum / foliate

Read e-books in style
https://johnfactotum.github.io/foliate/
GNU General Public License v3.0

PicoTTS offline Text-to-Speech support #841

Open Moonbase59 opened 2 years ago

Moonbase59 commented 2 years ago

Is your feature request related to a problem? Please describe. I wanted to add offline TTS (Text-to-Speech), and I’m not happy with eSpeak or Festival, but use PicoTTS for many other things already (it supports EN, DE, FR, IT, ES).

Describe the solution you'd like Assuming that most modern Linux systems already have sox and use PulseAudio, I wrote a little output script to be used with Foliate. Just copy it into your ~/bin folder or another appropriate location and make it executable (chmod +x foliate-picotts).

Describe alternatives you've considered eSpeak, gTTS

Additional context

Here is my script – feel free to include it with your software and/or website!

#!/bin/bash

# foliate-picotts -- Speak Foliate e-book using PicoTTS and PulseAudio
#
# Requirements:
# - pico2wave -- sudo apt install libttspico-utils
# - paplay -- Most modern systems now use PulseAudio for output
# - sox -- sudo apt install sox
# - sed -- POSIX standard command
#
# Use F5 within Foliate to start/stop speech.
#
# 2021-11-06 -- Matthias C. Hormann aka Moonbase59
#   - Code cleanup, bugfixing, added sox post-processing for better understanding.
#   - Added hyphenation (en-dash, em-dash) support.
# 2021-11-09 -- Matthias C. Hormann aka Moonbase59
#   - Fixed paragraph (quoting) problem for pauses between paragraphs.
#   - More code cleanup.
#   - Reduce PicoTTS volume to 60% to avoid clipping.
#   - Added lots of pronunciation hints and corrections.
# 2022-01-01 -- Matthias C. Hormann aka Moonbase59
#   - Final adaption for Foliate as TTS output script.
#   - Added F5 start/stop handling (SIGINT to script by Foliate)

text=$(cat) # get text from stdin into text buffer

# for debugging only
#echo "$text" > /tmp/foliate-picotts.txt

# Remove some oddities inserted by Foliate
# Remove inserted ";"
text=$(echo "$text" | sed 's|; | |g')
# Redo "---" to emdash
text=$(echo "$text" | sed 's|---|—|g')
# Redo "--" to ndash
text=$(echo "$text" | sed 's|--|–|g')
# Redo "...." to dot plus ellipsis
text=$(echo "$text" | sed 's|\.\.\.\.|\.…|g')
# Redo "..." to ellipsis
text=$(echo "$text" | sed 's|\.\.\.|…|g')

# General text preprocessing
# Insert pauses for en- and em-dashes, recognize from<en-dash>to numeric ranges
# en-dash between digits spoken as "bis"
text=$(echo "$text" | sed 's|\([[:digit:]]\{1,\}\)–\([[:digit:]]\{1,\}\)|\1 bis \2|g')
# en-dash with white space around it spoken as a pause
text=$(echo "$text" | sed 's|[[:space:]]–[[:space:]]|<break time="500ms"/>|g')
# em-dash always spoken as a pause
text=$(echo "$text" | sed 's|—|<break time="500ms"/>|g')

# Map Foliate (=ebook) language to PicoTTS language (restricted set)

# for debugging only
#echo "$FOLIATE_TTS_LANG_LOWER" > /tmp/foliate-picotts-lang.txt

case "${FOLIATE_TTS_LANG_LOWER:0:2}" in

    "de")
        lang="de-DE"
        # Abbreviations, acronyms and special characters
        # Note: Some are case-sensitive, others not.
        #       Some are full words only, others can be part of other words.
        #       Some can only be pronounced correctly using (XSAMPA) phonemes.
        #       Check RegExp flag "I" and "\b" word boundaries!
        # change "§" ("Paragrafzeichen") -> "Paragraf"
        text=$(echo "$text" | sed 's|§§|Paragrafen |g')
        text=$(echo "$text" | sed 's|§|Paragraf |g')
        # "S." -> "Seite"
        text=$(echo "$text" | sed 's|\bS\.|Seite|g')
        # "ff." -> "und folgende" (as in "S. 23 ff." or "S. 23ff.")
        text=$(echo "$text" | sed -E 's/(\b|[[:digit:]])(ff\.)/\1 und folgende/g')
        # change "Abs." ("Absender") -> "Absatz"
        text=$(echo "$text" | sed 's|\bAbs\.|Absatz|g')
        # "i." as in "Weißenburg i. Bay." -> "in" ("Bay." already recognized as "Bayern")
        text=$(echo "$text" | sed 's|\bi\.|in|g')
        # change "Co." ("Compagnon") -> "Koh"
        text=$(echo "$text" | sed 's|\bCo\.|Koh|g')
        # "MdB"/"M.d.B." -> "Mitglied des Bundestags"
        text=$(echo "$text" | sed 's|\bM\.d\.B\.|Mitglied des Bundestags|g')
        text=$(echo "$text" | sed 's|\bMdB\b|Mitglied des Bundestags|g')
        # "NASA"
        text=$(echo "$text" | sed 's|\bNASA\b|<phoneme ph=\"\\\"na:.za:\"/>|g')
        # "WLAN"
        text=$(echo "$text" | sed 's|\bWLAN\b|Weh-Lahn|g')
        # MPEG variants
        text=$(echo "$text" | sed 's|\bmpe\?g\b|M peck|gI')
        # JPEG variants
        text=$(echo "$text" | sed 's|\bjpe\?g\b|Tschähpeck|gI')
        # PNG
        text=$(echo "$text" | sed 's|\bpng\b|P N G|gI')
        # MP2, MP3, MP4
        text=$(echo "$text" | sed -E 's/\bmp(2|3|4)\b/M P \1/gI')

        # Other minor corrections
        # "geil"
        text=$(echo "$text" | sed 's|geil|gaihl|gI')
        # "Sex"
        text=$(echo "$text" | sed 's|\bSex\b|<phoneme ph=\"\\\"s\\Eks\"/>|g')
        # "asexuell"
        text=$(echo "$text" | sed 's|\basexuell|a-sexuell|gI')

        # Anglicisms
        # "Wow"
        text=$(echo "$text" | sed 's|\bwow\b|<phoneme ph=\"\\\"v\\a_U:\"/>|gI')
        # "University"
        text=$(echo "$text" | sed 's|\buniversity\b|Juni-Wörßiti|gI')
        # "nature"
        text=$(echo "$text" | sed 's|\bnature\b|Nätscher|gI')
        # "software"
        text=$(echo "$text" | sed 's|\bsoftware\b|ßoftwär|gI')
        # "smartphone"
        text=$(echo "$text" | sed 's|\bsmartphone|Smartphon|gI')
        ;;

    "en")
        lang="en-GB"
        # Abbreviations, acronyms and special characters
        # Note: Some are case-sensitive, others not.
        #       Some are full words only, others can be part of other words.
        #       Some can only be pronounced correctly using (XSAMPA) phonemes.
        #       Check RegExp flag "I" and "\b" word boundaries!
        # "&" ("ampersand") -> "and" (but not for HTML entities like &ndash;)
        text=$(echo "$text" | sed 's|[[:blank:]]&[[:blank:]]| and |g')
        # change "§§" ("section section") -> "sections"
        text=$(echo "$text" | sed 's|§§|sections|g')
        # "p." -> "page", "pp." -> "pages"
        text=$(echo "$text" | sed 's|\bpp\.|pages|g')
        text=$(echo "$text" | sed 's|\bp\.|page|g')
        # "ff." -> "and following" (as in "p. 23 ff." or "p. 23ff.")
        text=$(echo "$text" | sed -E 's/(\b|[[:digit:]])(ff\.)/\1 and following/g')
        # "HTML"
        text=$(echo "$text" | sed 's|\bHTML\b|H T M L|g')
        # MPEG variants
        text=$(echo "$text" | sed 's|\bmpe\?g\b|M peg|gI')
        # JPEG variants
        text=$(echo "$text" | sed 's|\bjpe\?g\b|J peg|gI')
        # PNG
        text=$(echo "$text" | sed 's|\bpng\b|P N G|gI')
        # MP2, MP3, MP4
        text=$(echo "$text" | sed -E 's/\bmp(2|3|4)\b/M P \1/gI')

        # Other minor corrections
        ;;

    "fr")
        lang="fr-FR"
        ;;

    "it")
        lang="it-IT"
        ;;

    "es")
        lang="es-ES"
        ;;

    *)
        lang="en-US"
        ;;
esac

# reduce volume to avoid clipping
text="<volume level=\"60\">$text</volume>"

# for debugging only
#echo "$text" > /tmp/foliate-picotts-finished.txt

# create WAV audio file using PicoTTS
pico2wave -l=$lang -w=/tmp/foliate.wav "$text"

# use sox to make output better understandable (voices are rather muffled)
# adding some treble in the range of +3 to +6 dB helps
# some voices might need a little bass reduction, use s/th like "bass -6 400"
# to avoid clipping, give headroom (gain -h) and reclaim afterwards (gain -r)
case "${FOLIATE_TTS_LANG_LOWER:0:2}" in
    "de")
        sox /tmp/foliate.wav /tmp/foliate-sox.wav gain -h treble +6 gain -r
        ;;
    "en")
        sox /tmp/foliate.wav /tmp/foliate-sox.wav gain -h treble +3 gain -r
        ;;
    "fr")
        ;;
    "it")
        ;;
    "es")
        ;;
    *)
        ;;
esac

# Prepare cleanup in case we get killed
function cleanup() {
    rm /tmp/foliate.wav
    rm /tmp/foliate-sox.wav
    rm /tmp/foliate-picotts.pid
}

# use PulseAudio to play in the background, remember player's PID
paplay /tmp/foliate-sox.wav > /dev/null & PID=$!
# save PID of player for possible xsay-kill on long texts
echo $PID >/tmp/foliate-picotts.pid

# trap SIGINT (issued by Foliate on 2nd F5)
trap "kill $PID; cleanup; exit 0" INT

# go wait for the spoken output (usually one page)
wait $PID

# Remove temporary files after paplay has finished (or was aborted)
cleanup

johnfactotum commented 2 years ago

Thanks a lot for this.

In the future I plan to switch to speech-dispatcher (which has a Pico module, I think). And more importantly Foliate needs to properly parse and extract the contents of the book, including any SSML markup (though I don't know how many books actually include those). See #829. Then we can also keep our own set of default pronunciation tweaks if no pronunciation info is included in the book.

Moonbase59 commented 2 years ago

You’re welcome! Trying spd-say, I couldn’t find a PicoTTS option, but I didn’t really look very closely. The standard spd voices sound rather robotic on my system. SSML might be a nice option, though I think almost no one uses it.

As you see in my script, PicoTTS also has scripting options (which I use heavily in my home automation). Too bad they were bought and put in the drawer… they had great voices, back in the days.

Should you switch over to spd—or something else—please don’t remove the scripting possibility! There’s still much to be gained when writing some adaptations (just check sound variation and some pronunciation help I add by "brute-force" sed). This gives Foliate a real advantage. (I’m also using Calibre’s reader which can only handle PicoTTS unmodified, and it’s much worse.)

Is there a reason that there are no linebreaks (for paragraph separation, which PicoTTS uses), and that en dashes, em dashes, and ellipses are changed back to their ASCII equivalents? And the many semicolons added? I had to undo all of these again to make it pronounce better.

johnfactotum commented 2 years ago

Is there a reason that there are no linebreaks [...] And the many semicolons added?

As I mentioned, Foliate currently does not parse and extract content properly. By "not properly" I mean that it uses Range.toString() (of the DOM API), which preserves all whitespace from the text nodes (I think), which means it's very much possible to have zero whitespace between paragraph elements, and newlines in the source will be preserved even though they aren't supposed to be rendered.

Another problem is that it speaks each page separately so that Foliate can turn to the next page when it finishes speaking. This approach obviously has many problems.

So this is mainly what I want to change. For example, it could process the document and insert linebreaks at block element boundaries. That would be much better than Range.toString(). If the TTS program supports marks or other kinds of events, then Foliate should feed the whole page or element to the TTS program, and use marks to handle highlighting and page turning.
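As a rough illustration of the "insert linebreaks at block element boundaries" idea, here is a minimal shell sketch. It is not Foliate's code: the helper name html_to_paragraphs and the list of tags treated as block elements are assumptions made for this example, it relies on GNU sed, and it uses regular expressions rather than a real parser, so it only suits simple, well-formed markup.

```shell
# Hypothetical helper: turn closing block tags into blank lines,
# then strip all remaining tags. Assumes GNU sed ("\n" in the replacement).
html_to_paragraphs() {
    sed -E \
        -e 's#</(p|div|h[1-6]|li|blockquote)>#\n\n#g' \
        -e 's#<[^>]+>##g'
}

# Two adjacent paragraphs with no whitespace between them in the source
printf '%s' '<p>First paragraph.</p><p>Second <em>one</em>.</p>' | html_to_paragraphs
```

With this input, the two paragraphs come out separated by a blank line instead of being glued together, which is the kind of paragraph break PicoTTS can turn into a pause.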


Speech-dispatcher is unrelated to all issues above. I want to switch to that for different reasons.

The first is that I do not want to reinvent the wheel. Currently Foliate is already sort of a very poor man's speech-dispatcher. It has the advantage of having a much, much simpler interface, but it lacks features such as selecting a different voice, speed, etc.

The second is security. In a sandbox environment, ideally you don't want to allow Foliate to run arbitrary commands outside the sandbox. Speech-dispatcher is itself configurable and extensible, so there should be no significant loss of customizability if we limit access to only speech-dispatcher in the sandbox.

The last reason is that it is already used by many other apps such as Firefox or Chromium. So in a sense it might make things easier for users (no need to configure different apps separately).

But really, Foliate should not even care or know about TTS programs. Ideally it should just use the SpeechSynthesis Web API. It would help make Foliate's code more reusable and portable as the Web API can be run on any browser on any platform. Unfortunately that's not supported by WebKitGTK, which is ideally where all this TTS code should live, where it would also benefit other WebKitGTK apps like Epiphany. So that is why I wrote in the other issue that while it would use speech-dispatcher, we should still use the SpeechSynthesis API and only defer to speech-dispatcher under the hood.

Should you switch over to spd—or something else—please don’t remove the scripting possibility! There’s still much to be gained when writing some adaptations (just check sound variation and some pronunciation help I add by "brute-force" sed).

I do understand the value in that, but really it's more of a by-product of the fact that TTS support in Foliate is extremely barebones. You can even abuse the TTS command to launch other non-TTS programs, for example. But that's not really how it's meant to be used.

Design-wise speaking, this is no different from injecting userstyles or userscripts to modify the content of the book. So ideally, if this kind of scripting is to be supported by Foliate, it should be done properly with a proper plugin or userscript API.

Also it could be argued that for forcing a certain pronunciation, one should be able to configure it in the TTS program, rather than doing it specifically for Foliate (provided that the content extraction issues mentioned above are fixed in Foliate).

Moonbase59 commented 2 years ago

All your points are valuable—and correct. Let’s see how it eventually evolves, looking forward to it!

And yes, of course I’m brute-forcing a lot here, because TTS on Linux is still not too great, and we sadly won’t get any more development on PicoTTS.

Lume6 commented 2 years ago

Thank you for Foliate and for the script, which works perfectly with Foliate! Could you tell me how to find the phonetic spelling for adding some words for the French language? For instance, "Windows" (I'm planning a demonstration with Linux). I wrote this, but it failed:

Windows

    ext=$(echo "$text" | sed 's|\bwindows\b|Win doz|gI')

Thanks a lot !

Moonbase59 commented 2 years ago

@Lume6: It’s possible that I didn’t test all languages, which would result in the file /tmp/foliate-sox.wav being missing. You can try a simple text on the command line like so:

echo "J'utilise Windows 10." | FOLIATE_TTS_LANG_LOWER='fr' foliate-picotts

If you get something like an open() error and a message that it can’t delete /tmp/foliate-sox.wav then it’s my fault… sorry for that.

Change this part in the script to have sox commands for all languages as follows:

# use sox to make output better understandable (voices are rather muffled)
# adding some treble in the range of +3 to +6 dB helps
# some voices might need a little bass reduction, use s/th like "bass -6 400"
# to avoid clipping, give headroom (gain -h) and reclaim afterwards (gain -r)
case "${FOLIATE_TTS_LANG_LOWER:0:2}" in
    "de")
        sox /tmp/foliate.wav /tmp/foliate-sox.wav gain -h treble +6 gain -r
        ;;
    "en")
        sox /tmp/foliate.wav /tmp/foliate-sox.wav gain -h treble +3 gain -r
        ;;
    "fr")
        sox /tmp/foliate.wav /tmp/foliate-sox.wav gain -h treble +3 gain -r
        ;;
    "it")
        sox /tmp/foliate.wav /tmp/foliate-sox.wav gain -h treble +3 gain -r
        ;;
    "es")
        sox /tmp/foliate.wav /tmp/foliate-sox.wav gain -h treble +3 gain -r
        ;;
    *)
        cp /tmp/foliate.wav /tmp/foliate-sox.wav
        ;;
esac

(You can adjust the treble +3 to whatever is appropriate for French.)
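Since the per-language branches differ only in the treble value, the case block could also be written as a small lookup. This is only a sketch: the helper names treble_for_lang and postprocess_cmd are made up for this example, the values are the ones from the block above, and the function prints the command instead of running it so it can be tried without sox installed.

```shell
# Hypothetical helper: treble boost in dB for a language code,
# empty for languages without a tuned value.
treble_for_lang() {
    case "$1" in
        de*)             echo 6 ;;
        en*|fr*|it*|es*) echo 3 ;;
        *)               echo "" ;;
    esac
}

# Hypothetical helper: print the post-processing command for a language code.
postprocess_cmd() {
    treble=$(treble_for_lang "$1")
    if [ -n "$treble" ]; then
        echo "sox /tmp/foliate.wav /tmp/foliate-sox.wav gain -h treble +$treble gain -r"
    else
        # no tuned value: just copy, so /tmp/foliate-sox.wav always exists
        echo "cp /tmp/foliate.wav /tmp/foliate-sox.wav"
    fi
}

postprocess_cmd "fr"
postprocess_cmd "pt"
```

Adjusting French then means changing a single number, and an unlisted language still produces the file that paplay expects.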

For "Windows" (the operating system), it might even be better to use X-SAMPA phonemes, which PicoTTS supports. Try something like:

    "fr")
        lang="fr-FR"
        # "Windows" (the operating system)
        text=$(echo "$text" | sed 's|\bwindows\b|<phoneme ph=\"win.doz\"/>|gI')
        ;;

Sounds like: foliate-sox.wav.zip
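To check a substitution rule without generating any audio, one option is to run only the sed step and look at the markup it produces. The sketch below wraps the French "Windows" rule in a helper whose name, apply_fr_rules, is made up for this example. It assumes GNU sed, because the "\b" word boundary and the "I" (ignore case) flag are GNU extensions.

```shell
# Hypothetical helper: apply the French "Windows" rule to text on stdin.
# Assumes GNU sed ("\b" and the "I" flag are GNU extensions).
apply_fr_rules() {
    sed 's|\bwindows\b|<phoneme ph="win.doz"/>|gI'
}

# Matches the whole word in any case ...
printf '%s\n' "J'utilise Windows 10." | apply_fr_rules
# ... but leaves longer words that merely start with it alone
printf '%s\n' "windowsill" | apply_fr_rules
```

If the first line comes back with the phoneme tag in place of the word and the second line comes back unchanged, the pattern does what it is meant to do, and any remaining problem lies in the audio steps.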

Happy experimenting!

Moonbase59 commented 2 years ago

Updated version of the script: foliate-picotts.zip

Try:

echo "Je préfère Linux à Windows." | FOLIATE_TTS_LANG_LOWER='fr' foliate-picotts

;-)

pabloab commented 2 years ago

Meanwhile, it should be possible to use gTTS, but echo "$text" | gtts-cli - -l $FOLIATE_TTS_LANG_LOWER | play -q -t mp3 - -t alsa doesn't seem to work.

BTW, it should be mentioned that this is an ISO 639-1 language code, not a three-letter ISO 639-3 code (as used by Tesseract, for example).
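The distinction matters for the script in this thread because it only looks at the first two characters of FOLIATE_TTS_LANG_LOWER. The sketch below reproduces that mapping as a function so it can be tried with different codes. The name pico_lang is made up for this example, and the region-suffixed value "de-at" is a hypothetical input, not something Foliate is confirmed to send.

```shell
# Hypothetical helper: map a language code to a PicoTTS voice by its
# two-letter prefix, like the case block in the script.
pico_lang() {
    case "$1" in
        de*) echo "de-DE" ;;
        en*) echo "en-GB" ;;
        fr*) echo "fr-FR" ;;
        it*) echo "it-IT" ;;
        es*) echo "es-ES" ;;
        *)   echo "en-US" ;;
    esac
}

pico_lang "fr"      # two-letter ISO 639-1 code
pico_lang "de-at"   # hypothetical code with a region suffix
pico_lang "spa"     # three-letter ISO 639-3 code for Spanish
```

A three-letter code only works when its first two letters happen to coincide with the two-letter code: "fra" would still select French, but "spa" falls through to the en-US default.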

Lume6 commented 1 year ago

Hi! Reinstalling Linux made me lose foliate-picotts. A fresh installation of the script from the foliate-picotts.zip archive fails with output I can't make sense of. Here is the output of ./foliate-picotts (I added the -vx options to the first line: #!/bin/bash -vx):

bernard@bernard:~/apps$ ./foliate-picotts  
 #!/bin/bash -vx 

# foliate-picotts -- Speak Foliate e-book using PicoTTS and PulseAudio
#
# Requirements:
# - pico2wave -- sudo apt install libttspico-utils
# - paplay -- Most modern systems now use PulseAudio for output
# - sox -- sudo apt install sox
# - sed -- POSIX standard command
#
# Use F5 within Foliate to start/stop speech.
#
# 2021-11-06 -- Matthias C. Hormann aka Moonbase59
#   - Code cleanup, bugfixing, added sox post-processing for better understanding.
#   - Added hypenation (en-dash, em-dash) support.
# 2021-11-09 -- Matthias C. Hormann aka Moonbase59

#   .......
# 2022-01-01 -- Matthias C. Hormann aka Moonbase59
#   - Final adaption for Foliate as TTS output script.
#   - Added F5 start/stop handling (SIGINT to script by Foliate)
# 2022-02-13 -- Matthias C. Hormann aka Moonbase59
#   - Add sox commands for FR, IT, ES, to prevent error (missing /tmp/foliate-sox.wav).
#   - Add French "Windows" pronunciation (thanks @Lume6!).

text=$(cat) # get text from stdin into text buffer
++ cat

As you can see, it stops at the line with the ++ cat message: text=$(cat) # get text from stdin into text buffer. Does anyone have an explanation? Thank you.

Lume6 commented 1 year ago

Good evening. In fact, I had reinstalled the distribution and had forgotten that I needed to copy the script, or create a symbolic link to it, in /usr/local/bin, for example. It works very well now. Thanks again to you!
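A note on the ++ cat trace shown earlier: that line is where the script reads its text from stdin. When the script is started directly in a terminal with nothing piped in, cat simply waits for keyboard input until Ctrl-D, which looks like a hang. A guard along the following lines would make that case visible. It is a sketch only: read_text is a made-up name and is not part of the posted script.

```shell
# Hypothetical guard: refuse to wait on a terminal, otherwise read all of stdin.
read_text() {
    if [ -t 0 ]; then
        # stdin is a terminal, so nothing was piped in
        echo "No input on stdin. Try: echo 'Hello' | foliate-picotts" >&2
        return 1
    fi
    cat
}

# With piped input the function behaves like the plain $(cat)
text=$(printf '%s\n' "Hello from a pipe." | read_text)
echo "$text"
```

In the script, this would take the place of text=$(cat), for example as text=$(read_text) || exit 1.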

johnfactotum commented 1 year ago

The GTK 4 version now uses speech-dispatcher exclusively.

Probably one can still add back the scripting ability. But it should have a better interface that works similarly to how it currently works with speech-dispatcher: