kxxt / aspeak

A simple text-to-speech client for Azure TTS API.
MIT License

:speaking_head: aspeak


A simple text-to-speech client for Azure TTS API. :laughing:

Note

Starting from version 6.0.0, aspeak by default uses the RESTful API of Azure TTS. If you want to use the WebSocket API, you can specify --mode websocket when invoking aspeak or set mode = "websocket" in the auth section of your profile.

Starting from version 4.0.0, aspeak is rewritten in Rust. The old Python version is available on the python branch.

You can sign up for an Azure account and then choose a payment plan as needed (or stick to free tier). The free tier includes a quota of 0.5 million characters per month, free of charge.

Please refer to the Authentication section to learn how to set up authentication for aspeak.

Installation

Download from GitHub Releases (Recommended for most users)

Download the latest release from here.

After downloading, extract the archive and you will get a binary executable file.

You can put it in a directory that is in your PATH environment variable so that you can run it from anywhere.

Install from AUR (Recommended for Arch Linux users)

From v4.1.0, you can install aspeak-bin from AUR.

Install from PyPI

Installing from PyPI will also install the Python binding of aspeak for you. Check Library Usage#Python for more information on using the Python binding.

pip install -U aspeak==6.0.0

Currently, prebuilt wheels are available only for the x86_64 architecture. Due to some technical issues, I haven't uploaded the source distribution to PyPI yet, so to build a wheel from source you need to follow the instructions in Install from Source.

Because of manylinux compatibility issues, the wheels for Linux are not available on PyPI. (But you can still build them from source.)

Install from Source

CLI Only

The easiest way to install aspeak from source is to use cargo:

cargo install aspeak -F binary

Alternatively, you can also install aspeak from AUR.

Python Wheel

To build the python wheel, you need to install maturin first:

pip install maturin

After cloning the repository and changing into its directory, you can build the wheel by running:

maturin build --release --strip -F python --bindings pyo3 --interpreter python --manifest-path Cargo.toml --out dist-pyo3
maturin build --release --strip --bindings bin -F binary --interpreter python --manifest-path Cargo.toml --out dist-bin
bash merge-wheel.bash

If everything goes well, you will get a wheel file in the dist directory.

Usage

Run aspeak help to see the help message.

Run aspeak help <subcommand> to see the help message of a subcommand.

Authentication

The authentication options should be placed before any subcommand.

For example, to utilize your subscription key and an official endpoint designated by a region, run the following command:

$ aspeak --region <YOUR_REGION> --key <YOUR_SUBSCRIPTION_KEY> text "Hello World"

If you are using a custom endpoint, you can use the --endpoint option instead of --region.
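
When aspeak is driven from a script, the ordering rule above (authentication options first, subcommand last) can be captured in a small helper. The following is a minimal Python sketch, not part of aspeak: `build_aspeak_command` and `speak` are hypothetical names, and the sketch uses only the `--region`, `--endpoint`, `--key`, and `--proxy` options that appear in this README.

```python
import shlex
import subprocess


def build_aspeak_command(text, *, region=None, endpoint=None, key=None, proxy=None):
    """Build an argv list for `aspeak text`, placing global options before the subcommand.

    Hypothetical helper for illustration; it is not part of aspeak itself.
    """
    argv = ["aspeak"]
    # Authentication and connection options go before the subcommand.
    for flag, value in (
        ("--region", region),
        ("--endpoint", endpoint),
        ("--key", key),
        ("--proxy", proxy),
    ):
        if value:
            argv += [flag, value]
    # The subcommand and its own arguments come last.
    argv += ["text", text]
    return argv


def speak(text, **options):
    """Run aspeak; assumes the `aspeak` binary is on PATH."""
    subprocess.run(build_aspeak_command(text, **options), check=True)


# Print the command instead of running it, so nothing is sent to Azure here.
print(shlex.join(build_aspeak_command("Hello World", region="eastus", key="YOUR_KEY")))
```

Passing the arguments as a list avoids shell quoting problems when the text contains spaces or quotes.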

To avoid repetition, you can store your authentication details in your aspeak profile. Read the following section for more details.

From v5.2.0, you can also set the authentication secrets via the following environment variables:

From v4.3.0, you can let aspeak use a proxy server to connect to the endpoint. For now, only http and socks5 proxies are supported (no https support yet). For example:

$ aspeak --proxy http://your_proxy_server:port text "Hello World"
$ aspeak --proxy socks5://your_proxy_server:port text "Hello World"

aspeak also respects the HTTP_PROXY (or http_proxy) environment variable.

Configuration

aspeak v4 introduces the concept of profiles. A profile is a configuration file where you can specify default values for the command line options.

Run the following command to create your default profile:

$ aspeak config init

To edit the profile, run:

$ aspeak config edit

If you have trouble running the above command, you can edit the profile manually:

First, get the path of the profile by running:

$ aspeak config where

Then edit the file with your favorite text editor.

The profile is a TOML file. Check the comments in the file for more information about the available options. The default profile looks like this:

# Profile for aspeak
# GitHub: https://github.com/kxxt/aspeak

# Output verbosity
# 0   - Default
# 1   - Verbose
# The following output verbosity levels are only supported on debug build
# 2   - Debug
# >=3 - Trace
verbosity = 0

#
# Authentication configuration
#

[auth]
# Endpoint for TTS
# endpoint = "wss://eastus.tts.speech.microsoft.com/cognitiveservices/websocket/v1"

# Alternatively, you can specify the region if you are using official endpoints
# region = "eastus"

# Synthesizer Mode, "rest" or "websocket"
# mode = "rest"

# Azure Subscription Key
# key = "YOUR_KEY"

# Authentication Token
# token = "Your Authentication Token"

# Extra http headers (for experts)
# headers = [["X-My-Header", "My-Value"], ["X-My-Header2", "My-Value2"]]

# Proxy
# proxy = "socks5://127.0.0.1:7890"

# Voice list API url
# voice_list_api = "Custom voice list API url"

#
# Configuration for text subcommand
#

[text]
# Voice to use. Note that it takes precedence over the locale
# voice = "en-US-JennyNeural"
# Locale to use
locale = "en-US"
# Rate
# rate = 0
# Pitch
# pitch = 0
# Role
# role = "Boy"
# Style, "general" by default
# style = "general"
# Style degree, a floating-point number between 0.1 and 2.0
# style_degree = 1.0

#
# Output Configuration
#

[output]
# Container Format, Only wav/mp3/ogg/webm is supported.
container = "wav"
# Audio Quality. Run `aspeak list-qualities` to see available qualities.
#
# If you choose a container format that does not support the quality level you specified here, 
# we will automatically select the closest level for you.
quality = 0
# Audio Format(for experts). Run `aspeak list-formats` to see available formats.
# Note that it takes precedence over container and quality!
# format = "audio-16khz-128kbitrate-mono-mp3"

If you want to use a profile other than your default profile, you can use the --profile argument:

aspeak --profile <PATH_TO_A_PROFILE> text "Hello"

If you want to temporarily disable the profile, you can use the --no-profile argument:

aspeak --no-profile --region eastus --key <YOUR_KEY> text "Hello"

Pitch and Rate

Note: Unreasonably high or low values will be clipped to reasonable values by Azure Cognitive Services.

Examples

The following examples assume that you have already set up authentication in your profile.

Speak "Hello, world" through the default speaker.

$ aspeak text "Hello, world"

SSML to Speech

$ aspeak ssml << EOF
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'><voice name='en-US-JennyNeural'>Hello, world!</voice></speak>
EOF
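
If the text comes from user input, it needs XML escaping before it is wrapped in SSML. The following minimal Python sketch builds a document with the same structure as the example above and shows how it could be piped to `aspeak ssml` on stdin. `build_ssml` and `speak_ssml` are hypothetical helper names, and the sketch assumes that aspeak is on PATH and that authentication is set up in your profile.

```python
import subprocess
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Wrap plain text in a minimal SSML document, escaping XML special characters."""
    return (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice></speak>"
    )


def speak_ssml(ssml):
    """Pipe SSML to `aspeak ssml` on stdin, like the here-document above."""
    subprocess.run(["aspeak", "ssml"], input=ssml, text=True, check=True)


# Only build and print the SSML here; call speak_ssml() to actually synthesize.
print(build_ssml("Tom & Jerry say <hi>"))
```

`quoteattr` emits double-quoted attribute values, which is equivalent XML to the single-quoted form used in the example above.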

List all available voices.

$ aspeak list-voices

List all available voices for Chinese.

$ aspeak list-voices -l zh-CN

Get information about a voice.

$ aspeak list-voices -v en-US-SaraNeural
Output

```
Microsoft Server Speech Text to Speech Voice (en-US, SaraNeural)
Display name: Sara
Local name: Sara @ en-US
Locale: English (United States)
Gender: Female
ID: en-US-SaraNeural
Voice type: Neural
Status: GA
Sample rate: 48000Hz
Words per minute: 157
Styles: ["angry", "cheerful", "excited", "friendly", "hopeful", "sad", "shouting", "terrified", "unfriendly", "whispering"]
```

Save synthesized speech to a file.

$ aspeak text "Hello, world" -o output.wav

If you prefer mp3, ogg, or webm, use the -c mp3, -c ogg, or -c webm option.

$ aspeak text "Hello, world" -o output.mp3 -c mp3
$ aspeak text "Hello, world" -o output.ogg -c ogg
$ aspeak text "Hello, world" -o output.webm -c webm

List available quality levels

$ aspeak list-qualities
Output

```
Qualities for MP3:
  3: audio-48khz-192kbitrate-mono-mp3
  2: audio-48khz-96kbitrate-mono-mp3
  -3: audio-16khz-64kbitrate-mono-mp3
  1: audio-24khz-160kbitrate-mono-mp3
  -2: audio-16khz-128kbitrate-mono-mp3
  -4: audio-16khz-32kbitrate-mono-mp3
  -1: audio-24khz-48kbitrate-mono-mp3
  0: audio-24khz-96kbitrate-mono-mp3

Qualities for WAV:
  -2: riff-8khz-16bit-mono-pcm
  1: riff-24khz-16bit-mono-pcm
  0: riff-24khz-16bit-mono-pcm
  -1: riff-16khz-16bit-mono-pcm

Qualities for OGG:
  0: ogg-24khz-16bit-mono-opus
  -1: ogg-16khz-16bit-mono-opus
  1: ogg-48khz-16bit-mono-opus

Qualities for WEBM:
  0: webm-24khz-16bit-mono-opus
  -1: webm-16khz-16bit-mono-opus
  1: webm-24khz-16bit-24kbps-mono-opus
```

List available audio formats (For expert users)

$ aspeak list-formats
Output

```
amr-wb-16000hz
audio-16khz-128kbitrate-mono-mp3
audio-16khz-16bit-32kbps-mono-opus
audio-16khz-32kbitrate-mono-mp3
audio-16khz-64kbitrate-mono-mp3
audio-24khz-160kbitrate-mono-mp3
audio-24khz-16bit-24kbps-mono-opus
audio-24khz-16bit-48kbps-mono-opus
audio-24khz-48kbitrate-mono-mp3
audio-24khz-96kbitrate-mono-mp3
audio-48khz-192kbitrate-mono-mp3
audio-48khz-96kbitrate-mono-mp3
ogg-16khz-16bit-mono-opus
ogg-24khz-16bit-mono-opus
ogg-48khz-16bit-mono-opus
raw-16khz-16bit-mono-pcm
raw-16khz-16bit-mono-truesilk
raw-22050hz-16bit-mono-pcm
raw-24khz-16bit-mono-pcm
raw-24khz-16bit-mono-truesilk
raw-44100hz-16bit-mono-pcm
raw-48khz-16bit-mono-pcm
raw-8khz-16bit-mono-pcm
raw-8khz-8bit-mono-alaw
raw-8khz-8bit-mono-mulaw
riff-16khz-16bit-mono-pcm
riff-22050hz-16bit-mono-pcm
riff-24khz-16bit-mono-pcm
riff-44100hz-16bit-mono-pcm
riff-48khz-16bit-mono-pcm
riff-8khz-16bit-mono-pcm
riff-8khz-8bit-mono-alaw
riff-8khz-8bit-mono-mulaw
webm-16khz-16bit-mono-opus
webm-24khz-16bit-24kbps-mono-opus
webm-24khz-16bit-mono-opus
```

Increase/Decrease audio qualities

# Less than default quality.
$ aspeak text "Hello, world" -o output.mp3 -c mp3 -q=-1
# Best quality for mp3
$ aspeak text "Hello, world" -o output.mp3 -c mp3 -q=3
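
To see which audio format a given container and quality level correspond to, the `aspeak list-qualities` output can be turned into a lookup table. The sketch below is illustrative only: `resolve_format` is a hypothetical function, and reading "closest level" as clamping to the supported range is an assumption, not a description of aspeak's actual code.

```python
# Quality tables copied from the `aspeak list-qualities` output shown above.
QUALITIES = {
    "mp3": {
        3: "audio-48khz-192kbitrate-mono-mp3",
        2: "audio-48khz-96kbitrate-mono-mp3",
        1: "audio-24khz-160kbitrate-mono-mp3",
        0: "audio-24khz-96kbitrate-mono-mp3",
        -1: "audio-24khz-48kbitrate-mono-mp3",
        -2: "audio-16khz-128kbitrate-mono-mp3",
        -3: "audio-16khz-64kbitrate-mono-mp3",
        -4: "audio-16khz-32kbitrate-mono-mp3",
    },
    "wav": {
        1: "riff-24khz-16bit-mono-pcm",
        0: "riff-24khz-16bit-mono-pcm",
        -1: "riff-16khz-16bit-mono-pcm",
        -2: "riff-8khz-16bit-mono-pcm",
    },
    "ogg": {
        1: "ogg-48khz-16bit-mono-opus",
        0: "ogg-24khz-16bit-mono-opus",
        -1: "ogg-16khz-16bit-mono-opus",
    },
    "webm": {
        1: "webm-24khz-16bit-24kbps-mono-opus",
        0: "webm-24khz-16bit-mono-opus",
        -1: "webm-16khz-16bit-mono-opus",
    },
}


def resolve_format(container, quality):
    """Map (container, quality) to a format name, clamping out-of-range levels.

    Clamping is an assumed interpretation of "closest level".
    """
    levels = QUALITIES[container]
    clamped = max(min(levels), min(quality, max(levels)))
    return levels[clamped]


print(resolve_format("mp3", 3))
print(resolve_format("wav", 3))  # level 3 is not listed for WAV, so it is clamped to 1
```

If you need one exact format regardless of container, pass it directly with the -F option described under Advanced Usage.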

Read text from file and speak it.

$ cat input.txt | aspeak text

or

$ aspeak text -f input.txt

with custom encoding:

$ aspeak text -f input.txt -e gbk

Read from stdin and speak it.

$ aspeak text

Or, if you prefer a here-document:

$ aspeak text -l zh-CN << EOF
我能吞下玻璃而不伤身体。
EOF

Speak Chinese.

$ aspeak text "你好,世界!" -l zh-CN

Use a custom voice.

$ aspeak text "你好,世界!" -v zh-CN-YunjianNeural

Custom pitch, rate and style

$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p 1.5 -r 0.5 -S sad
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=-10% -r=+5% -S cheerful
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=+40Hz -r=1.2f -S fearful
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=high -r=x-slow -S calm
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=+1st -r=-7% -S lyrical

Advanced Usage

Use a custom audio format for output

Note: Some audio formats are not supported when outputting to speaker.

$ aspeak text "Hello World" -F riff-48khz-16bit-mono-pcm -o high-quality.wav

Library Usage

Python

The new version of aspeak is written in Rust, and the Python binding is provided by PyO3.

Here is a simple example:

from aspeak import SpeechService

service = SpeechService(region="eastus", key="YOUR_AZURE_SUBSCRIPTION_KEY")
service.speak_text("Hello, world")

First you need to create a SpeechService instance.

When creating a SpeechService instance, you can specify the following parameters:

After that, you can call speak_text() to speak the text or speak_ssml() to speak the SSML. Or you can call synthesize_text() or synthesize_ssml() to get the audio data.

For synthesize_text() and synthesize_ssml(), if you provide an output, the audio data will be written to that file and the function will return None. Otherwise, the function will return the audio data.
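
The two return conventions can be seen side by side in a short sketch. It assumes that the keyword argument is named `output` and that the returned audio data is bytes-like; both are assumptions based on the description above, so check the binding's own documentation. `synthesize_both_ways` and `demo` are hypothetical helper names.

```python
def synthesize_both_ways(service, text, path):
    """Call synthesize_text() with and without an output path.

    `service` is expected to behave like aspeak's SpeechService.
    The keyword name `output` is an assumption (see the lead-in).
    """
    # With an output path: the audio is written to the file, and None is returned.
    file_result = service.synthesize_text(text, output=str(path))
    # Without an output path: the audio data itself is returned.
    data = service.synthesize_text(text)
    return file_result, data


def demo():
    """Not called automatically: needs aspeak installed and a valid Azure key."""
    from aspeak import SpeechService

    service = SpeechService(region="eastus", key="YOUR_AZURE_SUBSCRIPTION_KEY")
    file_result, data = synthesize_both_ways(service, "Hello, world", "hello.wav")
    print(file_result is None, len(data))
```

Call `demo()` yourself once your key and region are filled in. It makes two requests, so each run counts the text twice against your quota.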

Here are the common options for speak_text() and synthesize_text():

Rust

Add aspeak to your Cargo.toml:

$ cargo add aspeak

Then follow the documentation of the aspeak crate.

There are 4 examples for quick reference: