In 2014, Judith Manzoni recorded a multilingual Luxembourgish/French/German speech database at Saarland University for the MaryTTS project. The audio data is provided in a single FLAC file, recorded at a 48 kHz sampling frequency with 16 bits per sample. The transcriptions are provided in a single YAML file. The data is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The dataset includes the following transcribed audio clips:
I optimized this dataset to create a Luxembourgish synthetic voice by training a deep machine learning system based on neural networks. The following transformations were applied:
Samples where the standard deviation between audio length and text length exceeded 0.8 were removed after the final quality check
The result is a new database with 648 samples, called Marylux-648-TTS-Corpus.
The different transformation steps are described in detail in the next chapter.
There are numerous tools and libraries available to modify the properties of an audio file that can be used in a Bash or Python script, for example ffmpeg, sox, librosa, ... I used the `resample.py` script from Coqui-TTS, based on librosa, to process the Marylux dataset. Here is the related command for my environment:
```bash
python TTS/bin/resample.py --input_dir /workspace/myTTS-Project/datasets/marylux/wav48000/ --output_dir /workspace/myTTS-Project/datasets/marylux/wav22050/ --output_sr 22050
```
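Under the hood, this step essentially loads each clip at the target rate and writes it back. Here is a minimal sketch of such a resampling loop with librosa; the folder paths are placeholders, and this is a simplified stand-in, not the actual `resample.py` logic:

```python
import librosa
import soundfile as sf
from pathlib import Path

# Hypothetical input/output folders; adjust to your environment.
input_dir = Path("datasets/marylux/wav48000")
output_dir = Path("datasets/marylux/wav22050")
output_dir.mkdir(parents=True, exist_ok=True)

for wav_path in input_dir.glob("*.wav"):
    # librosa resamples on load when an explicit sr is given.
    audio, sr = librosa.load(wav_path, sr=22050)
    sf.write(output_dir / wav_path.name, audio, sr)
```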
The next figure shows a screenshot from the free, open-source, cross-platform audio software Audacity, showing a typical audio clip with long silence periods before and after the speech signal.
figure 1
Long silence periods disturb the deep machine learning TTS training. The tools and software introduced above can also be used to remove silence from audio clips. Here is a typical Bash command using `sox` to remove silence and resample all audio clips in a folder in one go:
```bash
for file in wavs/*.wav; do sox "$file" "output/$(basename "$file")" silence 1 0.01 1% reverse silence 1 0.01 1% reverse rate -h 22050 norm -0.1 pad 0.05 0.05; done
```
The following figure shows the trimmed and normalized audio clip:
figure 2
The deep machine learning TTS training is sensitive to the level of the audio signal. To avoid differences in volume between the clips of a TTS dataset, the levels should be normalized. This can be done with the same tools and programs introduced before. We must distinguish between peak and RMS levels. The peak level is defined by the highest peaks within the signal, independently of the amount of energy they represent. The audio signal shown in figure 2 has been normalized to a full-scale peak level. During TTS training this can lead to out-of-range amplitudes and auto-clipping.
A better reference for TTS training is RMS (root mean square), the average loudness of the waveform as a whole. Broadcasters and streaming providers like YouTube or Spotify measure and normalize loudness in LUFS, which is similar to RMS. The EBU recommendation R128 (= ITU-R BS.1770) specifies the technical details of the loudness normalization. I used the Python script `loudness.py` to normalize the audio clips of the Marylux dataset to a reference level of -25 dB. The next figure shows the normalized clips in the Audacity program:
figure 3
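For illustration, here is a minimal sketch of RMS-based normalization to -25 dB. This is a simplified stand-in, not the actual `loudness.py` script, which follows the EBU R128 loudness measurement; the file names are hypothetical:

```python
import numpy as np
import soundfile as sf

def normalize_rms(in_path, out_path, target_db=-25.0):
    """Scale a clip so its RMS level matches target_db (dBFS)."""
    audio, sr = sf.read(in_path)
    rms = np.sqrt(np.mean(audio ** 2))
    target_rms = 10 ** (target_db / 20)
    gain = target_rms / max(rms, 1e-9)  # guard against silent clips
    sf.write(out_path, np.clip(audio * gain, -1.0, 1.0), sr)

normalize_rms("lb-wiki-0001.wav", "normalized/lb-wiki-0001.wav")
```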
The audio splitting was done manually in Audacity. To calculate the size of an uncompressed audio file, we multiply the bit rate of the audio (352.8 kbps) by its duration in seconds. A Marylux audio file of 10 seconds therefore has a size of 441 KB. If we sort the audio files in a folder by size, it's easy to select all files exceeding 440 KB and import them into Audacity (a small script to flag such files is sketched after the list below). I repeated the following process for all samples:
export the two file parts with the old and new filename
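The size-based selection can also be scripted. A minimal sketch, assuming a `wavs/` folder of uncompressed mono 22050 Hz / 16 bit files:

```python
from pathlib import Path

# At 22050 Hz / 16 bit / mono, one second of audio is 44100 bytes,
# so clips longer than ~10 s exceed roughly 441 KB.
MAX_BYTES = 440_000

for wav_path in sorted(Path("wavs").glob("*.wav")):
    size = wav_path.stat().st_size
    if size > MAX_BYTES:
        print(f"{wav_path.name}: {size / 1000:.0f} KB")
```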
The next figure shows the process in the Audacity window for sample lb-wiki-0543.wav.
figure 4
Some TTS models fail while training on single words, or they ignore them. To avoid these problems I assembled the related audio clips and CSV rows manually with Audacity and a text editor. I named the 12 new clips lb-words-a.wav, lb-words-b.wav, up to lb-words-l.wav.
Bad audio quality with much noise is a no-go for deep machine learning TTS training. Breath, coughs, stutter, background noise, echoes and other disturbing sounds present great challenges for TTS model training and must be discarded. There are several tools and Python libraries available to denoise audio clips, but in my trials none of them provided good results without manual supervision. My favorite tool is the Audacity noise reduction plugin. By selecting a noisy region in the audio track you can define and save a noise profile. The effect of reducing noise based on this profile can be tested in a preview and applied if the result is satisfactory.
figure 5
Fortunately, the original Marylux audio files are of high quality, and I was able to remove the few disturbing sounds manually in Audacity during the sound check done for the text correction.
To check whether the text and audio of the resulting 660 samples are congruent, I used the following tool arrangement on my desktop PC:
figure 6
I imported the audio clips into Audacity and looped through the different tracks to listen to the speech and compare it with the text in the `metadata.csv` file, displayed in a text editor. Some remaining errors were corrected. At the end, the database was ready for a final automatic quality check.
The final quality check was done with the notebook `TTS/notebooks/dataset_analysis/AnalyzeDataset.ipynb` provided by Coqui-ai. This program checks that all wav files listed in the `metadata.csv` file are available and unique (no duplicates), calculates mean and median values for audio and text lengths, counts the number of different words in the dataset (3,668) and plots the results. The next figure shows the plotted graph of the standard deviation between audio lengths and character counts.
figure 7
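The core of this consistency check can be sketched in a few lines. The following is a simplified illustration, assuming a pipe-separated `metadata.csv` and a `wavs/` folder with matching file names; the actual notebook computes and plots more statistics:

```python
import csv
import soundfile as sf

# Relate audio duration to transcription length for each sample.
ratios = []
with open("metadata.csv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
        sample_id, text = row[0], row[1]
        info = sf.info(f"wavs/{sample_id}.wav")
        ratios.append((sample_id, info.frames / info.samplerate / len(text)))

# Samples whose seconds-per-character ratio deviates most from the
# mean are the first candidates for manual inspection.
mean = sum(r for _, r in ratios) / len(ratios)
for sample_id, r in sorted(ratios, key=lambda x: abs(x[1] - mean), reverse=True)[:10]:
    print(f"{sample_id}: {r:.3f} s/char")
```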
For best results with the deep machine learning TTS training, a standard deviation of less than 0.8 is recommended. I identified the samples out of scope and analyzed the related audio clips and transcriptions. In most cases the reason for the deviation was obvious. An example is shown below:
figure 8
Due to the silences between the single words, separated by commas, the audio length is very high in comparison to the character count. Spectrograms can be a great help when checking the audio quality of samples where the reason for the deviation is not evident. A great tool is Sonogram Visible Speech, version 5. The following figure gives an overview of the features of this software.
figure 9
To assure high quality, I removed the following 12 samples from the intermediate Marylux-660 corpus, based on the measurement results:
The following figures show the plotted results for the validated Marylux database with 648 samples.
figure 10
figure 11
A deep machine learning TTS model is trained with tensors: sequences of integers created by converting the symbols of the samples into indices. The symbols can be Latin characters; Arabic, Greek or Russian letters; Japanese or Chinese ideograms and logograms; phonemes; or even emojis, and much more. The conversion is commonly done by calculating the position (index) of a symbol, extracted from the input sample, in a predefined symbol list. Some examples are shown below:
```
93_symbols = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:_;? #äöüïëéèêäâç@ɑɛæəɜɪʊœɛ̃ŋʀʦʒɕʑʁ"
input_letters = "De Nordwand an d'Sonn."
tensor = [3 30 64 13 40 43 29 48 26 39 29 64 26 39 64 29 53 18 40 39 39 58]
input_phonemes = "də noʀtvɑnt ɑn dzon."
tensor = [29 80 64 39 40 87 45 47 77 39 29 64 77 39 29 51 40 39 58]
input_phonemes_with_blanks = "_d_ə_ _n_o_ʀ_t_v_ɑ_n_t_ _ɑ_n_ _d_z_o_n_._"
tensor = [29 60 80 60 64 60 39 60 40 60 87 60 45 60 47 60 77 60 39 60 29 60 64 60 77 60 39 60 29 60 51 60 40 60 39 60 58]
```
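The lookup itself is a one-liner. Here is a minimal sketch of this symbol-to-index conversion; the symbol list is shortened for readability (the full Marylux list has 93 symbols including the IPA characters):

```python
# Predefined symbol list; the index of each symbol is its ID.
symbols = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:_;? #"

def text_to_sequence(text):
    """Convert a text sample into a sequence of symbol indices."""
    return [symbols.index(ch) for ch in text if ch in symbols]

print(text_to_sequence("De Nordwand an d'Sonn."))
```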
In the past, alphabetic systems of phonetic notation have been used for TTS voice synthesis. The first pseudo-standards for phonetic notation, for example Kirshenbaum and SAMPA, have been progressively replaced by the International Phonetic Alphabet (IPA), based primarily on the Latin script. To generate phonemes from letters, a conversion program is required. Initially these programs were rule-based. Currently these converters, called g2p (grapheme-to-phoneme) models, are also trained by deep machine learning. An automatic phonetic transcription tool for Luxembourgish, created by Peter Gilles, is available at the Luxembourgish web portal of the University of Luxembourg.
figure 12
The Luxembourgish Online Dictionary (LOD), maintained by the Zenter fir d'Lëtzebuerger Sprooch (ZLS), provides phonetic transcriptions for most Luxembourgish words.
figure 13
As both the phonemizer and the voice models are based on deep machine learning with neural networks and tensors, a legitimate question is why we should run two sequential trainings: first to convert letters into phonemes, and afterwards to convert phonemes via indices (integers) into audio signals. Why not transform graphemes into audio signals in one training process? Most recent TTS models adopt this option, and the resulting speech quality is even better than with the classic procedure, but more computing power and more training time are required to get valid results.
The Marylux-648 dataset can be used for both learning options.
eSpeak-NG and Rhasspy-Gruut are two well-known open-source phonemizers which are used by numerous TTS projects. A few months ago I developed the code to integrate the Luxembourgish language into eSpeak-NG. The code was merged into the main eSpeak-NG project with my GitHub pull request #1038 on November 11, 2021. Now Luxembourgish is the 127th language supported by eSpeak-NG. A Luxembourgish voice, based on formant synthesis techniques, is part of my package. The voice is intelligible, but of low quality. I did no sound optimization because my focus was on the rule-based phonemization front-end. The eSpeak-NG lb-phonemizer includes a Luxembourgish emoji dictionary which translates some children emojis into the names of my grandchildren. Some animal graphics and other emojis are also converted to the related Luxembourgish phonetic transcriptions. Two examples of sentences which can be handled by eSpeak-NG-lb are shown below:
Haut sinn ☝ mat mengen Enkelkanner 🧑🤝🧑 , 👦 , 👧 , an 👩 an den 🎪 gaangen. Do hunn mer e 🦍, eng 🦒, en 🐘 an en 🦏 gesinn.
An der 🕰 hunn sech den 🧭💨 an d’🌞 gestridden, wie vun hinnen zwee wuel méi 💪 wier, wéi e 🚶, deen an ee waarme 🧥 agepak war, iwwert de 🛤 koum.
The integration of the Luxembourgish language into the Gruut phonemizer is more recent. My code to support Luxembourgish was merged into the gruut-ipa repository with my GitHub pull request #7 on November 10, 2021. My main code was merged into the gruut project with my GitHub pull request #18 on December 6, 2021.
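Once a gruut version with Luxembourgish support is installed, phonemizing a sentence takes only a few lines. A small sketch of the gruut Python API, assuming the `lb` language package is available in your installation:

```python
from gruut import sentences

text = "An der Zäit hunn sech den Nordwand an d'Sonn gestridden."

# gruut tokenizes the text into sentences and words and attaches
# the phonemes of each word.
for sent in sentences(text, lang="lb"):
    for word in sent:
        if word.phonemes:
            print(word.text, "→", " ".join(word.phonemes))
```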
The Luxembourgish phoneme list used in both phonemizers is the following:
vowels | words | diphthongs | words | monophthongs (loanwords) | words |
---|---|---|---|---|---|
ɑ | k[a]pp | æːɪ | z[äi]t | y | conj[u]gaisoun |
aː | k[a]p | ɑʊ | [au]to | yː | s[ü]den |
ɛː | st[ä]ren | æːʊ | r[au]m | ãː | restaur[ant] |
e | m[é]ck | ɑɪ | l[ei]t | õː | sais[on] |
æ | h[e]ll | ɜɪ | fr[éi] | ɛ̃ː | cous[in] |
eː | k[ee]ss | oɪ | [eu]ro | œː | interi[eu]r |
ə | n[e]t | iə | h[ie]n | | |
ɐ | kann[er] | əʊ | sch[ou]l | | |
i | m[i]dd | uə | b[ue]dem | | |
iː | l[ii]cht | | | | |
o | spr[o]ch | | | | |
oː | spr[oo]ch | | | | |
u | g[u]tt | | | | |
uː | d[uu]scht | | | | |
consonants | words | consonants | words |
---|---|---|---|
Nasals | | Affricates | |
m | [m]a[mm] | ʦ | schwä[tz]en |
n | ma[nn] | dʒ | bu[dg]et |
ŋ | ke[ng] | Fricatives | |
Plosives | | f | [f]ësch |
p | [p]aken | v | [v]akanz |
b | [b]aken | w | sch[w]aarz |
t | blu[tt] | s | taa[ss] |
d | [d]äiwel | z | [s]ummer |
k | [k]eess | ʃ | bii[sch]t |
g | [g]eess | ʒ | pro[j]et |
Approximants | | ɕ | lii[ch]t |
l | [l]oft | ʁ | ku[g]el |
j | [j]o | ʑ | spi[g]el |
Trills | | h | [h]ei |
ʀ | [r]ou | | |
Here is the associated Luxembourgish phonetic dictionary, based on the Luxembourgish language resources provided by Peter Gilles on GitHub. I made some corrections, modifications and additions.
Full support of the Luxembourgish language by the big TTS projects with an embedded eSpeak-NG or Gruut phonemizer will only be assured when these projects update their code base to the latest versions of the concerned dependencies. In the meantime, the Luxembourgish phonemes must be provided in external training and validation files, and some hacking is required to feed these files as input to the TTS models for training.
For this purpose I prepared different Marylux-648 dataset versions which are described in the next chapters.
The reference for the text format of the Marylux transcription file is the public-domain dataset LJSpeech. All text samples are assembled in one file called `metadata.csv`. Each row contains three columns, separated by the pipe symbol `|`: the sample ID, the original text, and the text in the format used for training (normalized letters, phonemes, phonemes with blanks, or phoneme IDs).
Here are some simple examples:
```
marylux_lb-wiki-0473|Dës éischt Versioun hat nëmmen 3 Strofen.|dës éischt versioun hat nëmmen dräi strofen.
marylux_lb-wiki-0007|D'Bréck hat 4 Béi mat Feldwäite vun 18,33 m.|dbʀæk haːt fɜɪʁ bɜɪ mɑ feːltvæːɪtə fun uəɕtʦeɳ komaː dʀæːɪɑndʀəsəɕ meːtʁ.
marylux_lb-wiki-0140|De Rouscht ass eng Uertschaft an der Gemeng Biissen.|d_ə_ _ʀ_əʊ_ʃ_t_ _ɑ_s_ _æ_ɳ_ _u_ʁ_tʃ_ɑ_f_t_ _ɑ_n_ _d_ʁ_ _g_ə_m_æ_ɳ_ _b_iː_s_ə_n_.
marylux_lb-wiki-0171|Der Kore hire Papp Zeus hat sech a si verléift.|14 15 28 21 25 28 15 18 19 28 15 26 11 26 26 36 15 31 29 18 11 30 29 15 13 18 11 29 19 32 15 28 22 44 19 16 30 8
```
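Reading such a file is straightforward. A minimal sketch, assuming the three-column Marylux layout; quoting is disabled because the transcriptions may contain quote characters:

```python
import csv

with open("metadata.csv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
    for sample_id, raw_text, train_text in reader:
        # sample_id (e.g. marylux_lb-wiki-0473) maps to a wav file;
        # train_text holds the letters, phonemes or phoneme IDs.
        print(sample_id, train_text)
```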
The checked and validated Marylux TTS database contains 648 Luxembourgish samples. Additionally, a list of 6 Luxembourgish sentences, based on Aesop's fable The North Wind and the Sun, is provided for synthesizing tests during the training:
1. An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum.
2. Si goufen sech eens, datt deejéinege fir dee Stäerkste gëlle sollt, deen de Wanderer forcéiere géif, säi Mantel auszedoen.
3. Den Nordwand huet mat aller Force geblosen, awer wat e méi geblosen huet, wat de Wanderer sech méi a säi Mantel agewéckelt huet.
4. Um Enn huet den Nordwand säi Kampf opginn.
5. Dunn huet d’Sonn d’Loft mat hire frëndleche Strale gewiermt, a schonn no kuerzer Zäit huet de Wanderer säi Mantel ausgedoen.
6. Do huet den Nordwand missen zouginn, datt d’Sonn vun hinnen zwee dee Stäerkste wier.
The total duration of the audio clips is 57 minutes, 31 seconds.
Two archives of the Marylux-648 database are available for download in the release section of the present GitHub repository.
An archive includes the following content:
The audio files sampled at 22050 Hz are best suited to train mono-speaker TTS models; those sampled at 16000 Hz are suited for multi-speaker models, together with other Luxembourgish speech datasets.
A batch script to download, decompress, shuffle, split and install these archives is stored in the scripts folder. You must set several parameters in the script to install the files with the required features (letters, phonemes, phoneme-ids, ..) and formats.
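The shuffle and split steps can also be reproduced directly in Python. A minimal sketch, assuming an LJSpeech-style `metadata.csv` and a hypothetical 90/10 train/validation split; the actual script's parameters and ratios may differ:

```python
import random

with open("metadata.csv", encoding="utf-8") as f:
    rows = f.readlines()

random.seed(42)        # fixed seed for a reproducible shuffle
random.shuffle(rows)

split = int(len(rows) * 0.9)   # hypothetical 90/10 split
with open("train.csv", "w", encoding="utf-8") as f:
    f.writelines(rows[:split])
with open("val.csv", "w", encoding="utf-8") as f:
    f.writelines(rows[split:])
```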
A good split of this database for machine learning is the following: