Closed mjonsson1986 closed 8 months ago
This is an example of the only Swedish dataset I have available: https://rhasspy.github.io/larynx/#sv-se_talesyntese-glow_tts
I do plan to train a Piper voice for it, but if anyone is interested in volunteering to donate their voice please e-mail me at voice@nabucasa.com :slightly_smiling_face:
There is a Swedish speech synthesis dataset from the same source as the ones you have already used to train Norwegian and Danish, @synesthesiam. It consists of about 5300 recordings of a single male voice.
The documentation for it (in Norwegian) can be found in the following pdf in section 8. The direct download link for the dataset: http://www.nb.no/sbfil/talesyntese/sve.ibm.talesyntese.tar.gz
It can also be accessed via clarin.eu.
There is also a Huggingface dataset that can serve as an example of how to read the audio files. I have used this Huggingface dataset myself to create an LJSpeech format dataset. Here's my code:
import os
import librosa
import soundfile as sf
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm

# Download the NST Swedish TTS dataset from the Hugging Face Hub (cached under data/)
dataset = load_dataset("jimregan/nst_swedish_tts", cache_dir="data")
os.makedirs("data/nst_tts/wavs", exist_ok=True)


def write_wav(example):
    # Resample from 44.1 kHz (the rate this code assumes for the source audio) to 22.05 kHz
    audio = librosa.resample(example["audio"]["array"], orig_sr=44100, target_sr=22050)
    # Write one WAV file per clip, named after the clip's file stem
    sf.write(f"data/nst_tts/wavs/{example['file_stem']}.wav", audio, 22050, format="WAV")


dataset.map(write_wav, num_proc=16)

filestems = dataset["train"]["file_stem"]
text = dataset["train"]["text"]
# The raw text is reused as the normalized transcription
normalized_text = dataset["train"]["text"]

# Export as LJSpeech format
df = pd.DataFrame({"id": filestems, "transcription": text, "normalized_transcription": normalized_text})
# Pipe separated values without header
df.to_csv("data/nst_tts/metadata.csv", sep="|", index=False, header=False)
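
A quick sanity check on the exported file can catch rows that will not parse as three pipe-separated fields, for example when a transcription itself contains a `|`. The following is only a minimal sketch using the standard library: the function name `parse_ljspeech_metadata` and the sample IDs and sentences are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def parse_ljspeech_metadata(text):
    """Parse pipe-separated metadata into (id, transcription, normalized) tuples.

    Quote characters are treated as ordinary text (QUOTE_NONE), so a row that
    contains an extra pipe shows up as a wrong field count.
    """
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    rows = []
    for line_no, row in enumerate(reader, start=1):
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(row)}")
        rows.append(tuple(row))
    return rows


# Hypothetical sample in the same layout as metadata.csv (IDs and text are invented)
sample = (
    "clip_0001|Hej världen.|Hej världen.\n"
    "clip_0002|Det här är ett test.|Det här är ett test.\n"
)
rows = parse_ljspeech_metadata(sample)
print(len(rows), rows[0][0])
```

To check the real file, read it with `open("data/nst_tts/metadata.csv", encoding="utf-8").read()` and pass the contents to the same function.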
I have been pretraining a Swedish TTS voice using this dataset and the piper library for the past week. It's training at medium quality, and it is currently at epoch 3527. I think it sounds pretty decent already.
What's the best way for me to contribute the weights to piper, @synesthesiam?
@Lauler If you can share the config.json and the latest checkpoint, that would be great :slightly_smiling_face:
I have uploaded the model checkpoint and config file here:
https://huggingface.co/KBLab/piper-tts-nst-swedish/tree/main
Both the weights and the data are under CC0. But I would very much appreciate it if you added a line in the model card saying that the model was contributed by KBLab at the National Library of Sweden!
Thank you @Lauler! It's available now here: https://rhasspy.github.io/piper-samples/#sv-se
It seems to have trouble with "fenomen", much like the German model. It should be /fɛnʊmˈeːn/ ( https://sv.wiktionary.org/wiki/fenomen ). Is it an espeak-ng issue? It's not perfect on "meteorologiskt" and "nyanser" either.
The rest sounds good.
@mjonsson1986 The model was trained on about 5300 audio clips from a single male Swedish speaker. They were recorded by a Norwegian company called Nordisk Språkteknologi (NST) for the specific purpose of training speech synthesis systems. The company went bankrupt in 2003. After the bankruptcy, its data and technology were acquired by a couple of Norwegian universities and IBM. Later on the data was made freely available.
One component of this system is espeak-ng, which I think guides the model phonetically, but I'm not sure to what extent piper relies on espeak-ng.
Here's what espeak-ng gives for pronunciations:
echo 'fenomen meteorologiskt nyanser' | xargs -n1 espeak-ng -v sv --ipa=3
fˈeːnuːmən
mˌeːtəˌuːruːlˈoːɡɪskt
nˈyːansər
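
To make the "fenomen" mismatch concrete, here is a rough sketch that compares where the primary stress mark falls in the espeak-ng output above and in the Wiktionary transcription quoted earlier. It assumes the voice follows espeak-ng's phonemes, as suggested above. The helper names (`espeak_command`, `espeak_ipa`, `split_at_stress`) are made up for this example. The `-q` flag only suppresses audio playback.

```python
import shutil
import subprocess

PRIMARY_STRESS = "ˈ"  # IPA primary stress mark (U+02C8)


def espeak_command(word, voice="sv"):
    """Build the espeak-ng call used above, plus -q so no audio is played."""
    return ["espeak-ng", "-q", "-v", voice, "--ipa=3", word]


def espeak_ipa(word, voice="sv"):
    """Return espeak-ng's IPA for a word, or None if espeak-ng is not installed."""
    if shutil.which("espeak-ng") is None:
        return None
    result = subprocess.run(
        espeak_command(word, voice), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def split_at_stress(ipa):
    """Split an IPA string into the parts before and after the primary stress mark."""
    before, _, after = ipa.strip("/").partition(PRIMARY_STRESS)
    return before, after


# Transcriptions copied from this thread
ESPEAK_FENOMEN = "fˈeːnuːmən"
WIKTIONARY_FENOMEN = "/fɛnʊmˈeːn/"

espeak_before, espeak_after = split_at_stress(ESPEAK_FENOMEN)
wikt_before, wikt_after = split_at_stress(WIKTIONARY_FENOMEN)
print("espeak-ng stress starts after:", repr(espeak_before))
print("Wiktionary stress starts after:", repr(wikt_before))
```

In the espeak-ng output the stress mark comes right after the initial "f", while in the Wiktionary transcription it comes before the final syllable, which would be consistent with the problem starting at the phonemization step. Calling `espeak_ipa("fenomen")` on a machine with espeak-ng installed reruns the lookup.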
A Swedish voice is now available here: https://huggingface.co/rhasspy/piper-voices/tree/main/sv/sv_SE