rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Will there be a Swedish voice soon? #72

Closed mjonsson1986 closed 8 months ago

mjonsson1986 commented 1 year ago

We already have Danish and Norwegian. Will there be a Swedish voice soon too?

synesthesiam commented 1 year ago

This is an example of the only Swedish dataset I have available: https://rhasspy.github.io/larynx/#sv-se_talesyntese-glow_tts

I do plan to train a Piper voice for it, but if anyone is interested in volunteering to donate their voice please e-mail me at voice@nabucasa.com :slightly_smiling_face:

Lauler commented 1 year ago

@synesthesiam There is a Swedish speech synthesis dataset from the same source as the ones you have already used to train Norwegian and Danish. It consists of about 5300 recordings of the same male voice.

The documentation for it (in Norwegian) can be found in the following PDF, in section 8. The direct download link for the dataset: http://www.nb.no/sbfil/talesyntese/sve.ibm.talesyntese.tar.gz

It can also be accessed via clarin.eu.

There is also a Hugging Face dataset that can serve as an example of how to read the audio files. I have used this Hugging Face dataset myself to create an LJSpeech-format dataset. Here's my code:

import os

import librosa
import pandas as pd
import soundfile as sf
from datasets import load_dataset

dataset = load_dataset("jimregan/nst_swedish_tts", cache_dir="data")

os.makedirs("data/nst_tts/wavs", exist_ok=True)

TARGET_SR = 22050  # output sample rate in Hz


def write_wav(example):
    # Resample from the clip's own rate (44.1 kHz in this dataset) to 22.05 kHz.
    audio = librosa.resample(
        example["audio"]["array"],
        orig_sr=example["audio"]["sampling_rate"],
        target_sr=TARGET_SR,
    )
    sf.write(f"data/nst_tts/wavs/{example['file_stem']}.wav", audio, TARGET_SR, format="WAV")


# write_wav returns None, so map() is used only for its side effect of writing files.
dataset.map(write_wav, num_proc=16)

filestems = dataset["train"]["file_stem"]
text = dataset["train"]["text"]
# The same text is reused for the normalized column.
normalized_text = dataset["train"]["text"]

# Export as LJSpeech format
df = pd.DataFrame({"id": filestems, "transcription": text, "normalized_transcription": normalized_text})

# Pipe separated values without header
df.to_csv("data/nst_tts/metadata.csv", sep="|", index=False, header=False)
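
Before training, it may be worth sanity-checking the export. The following is only a minimal sketch, assuming the layout produced by the script above (a metadata.csv next to a wavs/ directory); check_ljspeech is a hypothetical helper name, not part of piper or the datasets library. It flags rows whose field count is off (for example, when a transcription itself contains a "|"), rows with an empty transcription, and rows whose audio file is missing.

```python
from pathlib import Path
from typing import List


def check_ljspeech(dataset_dir: str, expected_fields: int = 3) -> List[str]:
    """Return human-readable problems found in <dataset_dir>/metadata.csv."""
    root = Path(dataset_dir)
    problems: List[str] = []
    with open(root / "metadata.csv", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            # Split naively on "|", since that is the field separator used above.
            fields = line.rstrip("\n").split("|")
            if len(fields) != expected_fields:
                problems.append(
                    f"line {line_number}: expected {expected_fields} fields, got {len(fields)}"
                )
                continue
            # The first field is the file stem of the matching clip in wavs/.
            if not (root / "wavs" / f"{fields[0]}.wav").is_file():
                problems.append(f"line {line_number}: missing wavs/{fields[0]}.wav")
            if not fields[1].strip():
                problems.append(f"line {line_number}: empty transcription")
    return problems
```

Calling check_ljspeech("data/nst_tts") returns an empty list when every row has three fields, a non-empty transcription, and a matching audio file.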

I have been pretraining a Swedish TTS voice using this dataset and the piper library for the past week. It's training at medium quality, and it is currently at epoch 3527. I think it sounds pretty decent already.

What's the best way for me to contribute the weights to piper, @synesthesiam?

synesthesiam commented 1 year ago

@Lauler If you can share the config.json and the latest checkpoint, that would be great :slightly_smiling_face:

Lauler commented 1 year ago

I have uploaded the model checkpoint and config file here:

https://huggingface.co/KBLab/piper-tts-nst-swedish/tree/main

Both the weights and the data are under CC0, but I would very much appreciate it if you added a line to the model card saying that the model was contributed by KBLab at the National Library of Sweden!

synesthesiam commented 1 year ago

Thank you @Lauler! It's available now here: https://rhasspy.github.io/piper-samples/#sv-se

mjonsson1986 commented 1 year ago

Are those voices based on human speech? It sounds almost like a real person. I'm totally blind, so to me it sounds like a real person.

Lauler commented 1 year ago

It seems to have trouble with "fenomen", much like the German model. It should be /fɛnʊmˈeːn/ ( https://sv.wiktionary.org/wiki/fenomen ). Is it an espeak-ng issue? It is not perfect on "meteorologiskt" and "nyanser" either.

The rest sounds good.

@mjonsson1986 The model was trained on about 5300 audio clips recorded by a single male Swedish speaker. They were recorded by a Norwegian company called Nordisk Språkteknologi (NST) for the specific purpose of training speech synthesis systems. The company went bankrupt in 2003. After the bankruptcy, its data and technology were acquired by a couple of Norwegian universities and IBM. Later on, the data was made freely available.

One component of this system is espeak-ng, which I think guides the model phonetically, but I'm not sure to what extent piper relies on espeak-ng.

synesthesiam commented 1 year ago

Here's what espeak-ng gives for pronunciations:

echo 'fenomen meteorologiskt nyanser' | xargs -n1 espeak-ng -v sv --ipa=3
fˈeːnuːmən
mˌeːtəˌuːruːlˈoːɡɪskt
nˈyːansər
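
To run the same check over a longer word list, the command above can be wrapped in a small script. This is only a sketch: build_command, run_espeak and phonemize are hypothetical helper names, the -q flag (suppress audio playback) is an addition not present in the command above, and espeak-ng is assumed to be on the PATH.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def build_command(word: str, voice: str = "sv") -> List[str]:
    # -v and --ipa=3 mirror the command above; -q (no audio) is an added assumption.
    return ["espeak-ng", "-q", "-v", voice, "--ipa=3", word]


def run_espeak(cmd: List[str]) -> str:
    # Run espeak-ng and return the IPA string it prints on stdout.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def phonemize(
    words: Iterable[str],
    voice: str = "sv",
    runner: Callable[[List[str]], str] = run_espeak,
) -> Dict[str, str]:
    # One espeak-ng call per word, like `xargs -n1` in the shell version.
    return {word: runner(build_command(word, voice)) for word in words}
```

For example, phonemize(["fenomen", "meteorologiskt", "nyanser"]) returns a dict mapping each word to the IPA string espeak-ng prints for it. The runner argument exists only so that the espeak-ng call can be swapped out, for instance on a machine where it is not installed.
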
synesthesiam commented 8 months ago

A Swedish voice is now available here: https://huggingface.co/rhasspy/piper-voices/tree/main/sv/sv_SE