zdj97 opened this issue 7 months ago
@ylacombe Why did you choose g2p specifically? I had to swap it with espeak-ng phonemizer for Spanish because g2p doesn't support Spanish. Happy to push my changes later in the week.
@ittailup, this work started as a reproduction of the research paper Natural language guidance of high-fidelity text-to-speech with synthetic annotations, which uses g2p! Also, we considered that g2p fulfills our requirements!

@ittailup I am interested in fine-tuning the current model to other languages, i.e., Spanish. Did you use the existing trained model and prompt tokenizer "parler-tts/parler_tts_mini_v0.1", or did you train from scratch with a custom tokenizer for espeak-ng? Thank you for your insights.
@taalua I took the mini_v0.1 checkpoint and fine-tuned it with my dataset. This was my "rate_apply" (written by Claude):
```python
from phonemizer import phonemize
from phonemizer.backend import EspeakBackend

backend = EspeakBackend('es-es', with_stress=True)


def rate_apply(batch, rank=None, audio_column_name="audio", text_column_name="text"):
    # Speaking rate = number of phonemes divided by the audio duration in seconds.
    if isinstance(batch[audio_column_name], list):
        # Batched case: one entry per sample.
        speaking_rates = []
        phonemes_list = []
        for text, audio in zip(batch[text_column_name], batch[audio_column_name]):
            phonemes = phonemize(text, language='es-es', backend='espeak', with_stress=True)

            sample_rate = audio["sampling_rate"]
            audio_length = len(audio["array"].squeeze()) / sample_rate

            speaking_rate = len(phonemes) / audio_length

            speaking_rates.append(speaking_rate)
            phonemes_list.append(phonemes)

        batch["speaking_rate"] = speaking_rates
        batch["phonemes"] = phonemes_list
    else:
        # Non-batched case: a single sample.
        phonemes = phonemize(batch[text_column_name], language='es-es', backend='espeak', with_stress=True)

        sample_rate = batch[audio_column_name]["sampling_rate"]
        audio_length = len(batch[audio_column_name]["array"].squeeze()) / sample_rate

        speaking_rate = len(phonemes) / audio_length

        batch["speaking_rate"] = speaking_rate
        batch["phonemes"] = phonemes

    return batch
```
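For reference, a minimal sketch of how such a `rate_apply` can be applied with `datasets.map` (the dataset name, sampling rate, and batch size below are placeholders, not anything from dataspeech):

```python
# Sketch only: apply rate_apply over a Hugging Face dataset that has
# "audio" and "text" columns. The dataset name is a placeholder.
from datasets import load_dataset, Audio

dataset = load_dataset("my_spanish_corpus", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
dataset = dataset.map(rate_apply, batched=True, batch_size=16)
```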
@taalua I did not have to change the prompt at https://github.com/huggingface/dataspeech/blob/8fd2dd4599d27ba775c6b5c4dff60cd70ee2eb3c/scripts/run_prompt_creation.py#L317
I did add a nationality to the "text description", so "A man" would become "A {country_name} man", but this was a text replacement done after building the initial dataspeech dataset.
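A minimal sketch of that kind of text replacement (the column name "text_description" and the country string are assumptions for illustration, not dataspeech defaults):

```python
# Sketch of the nationality text replacement described above. The column name
# "text_description" and the country string are illustrative assumptions.
country_name = "Spanish"

def add_nationality(example):
    example["text_description"] = example["text_description"].replace(
        "A man", f"A {country_name} man"
    ).replace("A woman", f"A {country_name} woman")
    return example

dataset = dataset.map(add_nationality)
```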
@ittailup Thank you. I appreciate your help. So the tokenizer remains the same, i.e., parler-tts/parler_tts_mini_v0.1. Does fine-tuning work well with Spanish using mini_v0.1?
How much data did you use for fine-tuning, and how many epochs did you need?
Parler gave me the best results of all the pipelines and models I had tested: better than Piper, and easier to train than pflow, vits2, or styletts2. The voice quality with ~15h of speech and 39 epochs was very impressive. Even after 10k steps the quality was probably good enough to stop; we did 54k.
Hey @ittailup, this is great to hear! Would you mind sharing some samples out of curiosity? Also don't hesitate to share the model, if that's something you can do!
Thanks, I tried using your "rate_apply" (not using g2p) and fine-tuned on an Indonesian speech dataset from Common Voice 13. It works, and the result is also good even though I used only 1706 samples.

Here is the result:
https://github.com/huggingface/dataspeech/assets/10645543/91449791-092e-43bc-ba17-b98512d74231
Hey @yoesak, thanks for sharing the sample, it sounds really great! Would you be potentially interested in sharing the model publicly? (also cc @ittailup in case you'd be interested as well!)
Yes, but the model is not stable yet. In my experience, if I use the espeak backend on a larger amount of data I get a memory leak, so I decided to use a custom phoneme module, since I only need the Indonesian language. Soon after I finish the training, I will let you know.
> if I use espeak backend for larger amount data, I got memory leak

This is interesting, have you tried using a traditional LLM tokenizer?
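Something along these lines, as a rough sketch of the idea (the tokenizer checkpoint is arbitrary, and whether token counts track speaking rate well enough is untested here):

```python
# Rough sketch: approximate speaking rate as subword tokens per second instead
# of phonemes per second. The tokenizer checkpoint is only an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

def token_rate(text, audio_duration_s):
    n_tokens = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    return n_tokens / max(audio_duration_s, 0.01)
```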
How is the training going? Let me know if I can help!
> @taalua I took the mini_v0.1 checkpoint and fine tuned it with my dataset. this was my "rate_apply" (written by Claude).
This method had some issues when I used it for the Tamil language. Here is the updated version:
```python
from phonemizer.backend import EspeakBackend

backend = EspeakBackend("ta")


def rate_apply(batch, rank=None, audio_column_name="audio", text_column_name="text"):
    if isinstance(batch[text_column_name], list):
        # Batched case.
        speaking_rates = []
        phonemes_list = []
        if "speech_duration" in batch:
            # Durations are precomputed, so the audio arrays never need decoding.
            for text, audio_duration in zip(
                batch[text_column_name], batch["speech_duration"]
            ):
                phonemes = backend.phonemize([text], strip=True)[0]
                audio_duration = audio_duration if audio_duration != 0 else 0.01
                speaking_rate = len(phonemes) / audio_duration
                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)
        else:
            for text, audio in zip(batch[text_column_name], batch[audio_column_name]):
                phonemes = backend.phonemize([text], strip=True)[0]
                sample_rate = audio["sampling_rate"]
                audio_length = len(audio["array"].squeeze()) / sample_rate
                speaking_rate = len(phonemes) / audio_length
                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)
        batch["speaking_rate"] = speaking_rates
        batch["phonemes"] = phonemes_list
    else:
        # Non-batched case: wrap the single text in a list for the backend.
        phonemes = backend.phonemize([batch[text_column_name]], strip=True)[0]
        if "speech_duration" in batch:
            audio_length = (
                batch["speech_duration"] if batch["speech_duration"] != 0 else 0.01
            )
        else:
            sample_rate = batch[audio_column_name]["sampling_rate"]
            audio_length = (
                len(batch[audio_column_name]["array"].squeeze()) / sample_rate
            )
        speaking_rate = len(phonemes) / audio_length
        batch["speaking_rate"] = speaking_rate
        batch["phonemes"] = phonemes
    return batch
```
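If it helps, the "speech_duration" branch above can be exercised by precomputing durations once, so later passes never re-decode the audio arrays (a sketch; the column names match the function above, the rest is illustrative):

```python
# Sketch: precompute "speech_duration" in seconds once, then rate_apply uses
# its duration branch instead of re-decoding the audio on every pass.
def add_duration(example):
    audio = example["audio"]
    example["speech_duration"] = len(audio["array"]) / audio["sampling_rate"]
    return example

dataset = dataset.map(add_duration)
dataset = dataset.map(rate_apply, batched=True, batch_size=32)
```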
I'm trying to use this code for Japanese:

```python
import pykakasi
from phonemizer.backend import BACKENDS
from phonemizer.separator import Separator

kakasi = pykakasi.kakasi()
# Backend is initiated just once; faster for multi-processing.
backend = BACKENDS["espeak"]("ja", language_switch="remove-flags")
separator = Separator('|', '', ' ')
```
I’m not sure if this works; just sharing.
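For what it's worth, a rough, untested sketch of how those pieces might be combined, romanizing the text with pykakasi before phonemization (using the "hepburn" field as espeak input is an assumption):

```python
# Untested sketch: romanize Japanese text with pykakasi, then phonemize the
# romaji with the espeak backend set up above.
def japanese_phonemes(text):
    romaji = " ".join(item["hepburn"] for item in kakasi.convert(text))
    return backend.phonemize([romaji], separator=separator, strip=True)[0]
```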
Hey @zdj97, at the moment we don't support other languages. However, most of the approaches here are language-agnostic; the only English-specific one I can think of is the speaking rate estimator. The speaking rate is simply computed for now as the number of phonemes divided by the audio length, and the phonemes are obtained with g2p, which is English-specific.

What languages do you have in mind? Would you like to open a PR to add support for other languages? Let me know!
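For reference, a minimal sketch of that English computation (phonemes per second) using the g2p_en package; this is illustrative only, and the exact dataspeech code may differ:

```python
# Illustrative only: English speaking rate as phonemes per second, using the
# g2p_en package. The exact dataspeech implementation may differ.
from g2p_en import G2p

g2p = G2p()

def english_speaking_rate(text, audio_duration_s):
    phonemes = [p for p in g2p(text) if p.strip()]  # drop the space tokens g2p emits
    return len(phonemes) / max(audio_duration_s, 0.01)
```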