zdj97 opened this issue 7 months ago
@ylacombe Why did you choose g2p specifically? I had to swap it with espeak-ng phonemizer for Spanish because g2p doesn't support Spanish. Happy to push my changes later in the week.
@ittailup, this work started as a reproduction of the research paper Natural language guidance of high-fidelity text-to-speech with synthetic annotations, which uses g2p! Also, we considered that g2p fulfills our requirements!

@ittailup I am interested in fine-tuning the current model to other languages, i.e., Spanish. Did you use the existing trained model and prompt tokenizer "parler-tts/parler_tts_mini_v0.1", or did you train from scratch with a custom tokenizer for espeak-ng? Thank you for your insights.
@taalua I took the mini_v0.1 checkpoint and fine-tuned it with my dataset. This was my "rate_apply" (written by Claude):
```python
from phonemizer import phonemize
from phonemizer.backend import EspeakBackend

backend = EspeakBackend('es-es', with_stress=True)


def rate_apply(batch, rank=None, audio_column_name="audio", text_column_name="text"):
    # Speaking rate = number of phonemes divided by the audio duration in seconds.
    if isinstance(batch[audio_column_name], list):
        # Batched case: one entry per sample.
        speaking_rates = []
        phonemes_list = []
        for text, audio in zip(batch[text_column_name], batch[audio_column_name]):
            phonemes = phonemize(text, language='es-es', backend='espeak', with_stress=True)

            sample_rate = audio["sampling_rate"]
            audio_length = len(audio["array"].squeeze()) / sample_rate

            speaking_rate = len(phonemes) / audio_length

            speaking_rates.append(speaking_rate)
            phonemes_list.append(phonemes)

        batch["speaking_rate"] = speaking_rates
        batch["phonemes"] = phonemes_list
    else:
        # Non-batched case: a single sample.
        phonemes = phonemize(batch[text_column_name], language='es-es', backend='espeak', with_stress=True)

        sample_rate = batch[audio_column_name]["sampling_rate"]
        audio_length = len(batch[audio_column_name]["array"].squeeze()) / sample_rate

        speaking_rate = len(phonemes) / audio_length

        batch["speaking_rate"] = speaking_rate
        batch["phonemes"] = phonemes

    return batch
```
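For reference, a minimal sketch of how such a `rate_apply` can be applied with `datasets.map` (the dataset name, sampling rate, and batch size below are placeholders, not anything from dataspeech):

```python
# Sketch only: apply rate_apply over a Hugging Face dataset that has
# "audio" and "text" columns. The dataset name is a placeholder.
from datasets import load_dataset, Audio

dataset = load_dataset("my_spanish_corpus", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
dataset = dataset.map(rate_apply, batched=True, batch_size=16)
```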
@taalua I did not have to change the prompt at https://github.com/huggingface/dataspeech/blob/8fd2dd4599d27ba775c6b5c4dff60cd70ee2eb3c/scripts/run_prompt_creation.py#L317
I did add a nationality to the "text description", so "A man" would become "A {country_name} man", but this was a text replacement done after building the initial dataspeech dataset.
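A minimal sketch of that kind of text replacement (the column name "text_description" and the country string are assumptions for illustration, not dataspeech defaults):

```python
# Sketch of the nationality text replacement described above. The column name
# "text_description" and the country string are illustrative assumptions.
country_name = "Spanish"

def add_nationality(example):
    example["text_description"] = example["text_description"].replace(
        "A man", f"A {country_name} man"
    ).replace("A woman", f"A {country_name} woman")
    return example

dataset = dataset.map(add_nationality)
```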
@ittailup Thank you. I appreciate your help. So the tokenizer remains the same, i.e., parler-tts/parler_tts_mini_v0.1. Does fine-tuning work well with Spanish using mini_v0.1?
How much data did you use for fine-tuning, and how many epochs did you need?
Parler gave me the best results of all the pipelines and models I had tested: better than Piper, and easier to train than pflow, vits2, or styletts2. The voice quality with ~15h of speech and 39 epochs was very impressive. Even after 10k steps the quality was probably good enough to stop; we did 54k.
Hey @ittailup, this is great to hear! Would you mind sharing some samples out of curiosity? Also don't hesitate to share the model, if that's something you can do!
Thanks, I tried using your "rate_apply" (not using g2p) and fine-tuned on an Indonesian speech dataset from Common Voice 13. It works, and the result is also good even though I used only 1706 samples.

Here is the result:
https://github.com/huggingface/dataspeech/assets/10645543/91449791-092e-43bc-ba17-b98512d74231
Hey @yoesak, thanks for sharing the sample, it sounds really great! Would you be potentially interested in sharing the model publicly? (also cc @ittailup in case you'd be interested as well!)
Yes, but the model is not stable yet. In my experience, if I use the espeak backend on a larger amount of data I get a memory leak, so I decided to use a custom phoneme module, since I only need the Indonesian language. Soon after I finish the training, I will let you know.
> if I use espeak backend for larger amount data, I got memory leak

This is interesting, have you tried using a traditional LLM tokenizer?
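Something along these lines, as a rough sketch of the idea (the tokenizer checkpoint is arbitrary, and whether token counts track speaking rate well enough is untested here):

```python
# Rough sketch: approximate speaking rate as subword tokens per second instead
# of phonemes per second. The tokenizer checkpoint is only an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

def token_rate(text, audio_duration_s):
    n_tokens = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    return n_tokens / max(audio_duration_s, 0.01)
```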
How is the training going? Let me know if I can help!
> @taalua I took the mini_v0.1 checkpoint and fine tuned it with my dataset. this was my "rate_apply" (written by Claude).
This method had some issues when I used it for the Tamil language. Here is the updated version:
```python
from phonemizer.backend import EspeakBackend

backend = EspeakBackend("ta")


def rate_apply(batch, rank=None, audio_column_name="audio", text_column_name="text"):
    if isinstance(batch[text_column_name], list):
        # Batched case.
        speaking_rates = []
        phonemes_list = []
        if "speech_duration" in batch:
            # Durations are precomputed, so the audio arrays never need decoding.
            for text, audio_duration in zip(
                batch[text_column_name], batch["speech_duration"]
            ):
                phonemes = backend.phonemize([text], strip=True)[0]
                audio_duration = audio_duration if audio_duration != 0 else 0.01
                speaking_rate = len(phonemes) / audio_duration
                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)
        else:
            for text, audio in zip(batch[text_column_name], batch[audio_column_name]):
                phonemes = backend.phonemize([text], strip=True)[0]
                sample_rate = audio["sampling_rate"]
                audio_length = len(audio["array"].squeeze()) / sample_rate
                speaking_rate = len(phonemes) / audio_length
                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)
        batch["speaking_rate"] = speaking_rates
        batch["phonemes"] = phonemes_list
    else:
        # Non-batched case: wrap the single text in a list for the backend.
        phonemes = backend.phonemize([batch[text_column_name]], strip=True)[0]
        if "speech_duration" in batch:
            audio_length = (
                batch["speech_duration"] if batch["speech_duration"] != 0 else 0.01
            )
        else:
            sample_rate = batch[audio_column_name]["sampling_rate"]
            audio_length = (
                len(batch[audio_column_name]["array"].squeeze()) / sample_rate
            )
        speaking_rate = len(phonemes) / audio_length
        batch["speaking_rate"] = speaking_rate
        batch["phonemes"] = phonemes
    return batch
```
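If it helps, the "speech_duration" branch above can be exercised by precomputing durations once, so later passes never re-decode the audio arrays (a sketch; the column names match the function above, the rest is illustrative):

```python
# Sketch: precompute "speech_duration" in seconds once, then rate_apply uses
# its duration branch instead of re-decoding the audio on every pass.
def add_duration(example):
    audio = example["audio"]
    example["speech_duration"] = len(audio["array"]) / audio["sampling_rate"]
    return example

dataset = dataset.map(add_duration)
dataset = dataset.map(rate_apply, batched=True, batch_size=32)
```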
I'm trying to use this code for Japanese:

```python
import pykakasi
from phonemizer.backend import BACKENDS
from phonemizer.separator import Separator

kakasi = pykakasi.kakasi()
# Backend is initiated just once; faster for multi-processing.
backend = BACKENDS["espeak"]("ja", language_switch="remove-flags")
separator = Separator('|', '', ' ')
```
I’m not sure if this works; just sharing.
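For what it's worth, a rough, untested sketch of how those pieces might be combined, romanizing the text with pykakasi before phonemization (using the "hepburn" field as espeak input is an assumption):

```python
# Untested sketch: romanize Japanese text with pykakasi, then phonemize the
# romaji with the espeak backend set up above.
def japanese_phonemes(text):
    romaji = " ".join(item["hepburn"] for item in kakasi.convert(text))
    return backend.phonemize([romaji], separator=separator, strip=True)[0]
```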
Hey @zdj97, at the moment we don't support other languages. However, most of the approaches here are language-agnostic; the only English-specific one I can think of is the speaking rate estimator. The speaking rate is simply computed for now as the number of phonemes divided by the audio length, and the phonemes are obtained with g2p, which is English-specific.

What languages do you have in mind? Would you like to open a PR to add support for other languages? Let me know!
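For reference, a minimal sketch of that English computation (phonemes per second) using the g2p_en package; this is illustrative only, and the exact dataspeech code may differ:

```python
# Illustrative only: English speaking rate as phonemes per second, using the
# g2p_en package. The exact dataspeech implementation may differ.
from g2p_en import G2p

g2p = G2p()

def english_speaking_rate(text, audio_duration_s):
    phonemes = [p for p in g2p(text) if p.strip()]  # drop the space tokens g2p emits
    return len(phonemes) / max(audio_duration_s, 0.01)
```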