neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality
Apache License 2.0
13.2k stars 1.82k forks

How to customize for another language #567

Open rohanjhanepal opened 1 year ago

rohanjhanepal commented 1 year ago

I am not able to customize it. How can I customize it for another language?

Mayank-Sharma-27 commented 1 year ago

Hey did you try to figure this out? I want to customise for Hindi

ctimict commented 1 year ago

I'm also trying to customize it for other languages but can't find any documentation. Can someone help me, please?

merolaika commented 1 year ago

Adding a new voice

To add new voices to Tortoise, you will need to do the following:

1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
3. Save the clips as WAV files in floating point format with a 22,050 Hz sample rate.
4. Create a subdirectory in voices/ and put your clips in that subdirectory.
5. Run tortoise utilities with --voice=<your_subdirectory_name>.
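Steps 2–3 above can be sketched in a few lines. A minimal example assuming numpy and scipy are installed; the function name `prepare_clip` and the `voices/myvoice` path are illustrative, not from the repo:

```python
import os
from math import gcd

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

TARGET_SR = 22050  # sample rate Tortoise expects for reference clips

def prepare_clip(samples, orig_sr, out_path):
    """Resample a mono clip to 22,050 Hz and save it as a 32-bit float WAV."""
    if orig_sr != TARGET_SR:
        # resample_poly wants an integer up/down ratio; reduce by the gcd
        g = gcd(TARGET_SR, orig_sr)
        samples = resample_poly(samples, TARGET_SR // g, orig_sr // g)
    samples = np.asarray(samples, dtype=np.float32)
    # a float32 array makes scipy write an IEEE-float WAV, as required
    wavfile.write(out_path, TARGET_SR, samples)
    return samples

# usage: a clip originally recorded at 44.1 kHz
os.makedirs("voices/myvoice", exist_ok=True)
clip = np.random.randn(44100 * 10)  # stand-in for 10 s of real speech
prepare_clip(clip, 44100, "voices/myvoice/clip01.wav")
```

After saving all clips this way, `tortoise/do_tts.py --voice=myvoice` would pick up the new directory.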

Picking good reference clips

As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking good clips:

- Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them.
- Avoid speeches. These generally have distortion caused by the amplification system.
- Avoid clips from phone calls.
- Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
- Try to find clips that are spoken in such a way as you wish your output to sound like. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
- The text being spoken in the clips does not matter, but diverse text does seem to perform better.
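Content problems like music or reverb still need a human ear, but the format requirements (22,050 Hz, floating point, ~10 s) can be sanity-checked automatically. A small sketch assuming numpy/scipy; `audit_voice_dir` and the duration thresholds are my own, not from the repo:

```python
import os

import numpy as np
from scipy.io import wavfile

def audit_voice_dir(voice_dir):
    """Return {filename: [issues]} for every WAV clip in a voices/ subdirectory."""
    report = {}
    for name in sorted(os.listdir(voice_dir)):
        if not name.endswith(".wav"):
            continue
        rate, data = wavfile.read(os.path.join(voice_dir, name))
        issues = []
        if rate != 22050:
            issues.append(f"sample rate {rate} != 22050")
        if data.dtype.kind != "f":
            issues.append(f"{data.dtype} samples, not floating point")
        duration = len(data) / rate
        if not 5.0 <= duration <= 15.0:  # clips should be roughly 10 s
            issues.append(f"duration {duration:.1f}s, aim for ~10s")
        report[name] = issues
    return report

# usage: write one conforming clip, then audit the folder
os.makedirs("voices/demo", exist_ok=True)
tone = (0.1 * np.sin(np.linspace(0, 2000, 22050 * 10))).astype(np.float32)
wavfile.write("voices/demo/clip01.wav", 22050, tone)
audit_voice_dir("voices/demo")  # conforming clips map to an empty issue list
```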

manmay-nakhashi commented 1 year ago

Just gather a lot of data in the target language (~10k hours), train your own BPE tokenizer, and fine-tune the autoregressive model from DL-Art-School.
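The tokenizer step can be sketched with the Hugging Face `tokenizers` library. A minimal example, not the project's exact recipe: the special-token names mirror Tortoise's English tokenizer file, but verify them against the `tokenizer.json` shipped with the repo before training:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe_tokenizer(text_iter, vocab_size=256, out_path="custom_tokenizer.json"):
    """Train a byte-pair-encoding tokenizer on transcripts in the target language."""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,  # 256 matches the stock Tortoise text vocab
        special_tokens=["[STOP]", "[UNK]", "[SPACE]"],  # assumed from the English tokenizer
    )
    tokenizer.train_from_iterator(text_iter, trainer)
    tokenizer.save(out_path)
    return tokenizer

# usage: transcripts would normally be streamed from your dataset's label files
tok = train_bpe_tokenizer(["namaste duniya", "yeh ek udaharan hai"])
```

In practice you would feed it every transcript in your dataset, then point the DLAS training config at the saved JSON file.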

Fizikaz commented 1 year ago

> Just gather a lot of data in the target language (~10k hours), train your own BPE tokenizer, and fine-tune the autoregressive model from DL-Art-School.

Have you done it? Is there any YT tutorial on this?

aklacar1 commented 11 months ago

@manmay-nakhashi Quick question: every tokenizer I train results in gibberish after training with DLAS. I am trying to create a tokenizer that works; any tips here? The only tokenizer that works for me is https://huggingface.co/AOLCDROM/Tortoise-TTS-de/tree/main, but it is not good enough.

aklacar1 commented 11 months ago

@manmay-nakhashi What would be needed to use a larger tokenizer with DLAS and Tortoise, e.g. 512 tokens?

manmay-nakhashi commented 11 months ago

@aklacar1 You might need to modify a few things to support a larger tokenizer, but keeping the tokenizer at 256 will work out of the box if you have enough data (~10k hours).
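One of those "few things": going past 256 tokens means the autoregressive model's text embedding (and matching output head) no longer line up with the pretrained checkpoint. A PyTorch sketch of widening an embedding while keeping the pretrained rows; whether this suffices for Tortoise's UnifiedVoice is an assumption to verify against the checkpoint:

```python
import torch
import torch.nn as nn

def widen_embedding(old_emb: nn.Embedding, new_num_tokens: int) -> nn.Embedding:
    """Copy pretrained rows into a larger embedding; extra rows stay freshly initialized."""
    new_emb = nn.Embedding(new_num_tokens, old_emb.embedding_dim)
    with torch.no_grad():
        new_emb.weight[: old_emb.num_embeddings] = old_emb.weight
    return new_emb

# usage: grow a 256-token text embedding to 512 tokens before fine-tuning
old = nn.Embedding(256, 1024)
new = widen_embedding(old, 512)
```

The text head (the linear layer projecting back to token logits) would need the same treatment, and the new vocab size has to be reflected in the DLAS model config.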

Fizikaz commented 11 months ago

> Just gather a lot of data in the target language (~10k hours), train your own BPE tokenizer, and fine-tune the autoregressive model from DL-Art-School.

How many speakers should the dataset strive for? Would 1k hours of data be enough? Also, it is not clear how the splitting should be done: should the audio be split by sentence, into 10 s chunks, or at silences?

GuenKainto commented 10 months ago

Hello, can I ask how to create a tokenizer file for Japanese? Japanese text mixes in kanji within sentences and words. I found a simple tokenizer file containing hiragana and katakana; I think I can use it, but it has few "merges" entries and few kanji. File link: https://git.ecker.tech/mrq/ai-voice-cloning/src/branch/master/models/tokenizers/japanese.json

GuenKainto commented 10 months ago

> @aklacar1 you might need to modify a few things to support a larger tokenizer, but keeping the tokenizer at 256 will work out of the box if you have enough data ~10k hours.

Hi, I tried to create a tokenizer for Japanese, but it has a vocab_size of 3000 (from 7,696 lines of text). What should I modify for training? Thank you.

super-animo commented 7 months ago

Can anyone share output for Hindi cloned audio?