huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

Sharing results of Korean (+English) bilingual training #67

Open choihk6610 opened 5 months ago

choihk6610 commented 5 months ago

Hello,

Based on your code, I added Korean tokens (using a Korean emotional dataset) to the tokenizer and fine-tuned the model with the LibriTTS-R dataset. The Korean dataset is slightly less than 300 hours, similar in size to LibriTTS-R but with fewer speakers. I did not perform separate emotion classification.

I wanted to share the wandb report, but due to security reasons, I am unable to do so. Instead, I am providing the training and evaluation curves as images and some audio files as a compressed file.

Specifically, no preprocessing was applied to the sentences in the Korean dataset. The performance turned out to be better than expected, so I believe that if Korean were included in the pretraining stage, it could yield excellent results in the fine-tuning stage. I hope that Korean will be included in the next version of the model.

Thank you.

[Images: training and evaluation loss curves]

samples.zip

chunping-xt commented 5 months ago

Which tokenizer are you using? If it is flan-t5, then I think it does not support Korean. I'm also facing the problem that flan-t5 doesn't support my language:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    tokenizer.decode(tokenizer("어떻게 지내세요?").input_ids)
    # '<unk> <unk>?'

choihk6610 commented 5 months ago

@chunping-xt I mentioned that I added Korean tokens to the tokenizer and used them in the text.

chunping-xt commented 5 months ago

@choihk6610 I'd appreciate it if you could explain how to add tokens for a language that flan-t5 doesn't support; I'm not very comfortable reading the code. Thanks!

choihk6610 commented 5 months ago

@chunping-xt It might be helpful to refer to the add_special_tokens() function from the following link: https://huggingface.co/docs/transformers/main_classes/tokenizer
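Roughly, something like the minimal sketch below (the token list is only a hypothetical placeholder; in practice it should come from your own Korean text or another Korean tokenizer's vocabulary):

    from transformers import AutoTokenizer

    # Load the flan-t5-based tokenizer shipped with Parler-TTS Mini v0.1.
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

    # Hypothetical Korean tokens -- replace with subwords extracted from your corpus.
    korean_tokens = ["안녕", "하세요", "감사", "합니다"]
    num_added = tokenizer.add_tokens(korean_tokens)
    print(f"added {num_added} tokens, new vocab size: {len(tokenizer)}")

    # Save it so it can be passed to the training script via
    # --prompt_tokenizer_name / --description_tokenizer_name.
    tokenizer.save_pretrained("./flan_t5_with_korean")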

chunping-xt commented 5 months ago

@choihk6610, I added tokens to flan-t5, but I got an error while training. Can you share how you used the tokenizer after adding tokens? I don't know how to call resize_token_embeddings after adding the new tokens.

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")
    x_unk = ['new_tok', ...]
    tokenizer.add_tokens(x_unk)
    tokenizer.save_pretrained('/mnt/f/parler-tts/my_Flan-T5')

and the training command:

    %cd /mnt/f/parler-tts
    !accelerate launch ./training/run_parler_tts_training.py \
        --model_name_or_path "parler-tts/parler_tts_mini_v0.1" \
        --feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
        --description_tokenizer_name "/mnt/f/parler-tts/my_Flan-T5" \
        --prompt_tokenizer_name "/mnt/f/parler-tts_v1/my_Flan-T5" \
        ...

which produced many lines like:

    ../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [495,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

sanchit-gandhi commented 5 months ago

Awesome results @choihk6610! It's great to hear you were successfully able to fine-tune the base English model for Korean with these steps. The eval loss curve looks great. You can also experiment with a cosine scheduler, which we found to improve performance slightly:

    --lr_scheduler_type "cosine" \

We shared the samples with our resident Korean speaker, @ArthurZucker, who said:

> ko2 sounds really good in terms of musicality

The next version of Parler (v1) might contain some other high-resource European languages, which should improve multilingual fine-tuning further. If we devise a good data cleaning pipeline, we could also clean Yodas2 and train a larger multilingual checkpoint as a follow-up (v2).

lyt719 commented 5 months ago

> @chunping-xt It might be helpful to refer to the add_special_tokens() function from the following link: https://huggingface.co/docs/transformers/main_classes/tokenizer

May I ask where your Korean vocab comes from? Did you add all of the vocabulary from other Korean models to Flan-T5?

chunping-xt commented 5 months ago

@lyt719, I want to fine-tune with a language other than Korean. The problem is that when the vocab size changes, the model's embeddings need to be resized, and I don't know how to do that. Anyway, I chose another solution and it has worked great with Parler: with less than 5 hours of training on ~12 hours of data (and no prompt field), I got a TTS model. It's surprising how fast fine-tuning is. There are two things I wish I had the capacity to experiment with:

  1. Use the mT5 tokenizer for multilingual text. Its vocab is much larger than flan-T5's, so the same amount of audio maps to fewer text tokens; I'm not sure whether training performance would be worse than with flan-t5, it needs an experiment to check (a quick comparison is sketched right after this list).
  2. Use an encoder/decoder model instead of the current decoder. The reason is that encoder+decoder offers much better control: when training NLP models I compared GPT (decoder) and T5/BART (encoder/decoder) and found T5 superior in controllability.
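As a quick, untested sanity check of the first point, one could compare how the two tokenizers split the same sentence (a sketch, assuming the standard google/mt5-base checkpoint):

    from transformers import AutoTokenizer

    sentence = "어떻게 지내세요?"  # the Korean example from earlier in this thread

    flan_t5 = AutoTokenizer.from_pretrained("google/flan-t5-base")
    mt5 = AutoTokenizer.from_pretrained("google/mt5-base")

    # mT5's SentencePiece vocab (~250k entries) covers many more scripts than
    # flan-t5's, so it typically yields real subwords instead of <unk>.
    print("flan-t5:", flan_t5.tokenize(sentence))
    print("mT5:    ", mt5.tokenize(sentence))
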
lyt719 commented 5 months ago

> @lyt719, I want to fine-tune with a language other than Korean. The problem is that when the vocab size changes, the model's embeddings need to be resized, and I don't know how to do that. Anyway, I chose another solution and it has worked great with Parler: with less than 5 hours of training on ~12 hours of data (and no prompt field), I got a TTS model. It's surprising how fast fine-tuning is. There are two things I wish I had the capacity to experiment with:
>
>   1. Use the mT5 tokenizer for multilingual text. Its vocab is much larger than flan-T5's, so the same amount of audio maps to fewer text tokens; I'm not sure whether training performance would be worse than with flan-t5, it needs an experiment to check.
>   2. Use an encoder/decoder model instead of the current decoder. The reason is that encoder+decoder offers much better control: when training NLP models I compared GPT (decoder) and T5/BART (encoder/decoder) and found T5 superior in controllability.

Thanks! And do you apply these two ideas by fine-tuning Mini v0.1 or by pretraining from scratch? I tried, but I couldn't switch the tokenizer to any other one during fine-tuning; it generates a lot of errors.

chunping-xt commented 5 months ago

@lyt719 Not yet; it's my dream, and I'm waiting for someone who cares to implement it. The only thing I did was use the existing flan-t5 tokens to represent my language.

lyt719 commented 5 months ago

> @lyt719 Not yet; it's my dream, and I'm waiting for someone who cares to implement it. The only thing I did was use the existing flan-t5 tokens to represent my language.

Me too. So how do you handle a language that the flan-t5 tokenizer can't encode?

chunping-xt commented 5 months ago

@lyt719, for example: suppose your language consisted only of the digits 0-9, and flan-t5 did not support numeric symbols but did support the characters a-z. You could then define a mapping from digits to characters, so your language is represented with symbols the tokenizer already knows. A small note: the training script uses Whisper to compute word prediction accuracy on generated samples; fortunately the mapping does not affect training quality, it only affects that monitoring metric.
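A tiny sketch of that idea (the digit-to-character mapping here is purely hypothetical; a real mapping must also avoid colliding with characters your text actually uses):

    # Purely hypothetical mapping from unsupported symbols to characters flan-t5 knows.
    DIGIT_TO_CHAR = {"0": "a", "1": "b", "2": "c", "3": "d", "4": "e",
                     "5": "f", "6": "g", "7": "h", "8": "i", "9": "j"}
    CHAR_TO_DIGIT = {v: k for k, v in DIGIT_TO_CHAR.items()}

    def encode_for_tokenizer(text: str) -> str:
        """Rewrite unsupported symbols before feeding text to the tokenizer."""
        return "".join(DIGIT_TO_CHAR.get(ch, ch) for ch in text)

    def decode_from_tokenizer(text: str) -> str:
        """Invert the mapping, e.g. on Whisper transcripts used for monitoring."""
        return "".join(CHAR_TO_DIGIT.get(ch, ch) for ch in text)

    print(encode_for_tokenizer("room 42"))   # -> "room ec"
    print(decode_from_tokenizer("room ec"))  # -> "room 42"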

manigandanp commented 3 months ago

> @choihk6610, I added tokens to flan-t5, but I got an error while training. Can you share how you used the tokenizer after adding tokens? I don't know how to call resize_token_embeddings after adding the new tokens.
>
>     tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
>     tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")
>     x_unk = ['new_tok', ...]
>     tokenizer.add_tokens(x_unk)
>     tokenizer.save_pretrained('/mnt/f/parler-tts/my_Flan-T5')
>
> and the training command:
>
>     %cd /mnt/f/parler-tts
>     !accelerate launch ./training/run_parler_tts_training.py \
>         --model_name_or_path "parler-tts/parler_tts_mini_v0.1" \
>         --feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
>         --description_tokenizer_name "/mnt/f/parler-tts/my_Flan-T5" \
>         --prompt_tokenizer_name "/mnt/f/parler-tts_v1/my_Flan-T5" \
>         ...
>
> which produced many lines like:
>
>     ../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [495,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

@chunping-xt have you tried resizing the model's token embeddings to match the new tokenizer?

    model.resize_token_embeddings(len(tokenizer))

You may have already found a solution, but I'm sharing this for anyone else who is attempting this for the first time. If you've discovered alternative methods for fine-tuning this model on a non-English dataset, please share your strategy and how you accomplished it.
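For anyone hitting the CUDA assertion above, here is a minimal sketch of the general transformers pattern, illustrated with plain flan-t5 rather than the full Parler-TTS wrapper (whose submodules may need to be resized individually, so treat this as an assumption to verify): after adding tokens, grow the embedding matrix of whichever module consumes the new token ids before training.

    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    encoder = T5EncoderModel.from_pretrained("google/flan-t5-base")

    # Hypothetical new tokens for a language flan-t5 does not cover.
    tokenizer.add_tokens(["새로운", "토큰"])

    # Grow the embedding matrix so the new token ids are valid indices.
    # Without this, any id >= the original vocab size triggers the CUDA
    # "srcIndex < srcSelectDimSize" assertion shown earlier in this thread.
    encoder.resize_token_embeddings(len(tokenizer))

    tokenizer.save_pretrained("./my_flan_t5_extended")
    encoder.save_pretrained("./my_flan_t5_extended")

Whether the Parler-TTS training script will pick up an encoder resized and saved this way is something I haven't verified, so please treat it as a starting point only.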