microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing

SpeechT5 - TTS - Tokenizer adding `▁` token between newly added Vietnamese characters #63

Closed GinUTE closed 6 months ago

GinUTE commented 8 months ago

Description

I am fine-tuning the SpeechT5 model for a text-to-speech task on the Vietnamese corpus from Common Voice. I tried to add token mappings for Vietnamese characters with diacritics using this code snippet:

new_tokens = [
    char
    for char in "àáãạảăắằẳẵặâấầẩẫậèéẹẻẽêềếểễệđìíĩỉịòóõọỏôốồổỗộơớờởỡợùúũụủưứừửữựỳỵỷỹýÀÁÃẠẢĂẮẰẲẴẶÂẤẦẨẪẬÈÉẸẺẼÊỀẾỂỄỆĐÌÍĨỈỊÒÓÕỌỎÔỐỒỔỖỘƠỚỜỞỠỢÙÚŨỤỦƯỨỪỬỮỰỲỴỶỸÝ"
]
tokenizer.add_tokens(new_tokens)
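# Grow the model's token embedding matrix to cover the newly added tokens
# (padded to a multiple of 8).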
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)

This successfully added the new tokens to the vocabulary. However, when I encoded the input text and decoded the resulting input_ids back into a string, extra spaces appeared, but only before and after the newly added tokens. Inspecting the input_ids reveals extra tokens with id = 4, which corresponds to `▁` in the vocabulary.

Here are the input_ids:

tensor([[4, 35, 118, 4, 18, 4, 9, 7, 22, 4, 25, 84, 4, 9, 4, 17, 114, 4, 28, 11, 117, 4, 5, 4, 28, 11, 118, 4, 9, 21, 41, 2]])

And the decoded string:

H ô m nay b ạ n c ó kh ỏ e kh ô ng?</s>

I expect it to be more like this:

Hôm nay bạn có khỏe không?</s>
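
To see what the stray id maps to, it can be converted back to its piece with the standard tokenizer API. This is just a minimal check; based on the vocabulary, I would expect it to print the `▁` piece:

# Minimal check: map id 4 back to its piece and inspect the full tokenization.
print(tokenizer.convert_ids_to_tokens([4]))
print(tokenizer.convert_ids_to_tokens(tokenizer("Hôm nay bạn có khỏe không?")["input_ids"]))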

Questions

  1. Is this expected behavior? I understand that I may not be able to round-trip the encoding and recover the exact original string. If that is the case, would it affect the speech synthesis in any way?
  2. If this is not normal behavior and is more likely a mistake on my part, how can I rectify it?

To reproduce

!pip3 install sentencepiece
!pip3 install git+https://github.com/huggingface/transformers.git

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

tokenizer = processor.tokenizer
new_tokens = [
    char
    for char in "àáãạảăắằẳẵặâấầẩẫậèéẹẻẽêềếểễệđìíĩỉịòóõọỏôốồổỗộơớờởỡợùúũụủưứừửữựỳỵỷỹýÀÁÃẠẢĂẮẰẲẴẶÂẤẦẨẪẬÈÉẸẺẼÊỀẾỂỄỆĐÌÍĨỈỊÒÓÕỌỎÔỐỒỔỖỘƠỚỜỞỠỢÙÚŨỤỦƯỨỪỬỮỰỲỴỶỸÝ"
]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)

text = "Hôm nay bạn có khỏe không?"
tokenizer.decode(tokenizer(text)["input_ids"])

Any insight is greatly appreciated.

GinUTE commented 8 months ago

Thank you for your explanation. I understand the problem a little bit better now. I still have two questions.

Does this oversight on my part affect how the model learns and the quality of the synthesized speech? As far as I can tell, the model is currently receiving these extra tokens in the input_ids. I proceeded with the fine-tuning anyway and achieved respectable results: the synthesized speech is coherent but could definitely be better. So far, I have only done subjective evaluations.
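
As a rough check that these extra pieces really do end up in the training inputs, something like the following could be run; the sentences here are illustrative, not taken from my corpus:

# Rough check on a few illustrative sentences: count how many pieces with
# id 4 the tokenizer inserts into each encoding.
samples = ["Hôm nay bạn có khỏe không?", "Tôi đang học tiếng Việt."]
for s in samples:
    ids = tokenizer(s)["input_ids"]
    print(s, "->", sum(1 for i in ids if i == 4), "pieces with id 4")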

I also want to resolve this tokenizer issue to see if there is any accuracy gain. I believe I need to add a Vietnamese word lexicon to the tokenizer and resize the model's token embedding layer accordingly, as sketched below. Is this a viable solution? If not, do you have any other suggestions?
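
For illustration, here is a minimal sketch of what I have in mind, with a hypothetical handful of words standing in for a real lexicon; whether this actually removes the stray `▁` pieces is something I would still need to verify:

# Minimal sketch: add whole words (an illustrative list, not a real lexicon)
# instead of single characters, then check the round trip again.
lexicon = ["Hôm", "nay", "bạn", "có", "khỏe", "không"]
tokenizer.add_tokens(lexicon)
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)

ids = tokenizer("Hôm nay bạn có khỏe không?")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids))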

I have attached the training and validation loss curves from fine-tuning: loss.pdf