Thank you for your explanation. I understand the problem a little bit better now. I still have two questions.
Does this oversight on my part affect how the model learns and the quality of the synthesized speech? As far as I can tell, the model is currently receiving these ▁ tokens in the input_ids. I did proceed with the fine-tuning and achieved respectable results; the synthesized speech is coherent but could definitely be better. So far, I have only done subjective evaluations.
I also want to resolve this tokenizer issue to see if there is any accuracy gain. I believe I need to add a Vietnamese word lexicon to the tokenizer and resize the model's token embedding layer accordingly. Is this a viable solution? If not, do you have any other suggestions?
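Something along these lines is what I have in mind. This is only a rough sketch: the word list is a placeholder, the checkpoint name is just the base TTS checkpoint, and I am assuming resize_token_embeddings is supported for SpeechT5ForTextToSpeech.

```python
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor

# Checkpoint name assumed for illustration.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Placeholder lexicon; in practice it would be built from the unique words
# (or syllables) in the Common Voice Vietnamese transcripts.
lexicon = ["xin", "chào", "cảm", "ơn", "việt", "nam"]
num_added = processor.tokenizer.add_tokens(lexicon)

# Grow the embedding table so the newly added ids have trainable embeddings.
model.resize_token_embeddings(len(processor.tokenizer))
```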
I have attached the training and validation loss curves from fine-tuning: loss.pdf
Description
I am fine-tuning the SpeechT5 model for a text-to-speech task on the Vietnamese corpus from Common Voice. I tried to add token mappings for Vietnamese characters with diacritics using a code snippet that, roughly, does the following:
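(The character list below is just an illustrative subset of what I added, and the checkpoint name is only there to make the example self-contained.)

```python
from transformers import SpeechT5Processor

# Checkpoint name used for illustration.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

# Illustrative subset of Vietnamese characters with diacritics.
vietnamese_chars = ["à", "ả", "ã", "ạ", "ề", "ế", "ệ", "ồ", "ố", "ộ"]

# add_tokens registers them as whole tokens and returns how many were new.
num_added = processor.tokenizer.add_tokens(vietnamese_chars)
print(f"Added {num_added} tokens")
```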
This successfully added the new tokens to the vocabulary. However, when I encoded the input text and passed its input_ids back into the decoder, spaces appeared in the resulting string, but only before and after the newly added tokens. Inspecting the input_ids reveals extra tokens with id = 4, which is ▁ according to the vocabulary.
Here are the input_ids:
And the decoded string:
I expect it to be more like this:
Questions
To reproduce
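A rough sketch of a script that reproduces the behavior (the checkpoint name, the added characters, and the input sentence are all illustrative):

```python
from transformers import SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
processor.tokenizer.add_tokens(["à", "ả", "ạ"])  # illustrative subset of the added tokens

text = "xin chào"  # illustrative Vietnamese input
input_ids = processor(text=text, return_tensors="pt")["input_ids"][0].tolist()

print(input_ids)                              # extra tokens with id = 4 ("▁") appear around the added characters
print(processor.tokenizer.decode(input_ids))  # stray spaces show up around those characters
```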
Any insight is greatly appreciated.