mequanent opened this issue 11 months ago
Hey @mequanent! What I'd advise you do in this circumstance is:

1. Train a new tokenizer on your Amharic text data and push it to the Hub
2. Load the new tokenizer and resize the model's token embeddings to cover the new vocabulary:

```python
from transformers import WhisperTokenizer, WhisperForConditionalGeneration

# Replace with the repo id where you've pushed your new tokenizer
tokenizer = WhisperTokenizer.from_pretrained("username/repo-id")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Resize the embedding matrix so it matches the new tokenizer's vocabulary size
model.resize_token_embeddings(len(tokenizer))
```

3. Pass this model to the trainer as before
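For step 1, here is a minimal sketch of training and pushing the new tokenizer (assumptions: a fast tokenizer so that `train_new_from_iterator` is available, a hypothetical in-memory corpus `amharic_texts`, and the placeholder repo id `username/repo-id`; you may need to verify that Whisper's special tokens survive retraining):

```python
from transformers import WhisperTokenizerFast

# Start from the pre-trained tokenizer so the special-token setup is reused
base_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small")

# amharic_texts: an iterable over the Amharic transcriptions in your dataset
amharic_texts = ["...", "..."]  # placeholder

# Retrain the underlying BPE model on the Amharic corpus; vocab_size is a free choice
new_tokenizer = base_tokenizer.train_new_from_iterator(amharic_texts, vocab_size=16000)

# Push to the Hub so it can be loaded as in step 2
new_tokenizer.push_to_hub("username/repo-id")
```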
Dear @sanchit-gandhi, I was following your tutorial, Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers, to fine-tune Whisper on a dataset in the Amharic language. In Whisper's pre-training, Amharic appears only as speech translation [Amharic audio -> corresponding English translation text], so the Amharic characters themselves were never seen during training. The dataset I am trying to fine-tune with is [Amharic audio -> corresponding text in Amharic characters]. It consists of 92.28 hours (32,901 instances) for training and 9.12 hours (3,139 instances) for testing. My data sources are:
I tried the tiny, base, and small model sizes. In my first run with whisper-small I observed poor performance, but when I tried to play around with some parameters, including the model size, I could no longer even run the code. I am not quite sure how to introduce the Amharic characters other than supplying the corresponding text, as I have seen in the Hindi example. I would appreciate your comment on handling a language whose characters were not seen during Whisper training because it was covered as speech translation only. Thank you!
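For reference, the Hindi example in the tutorial configures the processor with the target language and task at load time; the Amharic analogue (a sketch assuming the stock `openai/whisper-small` checkpoint, which already includes an Amharic language token from its translation-only pre-training) would be:

```python
from transformers import WhisperProcessor

# "transcribe" requests same-language output, unlike the translation task that
# Amharic was pre-trained on; language/task are encoded as decoder prompt tokens
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="amharic", task="transcribe"
)
```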