mequanent opened this issue 7 months ago
Hey @mequanent! What I'd advise you to do in this circumstance is:

1. Load your extended tokenizer (the one you've pushed to the Hub) together with the pretrained model:

```python
from transformers import WhisperTokenizer, WhisperForConditionalGeneration

# replace with the repo id where you've pushed your new tokenizer
tokenizer = WhisperTokenizer.from_pretrained("username/repo-id")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
```

2. Resize the model's token embeddings to match the new tokenizer's vocabulary:

```python
model.resize_token_embeddings(len(tokenizer))
```

3. Pass this model to the trainer as before.
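For context on step 1, "extending the tokenizer" would typically mean adding the unseen Amharic characters as new tokens before pushing the tokenizer to the Hub. Here is a minimal sketch of one way to do that; `train_texts` and the repo id are hypothetical placeholders, and the specific strategy (adding raw characters via `add_tokens`) is an assumption rather than the tutorial's prescribed method:

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")

# train_texts is a hypothetical list of your Amharic transcripts.
train_texts = ["ሰላም ለዓለም", "..."]

# Collect the distinct characters in the transcripts and register them
# as new tokens; add_tokens skips anything already in the vocabulary
# and returns the number of tokens actually added.
new_chars = sorted({ch for text in train_texts for ch in text})
num_added = tokenizer.add_tokens(new_chars)
print(f"Added {num_added} new tokens")

# Push so it can be loaded as in step 1 above (hypothetical repo id).
tokenizer.push_to_hub("username/repo-id")
```

After this, `model.resize_token_embeddings(len(tokenizer))` in step 2 grows the embedding matrix so the new token ids have trainable rows.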
Dear @sanchit-gandhi, I was following your tutorial, Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers, to fine-tune Whisper on a dataset in the Amharic language. Amharic appears in Whisper's training data only as speech translation [Amharic audio -> corresponding English translation text], so the Amharic characters themselves were never seen during Whisper training. The dataset I am fine-tuning on is [Amharic audio -> corresponding text in Amharic characters]. It consists of 92.28 hours (32,901 instances) for the training set and 9.12 hours (3,139 instances) for the test set. My data sources are:
I tried the tiny, base, and small model sizes. In my first run with whisper-small, I observed poor performance, but when I tried to adjust some parameters, including the model size, I could not even run the code. I am not sure how to introduce the Amharic characters other than by providing the corresponding text, as in the Hindi example. The problem may also be related to properly including a code segment for GPU usage or cache handling. I am using a local server with three GPUs, each with around 10 GB of memory. I would appreciate your comments on configuring the local server GPUs, or on handling a language whose characters were not seen in Whisper training because it was treated as speech translation only. Thank you!
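On the GPU side, a hedged sketch of how the tutorial's training arguments might be adapted to ~10 GB cards; the batch size, accumulation steps, and output directory below are illustrative assumptions, not values from the tutorial:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative values for ~10 GB GPUs; tune for your own setup.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-am",  # hypothetical output path
    per_device_train_batch_size=8,    # small per-GPU batch to fit in ~10 GB
    gradient_accumulation_steps=2,    # effective batch = 8 * 2 * n_gpus
    gradient_checkpointing=True,      # trade compute for activation memory
    fp16=True,                        # halve activation memory on CUDA
    learning_rate=1e-5,
    max_steps=4000,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
)
```

Launching the script with `CUDA_VISIBLE_DEVICES=0,1,2 python run.py` (or via `torchrun --nproc_per_node=3 run.py` for distributed data parallelism) makes all three local GPUs visible, and the Trainer picks them up automatically.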