
How to introduce new alphabets in Whisper fine-tuning #1702

Open mequanent opened 7 months ago

mequanent commented 7 months ago

Dear @sanchit-gandhi, I was following your tutorial, Fine-Tune Whisper For Multilingual ASR with šŸ¤— Transformers, to fine-tune Whisper with a dataset in Amharic language. Amharic is used in Whisper training as speech-translation only, [Amharic audio -> corresponding English translation text]. Hence the Amharic alphabets are unseen in Whisper training. The dataset I am trying to fine-tune with is [Amharic audio -> corresponding text in Amharic characters]. It consists of 92.28 hours (32901 instances) for training and 9.12 hours (3139 instances) for testing set. My data sources are:

  1. https://github.com/getalp/ALFFA_PUBLIC/tree/master/ASR/AMHARIC and
  2. https://www.findke.ovgu.de/findke/en/Research/Data+Sets/Amharic+Speech+Corpus.html

I tried the tiny, base, and small model sizes. In my first run with whisper-small I observed poor performance, but when I tried to play around with some parameters, including the model size, I could not even run the code. I am not quite sure how to introduce the Amharic characters other than by giving the corresponding text, as in the Hindi example. It may also be related to properly configuring GPU usage or cache handling in the code. I am using a local server with three GPUs, each with around 10 GB of memory. I would appreciate your advice on configuring the local server GPUs, and on handling a language whose characters were not seen in Whisper training because it was treated as speech translation only. Thank you!

sanchit-gandhi commented 7 months ago

Hey @mequanent! What I'd advise you do in this circumstance is:

  1. Train a new BPE tokenizer from the Whisper tokenizer, using your target transcriptions in Amharic (see https://huggingface.co/learn/nlp-course/chapter6/2). The new tokenizer will then include Amharic characters and sub-word tokens (a sketch is included after this list).
  2. Resize the model embedding layer to your new tokenizer length:

```python
from transformers import WhisperTokenizer, WhisperForConditionalGeneration

# load the pre-trained tokenizer and model
tokenizer = WhisperTokenizer.from_pretrained("username/repo-id")  # replace with the repo-id where you've pushed your new tokenizer
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# add new random embeddings for the new tokenizer
model.resize_token_embeddings(len(tokenizer))
```
  3. Pass this model to the trainer as before.
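
For step 1, here is a minimal sketch of how training the new tokenizer could look, using `train_new_from_iterator`, which is available on fast (šŸ¤— Tokenizers-backed) tokenizers as covered in the linked course chapter. The dataset repo-id, the transcription column name `sentence`, and the output repo-id are placeholders you would replace with your own:

```python
from datasets import load_dataset
from transformers import WhisperTokenizerFast

# load the pre-trained (fast) Whisper tokenizer and your Amharic dataset
old_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small")
dataset = load_dataset("username/amharic-asr", split="train")  # placeholder repo-id

# yield the Amharic target transcriptions in batches
def get_training_corpus(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["sentence"]  # placeholder column name

# train a new BPE tokenizer on the Amharic text, re-using the defaults
# (special tokens, tokenization pipeline) of the original Whisper tokenizer
tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(), vocab_size=old_tokenizer.vocab_size
)

# push the new tokenizer to the Hub so it can be loaded in step 2
tokenizer.push_to_hub("username/whisper-small-amharic-tokenizer")  # placeholder repo-id
```

Since `train_new_from_iterator` should carry over Whisper's special tokens (e.g. the language and task tokens), the retrained tokenizer can be loaded in step 2 in place of the original one, and `len(tokenizer)` will reflect the new Amharic vocabulary when resizing the embeddings.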