
How to introduce new alphabets in Whisper fine-tuning #1702

Open mequanent opened 7 months ago

mequanent commented 7 months ago

Dear @sanchit-gandhi, I was following your tutorial, Fine-Tune Whisper For Multilingual ASR with šŸ¤— Transformers, to fine-tune Whisper with a dataset in Amharic language. Amharic is used in Whisper training as speech-translation only, [Amharic audio -> corresponding English translation text]. Hence the Amharic alphabets are unseen in Whisper training. The dataset I am trying to fine-tune with is [Amharic audio -> corresponding text in Amharic characters]. It consists of 92.28 hours (32901 instances) for training and 9.12 hours (3139 instances) for testing set. My data sources are:

  1. https://github.com/getalp/ALFFA_PUBLIC/tree/master/ASR/AMHARIC and
  2. https://www.findke.ovgu.de/findke/en/Research/Data+Sets/Amharic+Speech+Corpus.html

I tried the tiny, base, and small model sizes. In my first run with whisper-small I observed poor performance, but when I tried to play around with some parameters, including the model size, I could not even run the code. I am not quite sure how to introduce the Amharic characters other than by giving the corresponding text, as in the Hindi example. It may also be related to properly configuring GPU usage or cache handling in the code. I am using a local server with three GPUs, each with around 10 GB of memory. I would appreciate your advice on configuring the local server GPUs, and on handling a language whose characters were not seen in Whisper training because it was treated as speech translation only. Thank you!

sanchit-gandhi commented 7 months ago

Hey @mequanent! What I'd advise you do in this circumstance is:

  1. Train a new BPE tokenizer from the Whisper tokenizer, using your target transcriptions in Amharic (see https://huggingface.co/learn/nlp-course/chapter6/2). The new tokenizer will then include Amharic characters and sub-word tokens (a sketch is included after this list).
  2. Resize the model embedding layer to your new tokenizer length:

```python
from transformers import WhisperTokenizer, WhisperForConditionalGeneration

# load the pre-trained tokenizer and model
tokenizer = WhisperTokenizer.from_pretrained("username/repo-id")  # replace with the repo-id where you've pushed your new tokenizer
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# add new random embeddings for the new tokenizer
model.resize_token_embeddings(len(tokenizer))
```
  3. Pass this model to the trainer as before.
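
For step 1, here is a minimal sketch of how training the new tokenizer could look, using `train_new_from_iterator`, which is available on fast (šŸ¤— Tokenizers-backed) tokenizers as covered in the linked course chapter. The dataset repo-id, the transcription column name `sentence`, and the output repo-id are placeholders you would replace with your own:

```python
from datasets import load_dataset
from transformers import WhisperTokenizerFast

# load the pre-trained (fast) Whisper tokenizer and your Amharic dataset
old_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small")
dataset = load_dataset("username/amharic-asr", split="train")  # placeholder repo-id

# yield the Amharic target transcriptions in batches
def get_training_corpus(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["sentence"]  # placeholder column name

# train a new BPE tokenizer on the Amharic text, re-using the defaults
# (special tokens, tokenization pipeline) of the original Whisper tokenizer
tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(), vocab_size=old_tokenizer.vocab_size
)

# push the new tokenizer to the Hub so it can be loaded in step 2
tokenizer.push_to_hub("username/whisper-small-amharic-tokenizer")  # placeholder repo-id
```

Since `train_new_from_iterator` should carry over Whisper's special tokens (e.g. the language and task tokens), the retrained tokenizer can be loaded in step 2 in place of the original one, and `len(tokenizer)` will reflect the new Amharic vocabulary when resizing the embeddings.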