
How to introduce new alphabets in Whisper fine-tuning #1702

Open mequanent opened 11 months ago

mequanent commented 11 months ago

Dear @sanchit-gandhi, I was following your tutorial, Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers, to fine-tune Whisper on a dataset in the Amharic language. In Whisper's pre-training, Amharic is used only for speech translation [Amharic audio -> corresponding English translation text], so the Amharic alphabet was never seen by the model. The dataset I am trying to fine-tune with is [Amharic audio -> corresponding text in Amharic characters]. It consists of 92.28 hours (32,901 instances) for training and 9.12 hours (3,139 instances) for testing. My data sources are:

  1. https://github.com/getalp/ALFFA_PUBLIC/tree/master/ASR/AMHARIC and
  2. https://www.findke.ovgu.de/findke/en/Research/Data+Sets/Amharic+Speech+Corpus.html

I tried the tiny, base, and small model sizes. In my first run with whisper-small I observed poor performance, but when I tried to play around with some parameters, including the model size, I was unable to even run the code. I am not quite sure how to introduce the Amharic characters other than by giving the corresponding text, as in the Hindi example. I would appreciate your comments on fine-tuning for a language whose characters were not seen during Whisper training because it was covered only as speech translation. Thank you!

sanchit-gandhi commented 11 months ago

Hey @mequanent! What I'd advise you do in this circumstance is:

  1. Train a new BPE tokenizer from the Whisper tokenizer, using your target transcriptions in Amharic: https://huggingface.co/learn/nlp-course/chapter6/2 (see the sketch after this list). The new tokenizer will then include Amharic characters and sub-word tokens.
  2. Resize the model embedding layer to your new tokenizer length:
    
```python
from transformers import WhisperTokenizer, WhisperForConditionalGeneration

# load the pre-trained tokenizer and model
tokenizer = WhisperTokenizer.from_pretrained("username/repo-id")  # replace with the repo-id where you've pushed your new tokenizer
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# add new random embeddings for the new tokenizer
model.resize_token_embeddings(len(tokenizer))
```


  3. Pass this model to the trainer as before.
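
For step 1, here is a minimal sketch of what training the new tokenizer could look like, following the linked NLP-course chapter. The example sentence, the `vocab_size`, and the save path are placeholder assumptions to adapt to your own data:

```python
from transformers import WhisperTokenizerFast

# start from the pre-trained Whisper tokenizer (the fast class is needed for training)
old_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small")

# placeholder corpus: in practice, iterate over the Amharic transcriptions of your training set
amharic_sentences = ["ሰላም ለዓለም"]

def get_training_corpus(batch_size=1000):
    # yield the transcriptions in batches, as expected by train_new_from_iterator
    for i in range(0, len(amharic_sentences), batch_size):
        yield amharic_sentences[i : i + batch_size]

# train a new byte-level BPE tokenizer on the Amharic text; the vocab size here is just an assumption
new_tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=32000)

# save locally, or push to the Hub so it can be loaded in step 2
new_tokenizer.save_pretrained("./whisper-small-am-tokenizer")
# new_tokenizer.push_to_hub("username/repo-id")
```

`train_new_from_iterator` re-uses the configuration of the original tokenizer, so the new one should keep Whisper's special tokens (e.g. `<|startoftranscript|>`), but it's worth verifying this before training.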
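
And for step 3, a short sketch (under the same assumptions, with the tokenizer path from the sketch above) of wiring the new tokenizer into a `WhisperProcessor`, so the data collator and `Seq2SeqTrainer` setup from the blog post can be reused unchanged:

```python
from transformers import WhisperFeatureExtractor, WhisperProcessor, WhisperTokenizerFast

# keep Whisper's original feature extractor, but pair it with the new Amharic tokenizer
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizerFast.from_pretrained("./whisper-small-am-tokenizer")
processor = WhisperProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# the resized model from step 2 and this processor then slot into the
# Seq2SeqTrainer setup from the blog post exactly as before
```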