vasistalodagala / whisper-finetune

Fine-tune and evaluate Whisper models for Automatic Speech Recognition (ASR) on custom datasets or datasets from huggingface.
MIT License

How to add a new language? #8

Open yfliao opened 1 year ago

yfliao commented 1 year ago

Dear All,

I would like to recognize Taiwanese Hakka speech with a fine-tuned Whisper model. However, WhisperTokenizer does not accept Hakka as a language. Does anyone know how to add a new language?

Here are my training command and the resulting log:

ngpu=10  # number of GPUs to perform distributed training on.

torchrun --nproc_per_node=${ngpu} train/fine-tune_on_custom_dataset.py \
--model_name vasista22/whisper-telugu-base \
--language hakka \
--sampling_rate 16000 \
--num_proc 4 \
--train_strategy epoch \
--learning_rate 3e-3 \
--warmup 1000 \
--train_batchsize 16 \
--eval_batchsize 8 \
--num_epochs 20 \
--resume_from_ckpt None \
--output_dir op_dir_epoch \
--train_datasets output_data/train  \
--eval_datasets output_data/dev output_data/test

multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/usr1/liao/whisper-hakka/train/fine-tune_on_custom_dataset.py", line 198, in prepare_dataset
    batch["labels"] = processor.tokenizer(transcription).input_ids
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2538, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2644, in _call_one
    return self.encode_plus(
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2717, in encode_plus
    return self._encode_plus(
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 652, in _encode_plus
    return self.prepare_for_model(
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3156, in prepare_for_model
    total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 479, in num_special_tokens_to_add
    return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 428, in build_inputs_with_special_tokens
    return self.prefix_tokens + token_ids_0 + [self.eos_token_id]
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 406, in prefix_tokens
    raise ValueError(
ValueError: Unsupported language: hakka. Language should be one of: ['english', 'chinese', 'german', 'spanish', 'russian', 'korean', 'french', 'japanese', 'portuguese', 'turkish', 'polish', 'catalan', 'dutch', 'arabic', 'swedish', 'italian', 'indonesian', 'hindi', 'finnish', 'vietnamese', 'hebrew', 'ukrainian', 'greek', 'malay', 'czech', 'romanian', 'danish', 'hungarian', 'tamil', 'norwegian', 'thai', 'urdu', 'croatian', 'bulgarian', 'lithuanian', 'latin', 'maori', 'malayalam', 'welsh', 'slovak', 'telugu', 'persian', 'latvian', 'bengali', 'serbian', 'azerbaijani', 'slovenian', 'kannada', 'estonian', 'macedonian', 'breton', 'basque', 'icelandic', 'armenian', 'nepali', 'mongolian', 'bosnian', 'kazakh', 'albanian', 'swahili', 'galician', 'marathi', 'punjabi', 'sinhala', 'khmer', 'shona', 'yoruba', 'somali', 'afrikaans', 'occitan', 'georgian', 'belarusian', 'tajik', 'sindhi', 'gujarati', 'amharic', 'yiddish', 'lao', 'uzbek', 'faroese', 'haitian creole', 'pashto', 'turkmen', 'nynorsk', 'maltese', 'sanskrit', 'luxembourgish', 'myanmar', 'tibetan', 'tagalog', 'malagasy', 'assamese', 'tatar', 'hawaiian', 'lingala', 'hausa', 'bashkir', 'javanese', 'sundanese', 'burmese', 'valencian', 'flemish', 'haitian', 'letzeburgesch', 'pushto', 'panjabi', 'moldavian', 'moldovan', 'sinhalese', 'castilian'].
"""
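
For readers following the traceback above: the error comes from the tokenizer checking the `--language` value against a fixed list of names before it builds the prefix tokens. The sketch below imitates that check in plain Python and shows one possible workaround, which is to map an unsupported language to a supported proxy before calling the script. `SUPPORTED_LANGUAGES` is only a subset of the names in the error message, and `PROXY_LANGUAGES` and `resolve_language` are hypothetical names introduced for illustration. Whether `chinese` is a suitable proxy for Hakka is an assumption, not something confirmed in this thread.

```python
# Subset of the language names listed in the ValueError above (not the full list).
SUPPORTED_LANGUAGES = {
    "english", "chinese", "japanese", "korean", "vietnamese", "telugu",
}

# Hypothetical mapping, for illustration only: an unsupported language is
# replaced by a supported one whose language token is reused during fine-tuning.
PROXY_LANGUAGES = {"hakka": "chinese"}


def resolve_language(requested: str) -> str:
    """Return a language name from the supported set, or raise ValueError.

    Mirrors the behaviour seen in the log: unknown names are rejected,
    unless a proxy has been registered in PROXY_LANGUAGES.
    """
    name = requested.strip().lower()
    if name in SUPPORTED_LANGUAGES:
        return name
    if name in PROXY_LANGUAGES:
        return PROXY_LANGUAGES[name]
    raise ValueError(
        f"Unsupported language: {name}. "
        f"Language should be one of: {sorted(SUPPORTED_LANGUAGES)}."
    )


if __name__ == "__main__":
    # The resolved name is what would be passed as --language.
    print(resolve_language("hakka"))
```

Under these assumptions, the command above would be run with `--language chinese` instead of `--language hakka`. The model would then learn Hakka transcriptions under the reused language token.
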

mujhenahiata commented 1 month ago

How are you fine-tuning, @yfliao?