yfliao opened this issue 1 year ago
Dear All,
I would like to recognize Taiwanese Hakka speech with a fine-tuned Whisper model. However, Hakka is not supported by WhisperTokenizer. Any ideas?
Here is my code and log:
```bash
ngpu=10 # number of GPUs to perform distributed training on.

torchrun --nproc_per_node=${ngpu} train/fine-tune_on_custom_dataset.py \
  --model_name vasista22/whisper-telugu-base \
  --language hakka \
  --sampling_rate 16000 \
  --num_proc 4 \
  --train_strategy epoch \
  --learning_rate 3e-3 \
  --warmup 1000 \
  --train_batchsize 16 \
  --eval_batchsize 8 \
  --num_epochs 20 \
  --resume_from_ckpt None \
  --output_dir op_dir_epoch \
  --train_datasets output_data/train \
  --eval_datasets output_data/dev output_data/test
```

This fails with:

```
ValueError: Unsupported language: hakka. Language should be one of: ['english', 'chinese', 'german', 'spanish', 'russian', 'korean', 'french', 'japanese', 'portuguese', 'turkish', 'polish', 'catalan', 'dutch', 'arabic', 'swedish', 'italian', 'indonesian', 'hindi', 'finnish', 'vietnamese', 'hebrew', 'ukrainian', 'greek', 'malay', 'czech', 'romanian', 'danish', 'hungarian', 'tamil', 'norwegian', 'thai', 'urdu', 'croatian', 'bulgarian', 'lithuanian', 'latin', 'maori', 'malayalam', 'welsh', 'slovak', 'telugu', 'persian', 'latvian', 'bengali', 'serbian', 'azerbaijani', 'slovenian', 'kannada', 'estonian', 'macedonian', 'breton', 'basque', 'icelandic', 'armenian', 'nepali', 'mongolian', 'bosnian', 'kazakh', 'albanian', 'swahili', 'galician', 'marathi', 'punjabi', 'sinhala', 'khmer', 'shona', 'yoruba', 'somali', 'afrikaans', 'occitan', 'georgian', 'belarusian', 'tajik', 'sindhi', 'gujarati', 'amharic', 'yiddish', 'lao', 'uzbek', 'faroese', 'haitian creole', 'pashto', 'turkmen', 'nynorsk', 'maltese', 'sanskrit', 'luxembourgish', 'myanmar', 'tibetan', 'tagalog', 'malagasy', 'assamese', 'tatar', 'hawaiian', 'lingala', 'hausa', 'bashkir', 'javanese', 'sundanese', 'burmese', 'valencian', 'flemish', 'haitian', 'letzeburgesch', 'pushto', 'panjabi', 'moldavian', 'moldovan', 'sinhalese', 'castilian'].
```
```
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/usr1/liao/whisper-hakka/train/fine-tune_on_custom_dataset.py", line 198, in prepare_dataset
    batch["labels"] = processor.tokenizer(transcription).input_ids
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2538, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2644, in _call_one
    return self.encode_plus(
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2717, in encode_plus
    return self._encode_plus(
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 652, in _encode_plus
    return self.prepare_for_model(
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3156, in prepare_for_model
    total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 479, in num_special_tokens_to_add
    return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 428, in build_inputs_with_special_tokens
    return self.prefix_tokens + token_ids_0 + [self.eos_token_id]
  File "/home/liao/anaconda3/envs/pytorch/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 406, in prefix_tokens
    raise ValueError(
ValueError: Unsupported language: hakka.
Language should be one of: ['english', 'chinese', 'german', 'spanish', 'russian', 'korean', 'french', 'japanese', 'portuguese', 'turkish', 'polish', 'catalan', 'dutch', 'arabic', 'swedish', 'italian', 'indonesian', 'hindi', 'finnish', 'vietnamese', 'hebrew', 'ukrainian', 'greek', 'malay', 'czech', 'romanian', 'danish', 'hungarian', 'tamil', 'norwegian', 'thai', 'urdu', 'croatian', 'bulgarian', 'lithuanian', 'latin', 'maori', 'malayalam', 'welsh', 'slovak', 'telugu', 'persian', 'latvian', 'bengali', 'serbian', 'azerbaijani', 'slovenian', 'kannada', 'estonian', 'macedonian', 'breton', 'basque', 'icelandic', 'armenian', 'nepali', 'mongolian', 'bosnian', 'kazakh', 'albanian', 'swahili', 'galician', 'marathi', 'punjabi', 'sinhala', 'khmer', 'shona', 'yoruba', 'somali', 'afrikaans', 'occitan', 'georgian', 'belarusian', 'tajik', 'sindhi', 'gujarati', 'amharic', 'yiddish', 'lao', 'uzbek', 'faroese', 'haitian creole', 'pashto', 'turkmen', 'nynorsk', 'maltese', 'sanskrit', 'luxembourgish', 'myanmar', 'tibetan', 'tagalog', 'malagasy', 'assamese', 'tatar', 'hawaiian', 'lingala', 'hausa', 'bashkir', 'javanese', 'sundanese', 'burmese', 'valencian', 'flemish', 'haitian', 'letzeburgesch', 'pushto', 'panjabi', 'moldavian', 'moldovan', 'sinhalese', 'castilian'].
"""
```
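For context, the traceback shows the error is raised in `WhisperTokenizer.prefix_tokens`, which validates the requested language against Whisper's fixed set of pretrained language tokens; there is no `<|hakka|>` token in the vocabulary. A common workaround (not an official fix) is to fine-tune under a related supported language token such as `chinese` and let the model learn the mapping during training. A minimal sketch, assuming a stock `transformers` install and the `openai/whisper-base` checkpoint:

```python
from transformers import WhisperTokenizer

# "hakka" is not in Whisper's language list, so pick a related
# supported language as a proxy for the prefix tokens.
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-base",  # any Whisper checkpoint works here
    language="chinese",     # resolves to the <|zh|> language token
    task="transcribe",
)

# The prefix now resolves without raising ValueError; these ids encode
# <|startoftranscript|> <|zh|> <|transcribe|> <|notimestamps|>
print(tokenizer.prefix_tokens)

# Tokenizing a Hakka transcript then works; during fine-tuning the
# model simply learns to associate <|zh|> with Hakka speech.
labels = tokenizer("厓係客家人").input_ids
print(labels)
```

With the script above, that would mean passing `--language chinese` (or another supported language) instead of `--language hakka`; the transcripts themselves stay in Hakka.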
How are you fine-tuning, @yfliao?
> Dear All,
> I would like to recognize Taiwanese Hakka speech with a fine-tuned Whisper model. However, Hakka is not supported by WhisperTokenizer. Any ideas?
> Here is my code and log:
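If reusing an existing language token is not acceptable, another route is to register a brand-new language token and resize the model's embeddings, then build the label prefix by hand, since `prefix_tokens` keeps validating against the built-in list. This is a hand-rolled extension rather than a supported `transformers` API for adding Whisper languages, and the new token is trained from scratch; a rough sketch, where the token name `<|hakka|>` is my own placeholder:

```python
from transformers import WhisperTokenizer, WhisperForConditionalGeneration

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

# Register a new special token for Hakka. Whisper was never pretrained
# with it, so its embedding starts untrained.
tokenizer.add_tokens(["<|hakka|>"], special_tokens=True)

# Grow the embedding matrix to cover the new token id.
model.resize_token_embeddings(len(tokenizer))

# Build the label sequence manually, mirroring Whisper's usual prefix:
# <|startoftranscript|> <|hakka|> <|transcribe|> <|notimestamps|> ... <|endoftext|>
sot_id = tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
hakka_id = tokenizer.convert_tokens_to_ids("<|hakka|>")
transcribe_id = tokenizer.convert_tokens_to_ids("<|transcribe|>")
notimestamps_id = tokenizer.convert_tokens_to_ids("<|notimestamps|>")

text_ids = tokenizer("厓係客家人", add_special_tokens=False).input_ids
labels = [sot_id, hakka_id, transcribe_id, notimestamps_id] + text_ids + [tokenizer.eos_token_id]
```

Note that `generate(language=...)` performs the same validation at inference time, so decoding would also need the prefix forced manually (e.g. via `forced_decoder_ids`) rather than through the `language` argument.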