jupyter notebooks to fine-tune whisper models on vietnamese using kaggle (should also work on colab but not thoroughly tested)
using my collection of vietnamese speech datasets: https://huggingface.co/collections/doof-ferb/vietnamese-speech-dataset-65c6af8c15c9950537862fa6
N.B.1 importing any trainer or pipeline class from transformers crashes the kaggle TPU session (see huggingface/transformers#28609), so better use GPU
N.B.2 the trainer class from transformers can use multi-GPU like kaggle free T4×2 without code change, but by default it uses naive model parallelism, which cannot fully use all GPUs at the same time, so better use distributed data parallelism
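to illustrate the data parallelism mentioned in N.B.2: a toy sketch in pure python (no GPU needed) of how one global batch is split so that every GPU works on its own share at the same time. the helper name `shard_for_rank` is hypothetical; the strided split is assumed to resemble what PyTorch's `DistributedSampler` does without shuffling, and padding of uneven batches is ignored. in practice the training script would be started with a launcher such as `torchrun` or `accelerate launch`.

```python
def shard_for_rank(indices, rank, world_size):
    """Return the sample indices that one process (one GPU) would handle.

    Hypothetical helper: a strided split, so the ranks get
    near-equal, non-overlapping shares of the batch.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return indices[rank::world_size]
```

with 8 samples and 2 GPUs, rank 0 would take samples 0, 2, 4, 6 and rank 1 would take samples 1, 3, 5, 7.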
N.B.3 use the default greedy search, because beam search triggers a spike in VRAM usage which may cause out-of-memory (original whisper uses num beams = 5, something like do_sample=True, num_beams=5)
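a minimal sketch of the two decoding setups from N.B.3, written as plain keyword dicts. `num_beams` and `do_sample` are real arguments of `generate()` in transformers; the dict names and the `decoding_kwargs` helper are hypothetical, and the beam search values simply copy the ones quoted above.

```python
# Keyword arguments meant to be unpacked into model.generate(**kwargs).
GREEDY = {"num_beams": 1, "do_sample": False}  # greedy search: lowest VRAM use
BEAM_5 = {"num_beams": 5, "do_sample": True}   # values quoted in N.B.3


def decoding_kwargs(low_vram=True):
    """Pick greedy search when VRAM is tight, beam search otherwise."""
    return dict(GREEDY if low_vram else BEAM_5)
```

returning a copy keeps the two presets unchanged if the caller adds more arguments to the result.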
N.B.4 if using kaggle + resuming training, remember to enable files persistence before launching
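when resuming, the last saved checkpoint folder has to be found again. a small sketch, assuming checkpoints are saved as `checkpoint-<step>` folders (the naming used by the transformers trainer) inside one output folder; the helper name `latest_checkpoint` is hypothetical.

```python
import os
import re

_CKPT = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CKPT.match(name)
        path = os.path.join(output_dir, name)
        # Only real folders count; stray files with a matching name are skipped.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

the result could be passed as `trainer.train(resume_from_checkpoint=...)`; when it is `None`, training starts from scratch.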
evaluate accuracy (WER) with batched inference:
fine-tune whisper tiny with traditional approach:
fine-tune whisper large with PEFT-LoRA + int8:
(testing - not always working) fine-tune wav2vec v2 bert: w2v-bert-v2.ipynb
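for reference, the WER used in the evaluation item above is (substitutions + deletions + insertions) / number of reference words. a pure-python sketch of that formula, not the notebook's actual code; no text normalization is applied, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER: total word errors / total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words
```

summing errors and words over the whole test set before dividing weights long utterances more than short ones, which differs from averaging per-utterance WER.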
docker image to run on AWS EC2: Dockerfile, comes with standalone scripts
convert to openai-whisper, whisper.cpp, faster-whisper, ONNX, TensorRT: not yet
miscellaneous: convert to huggingface audio datasets format
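one possible route for such a conversion (an assumption, not necessarily what the author's conversion does): the `audiofolder` loader of the datasets library reads a folder of audio files plus a `metadata.csv` whose `file_name` column points to each file. a sketch that writes such a file; the `transcription` column name and the `write_metadata` helper are hypothetical.

```python
import csv
import os


def write_metadata(folder, transcripts):
    """Write metadata.csv mapping audio file names to transcripts.

    transcripts: dict of {audio file name relative to folder: text}.
    """
    path = os.path.join(folder, "metadata.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "transcription"])
        for file_name, text in sorted(transcripts.items()):
            writer.writerow([file_name, text])
    return path
```

the folder could then be loaded with `load_dataset("audiofolder", data_dir=folder)`.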