A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
What is your question?
When fine-tuning my model, should I prepare audio clips shorter than 15 s? I have many recordings longer than 1 minute. Should I split them manually, or is there a more convenient way? Can I use the VAD model during the fine-tuning process?
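To make the splitting part of the question concrete, here is a minimal sketch of what "convenient" splitting could look like. It assumes a VAD model has already produced speech segments as `[start_ms, end_ms]` pairs; that format, the helper name `group_segments`, and the 15 s limit as a hard cap are all assumptions for illustration, not the toolkit's documented behavior.

```python
def group_segments(segments, max_ms=15_000):
    """Merge consecutive [start_ms, end_ms] speech segments into chunks
    whose total span does not exceed max_ms (hypothetical helper)."""
    chunks = []
    cur_start = cur_end = None
    for start, end in segments:
        # A single segment longer than max_ms is hard-split into fixed pieces.
        while end - start > max_ms:
            if cur_start is not None:
                chunks.append((cur_start, cur_end))
                cur_start = cur_end = None
            chunks.append((start, start + max_ms))
            start += max_ms
        if cur_start is None:
            # Start a new chunk with this segment.
            cur_start, cur_end = start, end
        elif end - cur_start <= max_ms:
            # Segment still fits: extend the current chunk.
            cur_end = end
        else:
            # Segment would overflow: close the chunk and start a new one.
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks


# Example VAD output in milliseconds (made-up values).
vad_segments = [[0, 4000], [4500, 9000], [9500, 16000], [20000, 50000]]
chunks = group_segments(vad_segments)
```

The resulting `(start_ms, end_ms)` pairs could then be used to cut the waveform by sample index. Note that the transcripts would also need to be split to match each chunk, which this sketch does not address.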
What's your environment?
pip