modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

VAD results are very poor; is this a usage problem? #2217

Closed young1013 closed 1 hour ago

young1013 commented 1 hour ago

Notice: To help resolve issues more efficiently, please raise issues following the template and include the relevant details.

❓ Questions and Help

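In the WAV file below, the speech segments and silent segments are clearly distinguishable, but the actual VAD output has the following problems: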

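1. Why does each segment's start_time in the VAD result sit directly against the previous segment's end_time, with no gap between them?
2. The 15305 timestamp in the figure is inaccurate; it clearly falls in the middle of speech content.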

a.wav.zip

image image

{ "duration": 14125, "start": 1180, "end": 15305, "text": "英国工厂纺织工十八世纪下半叶产业革命。首先#从西欧的纺织业开始,机器的发明,使工人从手工劳动中初步的解脱出来," },
{ "duration": 14845, "start": 15305, "end": 30150, "text": "为利用动力驱动的集中型大工业生产方式准备了条件。图为采用固定纺锤的纺织作纺,当时的纺织机不仅需要二至三人操作," },
{ "duration": 14020, "start": 30150, "end": 44170, "text": "而且效率低下贪婪的收税官 publcan 一词源于古希腊语,其原意是效劳,后来演化为收税人、" },
{ "duration": 13250, "start": 44170, "end": 57420, "text": "酒店老板等意思,在古希腊早期,税收原本是对付出劳动者的奖赏到古罗马时期,收税人被执政官统一任命," },
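The first observation can be checked directly from the numbers in the output above. The following is a minimal Python sketch that uses only the `start`, `end`, and `duration` values shown (the `text` fields are omitted); the helper names `gaps_between` and `durations_consistent` are made up for illustration and are not part of FunASR.

```python
# Segment timings copied from the output above (text fields omitted).
segments = [
    {"duration": 14125, "start": 1180, "end": 15305},
    {"duration": 14845, "start": 15305, "end": 30150},
    {"duration": 14020, "start": 30150, "end": 44170},
    {"duration": 13250, "start": 44170, "end": 57420},
]


def gaps_between(segments):
    """Return the gap in ms between each pair of consecutive segments."""
    return [nxt["start"] - cur["end"] for cur, nxt in zip(segments, segments[1:])]


def durations_consistent(segments):
    """Check that every reported duration equals end - start."""
    return all(s["end"] - s["start"] == s["duration"] for s in segments)


print(gaps_between(segments))          # a gap of 0 means back-to-back segments
print(durations_consistent(segments))
```

Every gap here is 0 ms, which matches the report that each segment begins exactly where the previous one ends.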

young1013 commented 1 hour ago

Please contact 乐天 at Ant Group via Alibaba DingTalk.

LauraGPT commented 1 hour ago

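Generally, when the timestamps from VAD segmentation do not line up with the actual audio, the cause is the sample rate. The model reports timestamps based on a 16000 Hz sample rate, so you need to convert them to your audio's sample rate.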
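The conversion described in this reply might look like the sketch below. The reply gives no formula, so the direction of the scaling is an assumption: it supposes the samples were passed to the model without resampling, so a timestamp computed at 16000 Hz points at sample index `ms * 16000 / 1000`, which is then mapped back to real time using the file's true rate. The names `rescale_timestamp_ms` and `read_sample_rate` are hypothetical helpers, not FunASR APIs.

```python
import wave

MODEL_SAMPLE_RATE = 16000  # Hz, the rate the reply says timestamps are based on


def read_sample_rate(path):
    """Read the sample rate of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def rescale_timestamp_ms(model_ms, audio_sample_rate,
                         model_sample_rate=MODEL_SAMPLE_RATE):
    """Map a timestamp computed at model_sample_rate onto the real audio timeline.

    Assumption: the audio was not resampled before inference, so the same
    sample index corresponds to a different time at the file's true rate.
    """
    return model_ms * model_sample_rate / audio_sample_rate


# If the file is already 16000 Hz, the timestamp is unchanged.
print(rescale_timestamp_ms(15305, 16000))
# For a hypothetical 8000 Hz file, the same timestamp maps to twice the time.
print(rescale_timestamp_ms(15305, 8000))
```

In practice one would call `read_sample_rate` on the actual file first. If it already reports 16000, the scaling is a no-op and the sample rate is probably not what is causing the mismatch.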