Test code:
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

asr_model_path = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
vad_model_path = "damo/speech_fsmn_vad_zh-cn-16k-common-pytorch"
punc_model_path = "damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch"

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model=asr_model_path,
    vad_model=vad_model_path,
    punc_model=punc_model_path,
)

audio_in = '/mnt/d/wsl/wav/18955588888.wav'
rec_result = inference_pipeline(audio_in=audio_in)
print(rec_result)
```
Inference result:

```
{'text': '幺八九五五四五三三三三三。', 'text_postprocessed': '幺八九五五四五三三三三三', 'sentences': []}
```
Without the VAD model (punc_model also commented out):

```python
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model=asr_model_path,
    # punc_model=punc_model_path,
)
```
Inference result:

```
{'text': '幺八九五五五八八八八八'}
```
Conclusion: inference without VAD is more accurate.
I am also certain that there is indeed a problem, and I have verified that it is not related to quantization of the VAD and timestamp speech recognition models, but rather to the C++ implementation of VAD. For the same speech (it is especially obvious on short speech, because deletion errors increase), there is a significant difference in time points between Python VAD inference and C++ VAD inference; the C++ version is more aggressive, even to the point of dropping speech entirely.
The same difference also shows up in Python between running with VAD and without it.
I tried debugging the code and found the following:
Without VAD, the loader is built with preprocess_args=speech2text.asr_train_args:

```python
loader = build_streaming_iterator(
    task_name="asr",
    preprocess_args=speech2text.asr_train_args,
    data_path_and_name_and_type=data_path_and_name_and_type,
    dtype=dtype,
    fs=fs,
    batch_size=batch_size,
    key_file=key_file,
    num_workers=num_workers,
)
```
With VAD, it is built with preprocess_args=None:

```python
loader = build_streaming_iterator(
    task_name="asr",
    preprocess_args=None,
    data_path_and_name_and_type=data_path_and_name_and_type,
    dtype=dtype,
    fs=fs,
    batch_size=1,
    key_file=key_file,
    num_workers=num_workers,
)
```
This makes the two code paths produce different results, and with VAD the inference is inaccurate. Is this a bug? Is there currently a workaround? My skills are limited and I don't know where to start; a sketch of the change I would expect is below.
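If the root cause really is this parameter difference, a minimal sketch of the fix I would expect (hypothetical and untested; all names come from the two snippets above):

```python
# Hypothetical fix: make the VAD branch preprocess audio the same way as the
# non-VAD branch by passing the ASR training args instead of None.
loader = build_streaming_iterator(
    task_name="asr",
    preprocess_args=speech2text.asr_train_args,  # was None in the VAD branch
    data_path_and_name_and_type=data_path_and_name_and_type,
    dtype=dtype,
    fs=fs,
    batch_size=1,  # the VAD branch keeps batch_size=1
    key_file=key_file,
    num_workers=num_workers,
)
```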
I also upgraded to the latest FunASR version and retested; the problem persists.
Code with VAD:

```python
from funasr import AutoModel

model = AutoModel(
    model="damo/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    model_revision="v2.0.4",
    vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_model_revision="v2.0.4",
    # punc_model_revision="v2.0.4",
    # spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
    # spk_model_revision="v2.0.2"
)

audio_in = '/mnt/d/wsl/wav/18955588888.wav'
res = model.generate(input=audio_in, hotword='五五五 八八八')
print(res)
```
Inference result:

```
[{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '幺 八 九 嗯 嗯 嗯 八 八 八 八 八', 'timestamp': [[550, 790], [990, 1230], [1570, 1995], [2890, 3130], [3410, 3650], [3990, 4415], [5250, 5490], [5710, 5950], [6250, 6615], [7470, 7710], [7970, 8210]]}]
```
五五五 ("five five five") was recognized as 嗯 嗯 嗯 ("um um um").
Code without VAD:

```python
from funasr import AutoModel

model = AutoModel(
    model="damo/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    model_revision="v2.0.4",
    # vad_model_revision="v2.0.4",
    # punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    # punc_model_revision="v2.0.4",
    # spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
    # spk_model_revision="v2.0.2"
)

audio_in = '/mnt/d/wsl/wav/18955588888.wav'
res = model.generate(input=audio_in, hotword='五五五 八八八')
print(res)
```
Inference result:

```
[{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '幺 八 九 五 五 五 八 八 八 八 八', 'timestamp': [[510, 750], [990, 1230], [1590, 1830], [2870, 3110], [3430, 3670], [4030, 4270], [5210, 5450], [5710, 5950], [6230, 6470], [7350, 7590], [7830, 8185]]}]
```
As you can see, this result is completely accurate.
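For reference, here is a quick comparison of the token boundaries from the two runs (the millisecond values are copied verbatim from the outputs above; the comparison script itself is mine):

```python
# Compare per-token start/end times (ms) between the VAD and no-VAD runs.
with_vad = [[550, 790], [990, 1230], [1570, 1995], [2890, 3130], [3410, 3650],
            [3990, 4415], [5250, 5490], [5710, 5950], [6250, 6615],
            [7470, 7710], [7970, 8210]]
without_vad = [[510, 750], [990, 1230], [1590, 1830], [2870, 3110], [3430, 3670],
               [4030, 4270], [5210, 5450], [5710, 5950], [6230, 6470],
               [7350, 7590], [7830, 8185]]

for i, ((b1, e1), (b2, e2)) in enumerate(zip(with_vad, without_vad)):
    print(f"token {i}: start shift {b1 - b2:+d} ms, end shift {e1 - e2:+d} ms")
```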
However, the problem can indeed be worked around by converting the two-channel audio to mono.
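A minimal sketch of that workaround, assuming the input WAV is 2-channel (the soundfile usage here is mine, not from the issue):

```python
import soundfile as sf

src = '/mnt/d/wsl/wav/18955588888.wav'
audio, sr = sf.read(src)        # audio shape: (samples,) or (samples, channels)
if audio.ndim == 2:
    audio = audio.mean(axis=1)  # downmix stereo to mono by averaging channels
sf.write('/mnt/d/wsl/wav/18955588888_mono.wav', audio, sr)
```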
🐛 Bug
I originally thought it was an ONNX inference problem, and it bothered me for a long time. But yesterday, prompted by the code in another issue, I tried commenting out vad and punc entirely, and it turned out the root of the problem is the VAD model.
speech_fsmn_vad_zh-cn-16k-common-pytorch corresponds to speech_fsmn_vad_zh-cn-16k-common-onnx // but I have not found a way to disable VAD in the Docker version.
In the final tests I found that whenever VAD is enabled, the recognition result is sometimes inaccurate, and the error is quite noticeable.
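To see where the segments are being cut, one can also run the FSMN VAD model on its own and inspect the boundaries it emits; a minimal sketch using the same AutoModel API as above (the exact output shape is an assumption on my part):

```python
from funasr import AutoModel

# Run only the VAD model and print the speech segments it detects.
vad = AutoModel(model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
                model_revision="v2.0.4")
segments = vad.generate(input='/mnt/d/wsl/wav/18955588888.wav')
print(segments)  # expected form: [{'key': ..., 'value': [[beg_ms, end_ms], ...]}]
```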
I hope this problem gets fixed.
The other issue: https://github.com/alibaba-damo-academy/FunASR/issues/1431