modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Root cause found for the recent mismatch between ONNX model inference in Docker and Python inference #1434

Closed pony5551 closed 7 months ago

pony5551 commented 8 months ago

🐛 Bug

I originally thought this was an ONNX inference problem, and it bothered me for a long time. But yesterday, inspired by code in another issue, I tried commenting out the vad and punc models entirely, and found that the root cause is the VAD model itself.

speech_fsmn_vad_zh-cn-16k-common-pytorch corresponds to speech_fsmn_vad_zh-cn-16k-common-onnx // but I have not found a way to disable VAD in the Docker version

In the end, my tests show that whenever VAD is enabled, the recognition result is sometimes inaccurate, and the error is quite noticeable.

I hope this can be fixed.

The other issue: https://github.com/alibaba-damo-academy/FunASR/issues/1431

pony5551 commented 8 months ago

Test code:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

asr_model_path = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
vad_model_path = "damo/speech_fsmn_vad_zh-cn-16k-common-pytorch"
punc_model_path = "damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch"

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model=asr_model_path,
    vad_model=vad_model_path,
    punc_model=punc_model_path,
)

audio_in = '/mnt/d/wsl/wav/18955588888.wav'
rec_result = inference_pipeline(audio_in=audio_in)
print(rec_result)
```

Inference result: {'text': '幺八九五五四五三三三三三。', 'text_postprocessed': '幺八九五五四五三三三三三', 'sentences': []}

-------------------------------------

With vad and punc commented out:

```python
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model=asr_model_path,
    # vad_model=vad_model_path,
    # punc_model=punc_model_path,
)
```

Inference result: {'text': '幺八九五五五八八八八八'}

Conclusion: inference without VAD is more accurate.

18955588888.zip

lhanzl commented 8 months ago

I am also certain that there is a real problem here, and I have verified that it is not related to quantization of the VAD or timestamp speech-recognition models, but rather to the C++ implementation of VAD. For the same speech (it is especially obvious on short utterances, because deletion errors increase), there is a significant difference between the time points produced by Python VAD inference and C++ VAD inference; the C++ one is more aggressive, even to the point of dropping speech entirely.

pony5551 commented 8 months ago

> I am also certain that there is a real problem; it is not related to quantization of the VAD and timestamp speech-recognition models, but rather to the C++ implementation of VAD. For the same speech (especially short utterances, where deletion errors increase), Python VAD inference and C++ VAD inference differ significantly in time points, and the C++ one is more aggressive, even to the point of dropping speech.

The same difference also shows up in Python between running with VAD and without it.

I tried debugging the code and found the following.

Without VAD, the loader is built with `preprocess_args=speech2text.asr_train_args`:

```python
loader = build_streaming_iterator(
    task_name="asr",
    preprocess_args=speech2text.asr_train_args,
    data_path_and_name_and_type=data_path_and_name_and_type,
    dtype=dtype,
    fs=fs,
    batch_size=batch_size,
    key_file=key_file,
    num_workers=num_workers,
)
```

With VAD, it is built with `preprocess_args=None`:

```python
loader = build_streaming_iterator(
    task_name="asr",
    preprocess_args=None,
    data_path_and_name_and_type=data_path_and_name_and_type,
    dtype=dtype,
    fs=fs,
    batch_size=1,
    key_file=key_file,
    num_workers=num_workers,
)
```

This makes the two paths produce different results: with VAD, the inference is inaccurate. Is this a bug? Is there a workaround for now? My skills are limited and I don't know where to start.
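One plausible reading of the difference above is that `preprocess_args=None` causes the frontend preprocessing configured in `asr_train_args` (for example CMVN-style feature normalization) to be skipped for VAD-segmented audio. Whether that is the actual mechanism inside FunASR is an assumption on my part; the toy snippet below only illustrates how much per-utterance mean-variance normalization changes the values a model sees (the numbers are synthetic, not from FunASR):

```python
import statistics

# Synthetic frame-level values standing in for log-mel features.
feats = [12.0, 14.0, 9.0, 15.0, 10.0]

mean = statistics.fmean(feats)
std = statistics.pstdev(feats)
normalized = [(f - mean) / std for f in feats]

# A model trained on normalized features expects inputs near zero mean and
# unit variance; feeding it the raw values shifts every input by `mean`.
print("raw:", feats)
print("normalized:", [round(x, 3) for x in normalized])
```

If that is what is happening, a model that only ever saw normalized features during training would be scored on out-of-distribution inputs whenever VAD is enabled, which would match the degraded results above.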

pony5551 commented 8 months ago

I also upgraded to the latest FunASR version and retested; the problem persists.

Code with VAD:

```python
from funasr import AutoModel

model = AutoModel(
    model="damo/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    model_revision="v2.0.4",
    vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_model_revision="v2.0.4",
    punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    # punc_model_revision="v2.0.4",
    # spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
    # spk_model_revision="v2.0.2"
)

audio_in = '/mnt/d/wsl/wav/18955588888.wav'
res = model.generate(input=audio_in, hotword='五五五 八八八')
print(res)
```

Inference result: [{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '幺 八 九 嗯 嗯 嗯 八 八 八 八 八', 'timestamp': [[550, 790], [990, 1230], [1570, 1995], [2890, 3130], [3410, 3650], [3990, 4415], [5250, 5490], [5710, 5950], [6250, 6615], [7470, 7710], [7970, 8210]]}]

'五 五 五' was misrecognized as '嗯 嗯 嗯'.

Code without VAD:

```python
from funasr import AutoModel

model = AutoModel(
    model="damo/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    model_revision="v2.0.4",
    # vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    # vad_model_revision="v2.0.4",
    # punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    # punc_model_revision="v2.0.4",
    # spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
    # spk_model_revision="v2.0.2"
)

audio_in = '/mnt/d/wsl/wav/18955588888.wav'
res = model.generate(input=audio_in, hotword='五五五 八八八')
print(res)
```

Inference result: [{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '幺 八 九 五 五 五 八 八 八 八 八', 'timestamp': [[510, 750], [990, 1230], [1590, 1830], [2870, 3110], [3430, 3670], [4030, 4270], [5210, 5450], [5710, 5950], [6230, 6470], [7350, 7590], [7830, 8185]]}]

As you can see, the result is exactly right.

wangqin666 commented 8 months ago

But it can indeed be worked around by converting the audio from stereo to mono.
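For anyone wanting to try that workaround, here is a minimal stdlib sketch that downmixes a 16-bit PCM stereo WAV to mono by averaging the two channels before feeding the file to the pipeline. The function name `stereo_to_mono` and the 16-bit assumption are mine, not from FunASR:

```python
import array
import wave

def stereo_to_mono(src_path: str, dst_path: str) -> None:
    """Downmix a 16-bit PCM stereo WAV to mono by averaging the two channels."""
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 2, "expected a stereo file"
        assert src.getsampwidth() == 2, "expected 16-bit samples"
        rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))

    # Interleaved layout: L0 R0 L1 R1 ... -> average each L/R pair.
    mono = array.array(
        "h", ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2))
    )

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

Run this once on the input file and point `audio_in` at the mono output; tools like ffmpeg (`-ac 1`) achieve the same thing.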