Speaker ASR model accuracy issue (speaker identification)

lanyuer commented 7 months ago

Reference documentation: https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc-spk_asr_nat-zh-cn/summary

Version: funasr 0.8.6

Code:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

audio_in = 'wangfang.wav'
output_dir = "./results"

# ASR pipeline with VAD, punctuation, and speaker labels
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large-vad-punc-spk_asr_nat-zh-cn',
    model_revision='v0.0.2',
    vad_model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    punc_model='damo/punc_ct-transformer_cn-en-common-vocab471067-large',
    output_dir=output_dir,
)
rec_result = inference_pipeline(
    audio_in=audio_in,
    batch_size_token=5000,
    batch_size_token_threshold_s=40,
    max_single_segment_time=10000,
)
print(rec_result)

# Print (speaker label, text, "start-end") for each recognized sentence
for x in [(x['spk'], x['text'], f'{x["start"]}-{x["end"]}') for x in rec_result['sentences']]:
    print(x)
```

Result:

(0, '来来来介绍一下啊，', '900-2120')
(0, '这是我大姨这个请问您贵庚了，', '2120-5440')
(0, '贵庚八十。', '5440-6580')
(0, '哇，', '6580-7180')
(0, '这是我老妈，', '7180-8600')
(0, '请问您贵庚啊，', '8600-10410')
(0, '七十二，', '10410-11450')
(0, '哎呀，', '11450-12150')
(0, '这是我老爸啊，', '12150-13150')
(0, '六哦哥啊，', '13150-15650')
(0, '快八十了，', '15650-17075')
(0, '这我们家焖面，', '17075-19260')
(0, '我妈说吃太简单了，', '19260-20980')
(0, '这多好啊。', '20980-22180')
(0, '然后注意啊，', '22180-23460')
(0, '一定要就蒜，', '23460-24320')
(0, '一定要就蒜啊，', '24320-25500')

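The recording actually contains 4 speakers (3 women, 1 man). Sentences 3, 7, and 11 were not spoken by the default speaker, yet none of them was assigned a different speaker label.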
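A quick way to confirm the symptom is to count how many sentences landed on each speaker label. The sketch below assumes the result has the same shape as `rec_result['sentences']` above (dicts with `spk`, `text`, `start`, `end`); the helper name `speaker_summary` is made up for illustration, and the sample list is just the first three sentences from the output.

```python
from collections import Counter


def speaker_summary(sentences):
    """Count how many sentences were assigned to each speaker label."""
    return Counter(s["spk"] for s in sentences)


# First three sentences from the output above, in the rec_result['sentences'] shape
sentences = [
    {"spk": 0, "text": "来来来介绍一下啊，", "start": 900, "end": 2120},
    {"spk": 0, "text": "这是我大姨这个请问您贵庚了，", "start": 2120, "end": 5440},
    {"spk": 0, "text": "贵庚八十。", "start": 5440, "end": 6580},
]

counts = speaker_summary(sentences)
print(counts)
if len(counts) == 1:
    print("Only one speaker label was produced for all sentences.")
```

With the real pipeline output, `speaker_summary(rec_result['sentences'])` would give the same kind of summary.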

The audio is attached.

wf.zip

wanghuii1 commented 7 months ago

There are several limitations to speaker recognition currently in the pipeline. It may not perform well when the audio duration is too short (less than 60 seconds) or when the number of speakers is too large (more than 10). It cannot address the issue of overlapped speech. So it is recommended to try longer audio. @lanyuer
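For what it's worth, the transcript above ends at 25500, which is about 25.5 seconds if those timestamps are milliseconds, so the clip would fall under the 60-second guideline. A minimal sketch for checking a WAV file's duration before running the pipeline follows; it uses only the Python standard library, the helper names are hypothetical, and the demo file is generated silence rather than the attached audio.

```python
import os
import tempfile
import wave

MIN_DURATION_S = 60.0  # guideline taken from the comment above


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def long_enough_for_speaker_labels(path, min_duration_s=MIN_DURATION_S):
    """True if the file meets the minimum duration guideline."""
    return wav_duration_seconds(path) >= min_duration_s


# Demo input: 25 seconds of 16 kHz, 16-bit mono silence (not the real recording)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000 * 25)

print(wav_duration_seconds(demo_path))
print(long_enough_for_speaker_labels(demo_path))
```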

luohao123 commented 7 months ago

Just try pyannote.audio
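If speaker turns come from a separate diarization tool such as pyannote.audio, they still have to be matched to the ASR sentences. One simple option is to give each sentence the label of the turn it overlaps most. The sketch below assumes sentence timestamps are in milliseconds and turns are `(start_s, end_s, label)` tuples in seconds; the function names and the turns themselves are invented for illustration and are not output from any tool.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(sentences, turns):
    """Label each sentence with the speaker turn it overlaps the most.

    sentences: dicts with 'start' and 'end' in milliseconds.
    turns: (start_s, end_s, label) tuples in seconds.
    Sentences that overlap no turn get the label None.
    """
    labeled = []
    for s in sentences:
        s_start, s_end = s["start"] / 1000.0, s["end"] / 1000.0
        best_label, best_overlap = None, 0.0
        for t_start, t_end, label in turns:
            ov = overlap(s_start, s_end, t_start, t_end)
            if ov > best_overlap:
                best_label, best_overlap = label, ov
        labeled.append({**s, "spk": best_label})
    return labeled


# Sentence timings from the output above; the turns are hypothetical
sentences = [
    {"text": "来来来介绍一下啊，", "start": 900, "end": 2120},
    {"text": "这是我大姨这个请问您贵庚了，", "start": 2120, "end": 5440},
    {"text": "贵庚八十。", "start": 5440, "end": 6580},
]
turns = [(0.9, 5.4, "SPEAKER_00"), (5.4, 6.6, "SPEAKER_01")]

labels = [s["spk"] for s in assign_speakers(sentences, turns)]
print(labels)
```

Maximum overlap is a deliberately crude rule: a sentence that spans two speakers still gets a single label, and overlapped speech is not handled.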

plmsmile commented 1 week ago

Same here. The results are poor: every speaker label is 0.

modelscope / FunASR

Speaker ASR model accuracy issue (speaker identification) #1132