modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Running speech_diarization produces segment timestamps that exceed the length of the audio itself #1189

Open T0L0ve opened 8 months ago

T0L0ve commented 8 months ago

When running the speaker_diarization task, I found that the segment timestamps it produces exceed the total length of the input audio.

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_diar_pipline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    model_revision="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
)
audio_list = [
    "../2.wav",
    "../spk1.wav",
    "../spk2.wav",
    "../spk3.wav",
    "../spk4.wav",
]

results = inference_diar_pipline(audio_in=audio_list)
print(results)
{'text': 'spk1 [(0.0, 18.8), (55.36, 59.04), (68.16, 74.0), (93.92, 94.8), (95.6, 106.48), (152.88, 154.64), (158.16, 161.28)]\nspk2 [(18.8, 55.36), (58.88, 68.16), (74.0, 91.36), (94.8, 95.6), (106.48, 144.56), (154.64, 158.16)]\nspk3 [(91.28, 94.0)]\nspk4 [(125.84, 128.0), (130.96, 131.68), (144.56, 152.88)]'}

The audio is only 117 seconds long in total, yet the segmentation output contains results such as (144.56, 152.88).
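For anyone trying to reproduce this, the mismatch can be checked programmatically. The following is only a minimal sketch: it assumes the `'text'` field always has the `spkN [(start, end), ...]` layout printed above, it takes the 117-second duration from this report rather than measuring it, and the helper names (`parse_diarization_text`, `wav_duration_seconds`, `segments_beyond`) are hypothetical, not part of FunASR or ModelScope.

```python
import ast
import wave


def parse_diarization_text(text):
    """Parse 'spkN [(start, end), ...]' lines into {speaker: [(start, end), ...]}."""
    segments = {}
    for line in text.strip().splitlines():
        speaker, _, spans = line.partition(" ")
        segments[speaker] = [tuple(map(float, span)) for span in ast.literal_eval(spans)]
    return segments


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def segments_beyond(segments, duration):
    """Keep only the segments whose end time lies past the audio duration."""
    overruns = {}
    for speaker, spans in segments.items():
        late = [(start, end) for start, end in spans if end > duration]
        if late:
            overruns[speaker] = late
    return overruns


# Output copied from the report above.
result = {'text': 'spk1 [(0.0, 18.8), (55.36, 59.04), (68.16, 74.0), (93.92, 94.8), (95.6, 106.48), (152.88, 154.64), (158.16, 161.28)]\nspk2 [(18.8, 55.36), (58.88, 68.16), (74.0, 91.36), (94.8, 95.6), (106.48, 144.56), (154.64, 158.16)]\nspk3 [(91.28, 94.0)]\nspk4 [(125.84, 128.0), (130.96, 131.68), (144.56, 152.88)]'}

# Assumption: 117 s as stated in the report; with the real file available,
# wav_duration_seconds("../2.wav") could be used instead.
duration = 117.0

overruns = segments_beyond(parse_diarization_text(result["text"]), duration)
for speaker, spans in overruns.items():
    print(speaker, spans)
```

With the output above, every segment reported for spk4 ends after the 117-second mark, as do the last segments for spk1 and spk2, while spk3 stays within range.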

LauraGPT commented 8 months ago

Please refer to https://github.com/alibaba-damo-academy/FunASR/issues/1073 when raising issues.

White-Friday commented 8 months ago

@T0L0ve Has this issue been resolved? I'm running into a similar problem.