modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

load_audio_text_image_video has a bug when handling audio that is 1 sample long #1970

Open viewlei opened 1 month ago

viewlei commented 1 month ago

Notice: In order to resolve issues more efficiently, please raise issues following the template.

🐛 Bug

I am running real-time (streaming) speech recognition over a batch of audio files with the streaming Paraformer model (paraformer-zh-streaming). Below is the code I used, following the template recommended on ModelScope:

# From https://www.modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online
from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5] = 600 ms, [0, 8, 4] = 480 ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")

import soundfile
import os

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600 ms at 16 kHz

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)

It works correctly on most audio files, but fails on some of them with this error:

  File "/lib/python3.8/site-packages/funasr/models/paraformer_streaming/model.py", line 600, in inference
    audio_sample = torch.cat((cache["prev_samples"], audio_sample_list[0]))
RuntimeError: zero-dimensional tensor (at position 1) cannot be concatenated
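The failure can be reproduced in isolation (a minimal sketch; `prev` stands in for `cache["prev_samples"]`): torch.cat refuses a 0-dimensional operand.

```python
import torch

prev = torch.zeros(320)    # stand-in for cache["prev_samples"]
chunk = torch.tensor(0.5)  # 0-dim tensor, as squeeze() yields on a 1-sample array

try:
    torch.cat((prev, chunk))
except RuntimeError as e:
    print(e)  # zero-dimensional tensor (at position 1) cannot be concatenated
```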

This 16 kHz mono recording has 67201 samples; the main loop splits it into 8 chunks, the first 7 of length 9600 and the 8th of length 1.
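The chunk lengths follow directly from the arithmetic (a quick check, independent of funasr):

```python
n_samples, chunk_stride = 67201, 9600  # 16 kHz audio, 600 ms chunks

# Length of each slice speech[start:start + chunk_stride]
lengths = [min(chunk_stride, n_samples - start)
           for start in range(0, n_samples, chunk_stride)]
print(lengths)  # [9600, 9600, 9600, 9600, 9600, 9600, 9600, 1]
```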

Each chunk is loaded into a tensor by the load_audio_text_image_video function in funasr/utils/load_utils.py:

data_or_path_or_list = torch.from_numpy(data_or_path_or_list).squeeze()  # [n_samples,]

This line works as expected in most cases, but when data_or_path_or_list has length 1, squeeze() removes the last remaining dimension, producing a 0-dimensional tensor and triggering the error above.
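The dimension collapse is easy to see in isolation (a minimal sketch, independent of funasr):

```python
import numpy as np
import torch

long_chunk = np.zeros(9600, dtype=np.float32)  # a normal 600 ms chunk
tiny_chunk = np.zeros(1, dtype=np.float32)     # the final 1-sample chunk

print(torch.from_numpy(long_chunk).squeeze().shape)  # torch.Size([9600])
print(torch.from_numpy(tiny_chunk).squeeze().shape)  # torch.Size([]) -- 0-dim
```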

For this recording, the 8 calls to load_audio_text_image_video produce tensors of the following shapes (screenshot not reproduced; the first seven tensors are 1-D with 9600 elements, the last is 0-dimensional).

To Reproduce

These versions may help with reproduction: torch 2.3.1, funasr 1.1.4, numpy 1.24.4.

The code, model, and third-party dependencies are all given above. To reproduce the problem, additionally: 1) prepare a 16 kHz mono WAV file containing exactly 67201 samples;

# 67202 samples
sox -n -b 16 -r 16000 output.wav synth 4.2001 sine 400
# 67201 samples
sox -n -b 16 -r 16000 output.wav synth 4.20005 sine 400
# 67200 samples
sox -n -b 16 -r 16000 output.wav synth 4.200005 sine 400
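If sox is unavailable, an equivalent 67201-sample sine clip can be generated with numpy and the standard-library wave module (a sketch; the filename is arbitrary):

```python
import wave
import numpy as np

sr = 16000
n_samples = 67201  # 7 * 9600 + 1: the final streaming chunk holds a single sample
t = np.arange(n_samples) / sr
tone = (0.5 * np.sin(2 * np.pi * 400 * t) * 32767).astype(np.int16)

with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)   # mono
    f.setsampwidth(2)   # 16-bit PCM
    f.setframerate(sr)
    f.writeframes(tone.tobytes())
```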

2) run the streaming recognition script above on it.

For this test case, the following change makes the code pass:

- data_or_path_or_list = torch.from_numpy(data_or_path_or_list).squeeze()  # [n_samples,]
+ data_or_path_or_list = torch.from_numpy(data_or_path_or_list)  # [n_samples,]

Since I don't know funasr comprehensively enough to write test cases that fully cover load_audio_text_image_video, I have not opened a PR.
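An alternative guard (not the fix proposed above; `to_1d_tensor` is a hypothetical helper) would keep the squeeze() but restore the sample axis afterwards with torch.atleast_1d, so a 1-sample chunk stays 1-D:

```python
import numpy as np
import torch

def to_1d_tensor(samples: np.ndarray) -> torch.Tensor:
    # Hypothetical helper: squeeze extra axes, but never drop below 1-D,
    # so a length-1 chunk comes back as shape [1] instead of shape [].
    return torch.atleast_1d(torch.from_numpy(samples).squeeze())

print(to_1d_tensor(np.zeros(1, dtype=np.float32)).shape)  # torch.Size([1])
```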

LauraGPT commented 1 month ago

Thanks for your feedback. We will check it.

LauraGPT commented 1 month ago

I have tested it; your suggestion is fine. Thanks! Bugfix: https://github.com/modelscope/FunASR/commit/a28de72b17105e952f226f0460be3671883a75a2