modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Real-time recognition: "vstack expects a non-empty TensorList" error #1486

Closed — Xiaomingpapapa closed this issue 6 months ago

Xiaomingpapapa commented 6 months ago

🐛 Bug

We integrated funasr into a real-time call scenario. During operation, two errors occur sporadically:

1. With `chunk_size=[0, 10, 5]`, `encoder_chunk_look_back=4`, `decoder_chunk_look_back=1`:

   ```
   segments = self.model.generate(input=frame.audio_nparray.astype(np.float32), cache=cache, is_final=True, chunk_size=[0, 10, 5], encoder_chunk_look_back=4, decoder_chunk_look_back=1)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 212, in generate
       return self.inference(input, input_len=input_len, **cfg)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 248, in inference
       res = model.inference(**batch, **kwargs)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/models/paraformer_streaming/model.py", line 536, in inference
       speech, speech_lengths = extract_fbank([audio_sample_i], data_type=kwargs.get("data_type", "sound"),
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/utils/load_utils.py", line 110, in extract_fbank
       data, data_len = frontend(data, data_len, **kwargs)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
       return self._call_impl(*args, **kwargs)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
       return forward_call(*args, **kwargs)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/frontends/wav_frontend.py", line 450, in forward
       feats, feats_lengths, _ = self.forward_lfr_cmvn(feats, feats_lengths, is_final, cache=cache)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/frontends/wav_frontend.py", line 385, in forward_lfr_cmvn
       mat, cache["lfr_splice_cache"][i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n,
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/frontends/wav_frontend.py", line 310, in apply_lfr
       LFR_outputs = torch.vstack(LFR_inputs)
   RuntimeError: vstack expects a non-empty TensorList
   ```

2. With `chunk_size=[0, 8, 4]`, `encoder_chunk_look_back=0`, `decoder_chunk_look_back=0`:

   ```
   segments = self.model.generate(input=frame.audio_nparray.astype(np.float32), cache=cache, is_final=True, chunk_size=[0, 8, 4], encoder_chunk_look_back=0, decoder_chunk_look_back=0)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 212, in generate
       return self.inference(input, input_len=input_len, **cfg)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 248, in inference
       res = model.inference(**batch, **kwargs)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/models/paraformer_streaming/model.py", line 542, in inference
       tokens_i = self.generate_chunk(speech, speech_lengths, key=key, tokenizer=tokenizer, cache=cache, frontend=frontend, **kwargs)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/models/paraformer_streaming/model.py", line 418, in generate_chunk
       encoder_out, encoder_out_lens = self.encode_chunk(speech, speech_lengths, cache=cache, is_final=kwargs.get("is_final", False))
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/models/paraformer_streaming/model.py", line 167, in encode_chunk
       encoder_out, encoder_out_lens, _ = self.encoder.forward_chunk(speech, speech_lengths, cache=cache["encoder"])
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/models/scama/encoder.py", line 437, in forward_chunk
       xs_pad = self.embed(xs_pad, cache)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
       return self._call_impl(*args, **kwargs)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
       return forward_call(*args, **kwargs)
     File "/data/miniconda3/envs/zhiming_env/lib/python3.9/site-packages/funasr/models/transformer/embedding.py", line 430, in forward
       batch_size, timesteps, input_dim = x.size()
   ValueError: not enough values to unpack (expected 3, got 1)
   ```
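Both tracebacks are consistent with the same degenerate case: a final chunk whose audio is too short to yield any fbank/LFR frames. A minimal pure-Python sketch (my own simplification, not the FunASR implementation; the stride `lfr_n=6` is an assumption taken from common Paraformer frontend defaults):

```python
import math

def lfr_output_count(num_frames: int, lfr_n: int = 6) -> int:
    """Simplified LFR frame count: stride lfr_n over num_frames fbank frames."""
    return math.ceil(num_frames / lfr_n)

# A normal 600 ms chunk (~60 fbank frames at a 10 ms hop) stacks fine, but a
# chunk producing zero fbank frames yields an empty list, and stacking an
# empty list is exactly "vstack expects a non-empty TensorList".
print(lfr_output_count(60))  # 10
print(lfr_output_count(0))   # 0

# The second traceback looks like the same degenerate chunk one step later:
# x has fewer than 3 dimensions, so the 3-way unpack of x.size() fails.
try:
    batch_size, timesteps, input_dim = (1,)
except ValueError as e:
    print(e)  # not enough values to unpack (expected 3, got 1)
```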

To Reproduce

Upstream, we use VAD to segment the real-time audio stream of the call, so what ultimately reaches funasr is an audio frame containing speech. The call parameters are:

```python
model.generate(input=frame.audio_nparray.astype(np.float32), cache=cache,
               is_final=True, chunk_size=[0, 8, 4],
               encoder_chunk_look_back=4, decoder_chunk_look_back=1)
```
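One way to defend the call site would be to skip chunks too short to featurize before invoking the model. A hypothetical sketch (`safe_generate` and the 25 ms minimum-window figure are my assumptions, not a FunASR API):

```python
import numpy as np

def safe_generate(model, audio: np.ndarray, cache: dict, sample_rate: int = 16000):
    """Skip final chunks shorter than one 25 ms fbank window (assumption)."""
    min_samples = int(sample_rate * 0.025)  # 400 samples at 16 kHz
    if audio.size < min_samples:
        return []  # nothing to featurize; avoids the empty-TensorList path
    return model.generate(input=audio.astype(np.float32), cache=cache,
                          is_final=True, chunk_size=[0, 8, 4],
                          encoder_chunk_look_back=4, decoder_chunk_look_back=1)
```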

Environment

Xiaomingpapapa commented 6 months ago

funasr_streaming_demo.zip

Attached is code that reproduces the problem, including the relevant dependencies and the corresponding recording.

Near the end of the recording, the problem above appears: RuntimeError: vstack expects a non-empty TensorList
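That the error surfaces at the tail of the recording suggests the last chunk is the short one. A possible workaround sketch (hypothetical; `pad_final_chunk` and the 16 kHz / 480 ms figures are my assumptions for `chunk_size=[0, 8, 4]`, where each chunk unit is 60 ms):

```python
import numpy as np

def pad_final_chunk(audio: np.ndarray, chunk_samples: int = 7680) -> np.ndarray:
    """Zero-pad the last chunk to a full 480 ms (8 * 60 ms at 16 kHz)."""
    if audio.size < chunk_samples:
        return np.pad(audio, (0, chunk_samples - audio.size))
    return audio
```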

LauraGPT commented 6 months ago

Sorry, your code is a bit complex. You could refer to this implementation instead: https://github.com/alibaba-damo-academy/FunASR/blob/main/runtime/python/websocket/funasr_wss_server.py

Xiaomingpapapa commented 6 months ago

Understood, I will refer to it. However, the problem occurs inside the funasr module itself and reproduces reliably, so I would like to know exactly what is causing it. Thanks.

LauraGPT commented 6 months ago

> Understood, I will refer to it. However, the problem occurs inside the funasr module itself and reproduces reliably, so I would like to know exactly what is causing it. Thanks.

If you can reproduce the problem using the example from the documentation, we will fix it. Please do not bundle in your additional integration code; we cannot follow it.