modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Cantonese recognition model fails at inference; is there a model for long audio? #1540

Open WuerLei opened 6 months ago

WuerLei commented 6 months ago

System: Ubuntu 22.04. Versions: funasr==1.0.18, modelscope==1.11.1

Inference code:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online',
    model_revision='v2.0.4',
    vad_model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    vad_model_revision="v2.0.4",
    vad_kwargs={"max_single_segment_time": 60000},
    punc_model='iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
    punc_model_revision="v2.0.4",
)

rec_result = inference_pipeline(input='./0325.wav')
print(rec_result[0])

Problem: 0325.wav is 4 minutes long and inference fails on it; with only the first 10 s of the audio, inference runs normally.

Error message:

2024-03-25 17:45:42,026 - modelscope - INFO - PyTorch version 2.1.2 Found.
2024-03-25 17:45:42,027 - modelscope - INFO - Loading ast index from /home/fresh/.cache/modelscope/ast_indexer
2024-03-25 17:45:42,082 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 9271928ad57a76e3f712e4e1331c1640 and a total number of 956 components indexed
2024-03-25 17:45:44,582 - modelscope - INFO - Use user-specified model revision: v2.0.4
2024-03-25 17:45:44,877 - modelscope - INFO - initiate model from /home/fresh/.cache/modelscope/hub/damo/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online
2024-03-25 17:45:44,878 - modelscope - INFO - initiate model from location /home/fresh/.cache/modelscope/hub/damo/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online.
2024-03-25 17:45:44,879 - modelscope - INFO - initialize model from /home/fresh/.cache/modelscope/hub/damo/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online
Notice: If you want to use whisper, please pip install -U openai-whisper
ckpt: /home/fresh/.cache/modelscope/hub/damo/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online/model.pt
2024-03-25 17:45:55,775 - modelscope - INFO - Use user-specified model revision: v2.0.4
ckpt: /home/fresh/.cache/modelscope/hub/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt
2024-03-25 17:45:56,654 - modelscope - INFO - Use user-specified model revision: v2.0.4
ckpt: /home/fresh/.cache/modelscope/hub/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/model.pt
2024-03-25 17:45:59,121 - modelscope - WARNING - No preprocessor field found in cfg.
2024-03-25 17:45:59,122 - modelscope - WARNING - No val key and type key found in preprocessor domain of configuration.json file.
2024-03-25 17:45:59,122 - modelscope - WARNING - Cannot find available config to build preprocessor at mode inference, current config: {'model_dir': '/home/fresh/.cache/modelscope/hub/damo/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online'}. trying to build by task and model information.
2024-03-25 17:45:59,122 - modelscope - WARNING - No preprocessor key ('funasr', 'auto-speech-recognition') found in PREPROCESSOR_MAP, skip building preprocessor.
rtf_avg: 2.026: 100%|██████████| 1/1 [00:05<00:00, 5.31s/it]
Traceback (most recent call last):
  File "infer_asr.py", line 12, in <module>
    rec_result = inference_pipeline(input='./0325.wav')
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/modelscope/pipelines/audio/funasr_pipeline.py", line 73, in __call__
    output = self.model(*args, **kwargs)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/modelscope/models/base/base_model.py", line 35, in __call__
    return self.postprocess(self.forward(*args, **kwargs))
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/modelscope/models/audio/funasr/model.py", line 61, in forward
    output = self.model.generate(*args, **kwargs)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/funasr/auto/auto_model.py", line 225, in generate
    return self.inference_with_vad(input, input_len=input_len, **cfg)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/funasr/auto/auto_model.py", line 349, in inference_with_vad
    results = self.inference(speech_j, input_len=None, model=model, kwargs=kwargs, **cfg)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/funasr/auto/auto_model.py", line 258, in inference
    res = model.inference(**batch, **kwargs)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/funasr/models/uniasr/model.py", line 916, in inference
    nbest_hyps = self.beam_search(
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/funasr/models/uniasr/beam_search.py", line 402, in forward
    best = self.search(running_hyps, x, x_mask=mask_enc, pre_acoustic_embeds=pre_acoustic_embeds_cur)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/funasr/models/uniasr/beam_search.py", line 306, in search
    scores, states = self.score_full(hyp, x, x_mask=x_mask, pre_acoustic_embeds=pre_acoustic_embeds)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/funasr/models/uniasr/beam_search.py", line 176, in score_full
    scores[k], states[k] = d.score(hyp.yseq, hyp.states[k], x, x_mask=x_mask, pre_acoustic_embeds=pre_acoustic_embeds)
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/funasr/models/scama/decoder.py", line 400, in score
    logp, state = self.forward_one_step(
  File "/home/fresh/miniconda3/envs/modelscope/lib/python3.8/site-packages/funasr/models/scama/decoder.py", line 434, in forward_one_step
    x = torch.cat((x, pre_acoustic_embeds), dim=-1)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 1 but got size 52 for tensor number 1 in the list.
  0%|          | 0/52 [00:02<?, ?it/s]
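The crash itself is an ordinary tensor-shape mismatch in torch.cat inside the SCAMA decoder (decoder.py, line 434): concatenating along the last dimension requires every other dimension to agree, but one input carries a single decoding step while the other carries 52 entries. That 52 matches the 52 VAD segments in the progress bar, which suggests a whole segment batch is reaching a code path that expects one segment. The following minimal sketch reproduces the same error; the shapes, including the 512 feature size, are illustrative assumptions, not the decoder's real dimensions:

import torch

# Illustrative shapes only; the real feature size inside the decoder is unknown here.
x = torch.zeros(1, 1, 512)     # a single decoding step: (batch, 1, feat)
pre = torch.zeros(1, 52, 512)  # 52 entries, one per VAD segment (assumption)
torch.cat((x, pre), dim=-1)    # cat along dim 2 requires dims 0 and 1 to agree
# RuntimeError: Sizes of tensors must match except in dimension 2.
# Expected size 1 but got size 52 for tensor number 1 in the list.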

kexul commented 5 months ago

With this model, even 10-second utterances fail for me.
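For cases like the original report, where short clips decode fine, a hedged workaround until long audio is supported is to split the file into short chunks and run the same pipeline on each chunk. This is only a sketch, not an official fix: the 10 s chunk length, the /tmp file names, and the rec[0]['text'] result key are assumptions, and fixed-length cuts can split words at chunk boundaries.

import soundfile as sf

# Reuses the inference_pipeline defined in the report above.
audio, sr = sf.read('./0325.wav')   # mono 16 kHz audio expected by the model
chunk_len = 10 * sr                 # 10-second chunks (assumed safe length)
texts = []
for i in range(0, len(audio), chunk_len):
    chunk_path = f'/tmp/chunk_{i // chunk_len}.wav'
    sf.write(chunk_path, audio[i:i + chunk_len], sr)
    rec = inference_pipeline(input=chunk_path)
    texts.append(rec[0].get('text', ''))   # 'text' key assumed from FunASR result dicts
print(''.join(texts))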