modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Error when using Qwen-Audio together with VAD #1728

zhangyucha0 commented 1 month ago

🐛 Bug

Running Qwen-Audio together with a VAD model fails with an error.

To Reproduce

  1. Run `python qwen_demo.py` (full script in the code sample below)
  2. See error
    
    2024-05-14 11:09:35,110 - modelscope - INFO - PyTorch version 2.3.0 Found.
    2024-05-14 11:09:35,110 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
    2024-05-14 11:09:35,135 - modelscope - INFO - Loading done! Current index file version is 1.14.0, with md5 7f17021ca099dd6760d43c7a9e69c36a and a total number of 976 components indexed
    Detect model requirements, begin to install it: /root/.cache/modelscope/hub/Qwen/Qwen-Audio/requirements.txt
    install model requirements successfully
    WARNING:transformers_modules.Qwen-Audio.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
    WARNING:transformers_modules.Qwen-Audio.modeling_qwen:Try importing flash-attention for faster inference...
    WARNING:transformers_modules.Qwen-Audio.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
    WARNING:transformers_modules.Qwen-Audio.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
    WARNING:transformers_modules.Qwen-Audio.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
    Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 13.09it/s]
    audio_start_id: 155163, audio_end_id: 155164, audio_pad_id: 151851.
    2024-05-14 11:09:42,213 - modelscope - WARNING - Using the master branch is fragile, please use it with caution!
    2024-05-14 11:09:42,213 - modelscope - INFO - Use user-specified model revision: master
    ckpt: /root/.cache/modelscope/hub/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt
    rtf_avg: 0.019: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.60it/s]
    0%|                                                                        | 0/1 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/root/.cache/huggingface/modules/transformers_modules/Qwen-Audio/audio.py", line 91, in load_audio
        out = run(cmd, capture_output=True, check=True).stdout
      File "/root/miniconda3/envs/funasr/lib/python3.8/subprocess.py", line 516, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ffmpeg', '-nostdin', '-threads', '0', '-i', 'tensor([-0.0001, -0.0002,  0.0007,  ...,  0.0000,  0.0000,  0.0000])', '-f', 's16le', '-ac', '1', '-acodec', 'pcm_s16le', '-ar', '16000', '-']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "qwen_demo.py", line 18, in <module>
        res = model.generate(input=audio_in, prompt=prompt, batch_size_s=0,)
      File "/root/miniconda3/envs/funasr/lib/python3.8/site-packages/funasr/auto/auto_model.py", line 248, in generate
        return self.inference_with_vad(input, input_len=input_len, **cfg)
      File "/root/miniconda3/envs/funasr/lib/python3.8/site-packages/funasr/auto/auto_model.py", line 394, in inference_with_vad
        results = self.inference(
      File "/root/miniconda3/envs/funasr/lib/python3.8/site-packages/funasr/auto/auto_model.py", line 285, in inference
        res = model.inference(**batch, **kwargs)
      File "/root/miniconda3/envs/funasr/lib/python3.8/site-packages/funasr/models/qwen_audio/model.py", line 66, in inference
        audio_info = self.tokenizer.process_audio(query)
      File "/root/.cache/huggingface/modules/transformers_modules/Qwen-Audio/tokenization_qwen.py", line 556, in process_audio
        audio = load_audio(audio_path)
      File "/root/.cache/huggingface/modules/transformers_modules/Qwen-Audio/audio.py", line 93, in load_audio
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
    RuntimeError: Failed to load audio: ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
      built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
      configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
      libavutil      56. 31.100 / 56. 31.100
      libavcodec     58. 54.100 / 58. 54.100
      libavformat    58. 29.100 / 58. 29.100
      libavdevice    58.  8.100 / 58.  8.100
      libavfilter     7. 57.100 /  7. 57.100
      libavresample   4.  0.  0 /  4.  0.  0
      libswscale      5.  5.100 /  5.  5.100
      libswresample   3.  5.100 /  3.  5.100
      libpostproc    55.  5.100 / 55.  5.100
    tensor([-0.0001, -0.0002,  0.0007,  ...,  0.0000,  0.0000,  0.0000]): No such file or directory

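From the traceback, the failure mode appears to be as follows: when a `vad_model` is configured, `inference_with_vad` passes each detected segment downstream as an in-memory torch tensor, but Qwen-Audio's `load_audio()` accepts only a file path or URL and shells out to ffmpeg with its argument as the input file, so the stringified tensor ends up as ffmpeg's `-i` argument. A minimal sketch of just that mechanism (the `cmd` list mirrors the one in the traceback; the tensor is an illustrative stand-in):

```python
import subprocess

import torch

segment = torch.zeros(16000)  # stand-in for a VAD-produced audio segment

# Qwen-Audio's load_audio() builds essentially this command; when it is
# handed a tensor instead of a path, str(segment) becomes the "input file".
cmd = ["ffmpeg", "-nostdin", "-threads", "0", "-i", str(segment),
       "-f", "s16le", "-ac", "1", "-acodec", "pcm_s16le", "-ar", "16000", "-"]

proc = subprocess.run(cmd, capture_output=True)
print(proc.returncode)  # non-zero exit status, as in the CalledProcessError above
print(proc.stderr.decode().strip().splitlines()[-1])  # "... No such file or directory"
```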


#### Code sample

`qwen_demo.py`
```python
#!/usr/bin/env python3
# -*- encoding: utf-8 -*-
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
#  MIT License  (https://opensource.org/licenses/MIT)

# To install requirements: pip3 install -U "funasr[llm]"

from funasr import AutoModel

model = AutoModel(
    model="Qwen-Audio",
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_kwargs={"max_single_segment_time": 30000},
)

audio_in = "asr_example_zh.wav"
prompt = "<|startoftranscription|><|zh|><|transcribe|><|zh|><|notimestamps|><|wo_itn|>"

res = model.generate(input=audio_in, prompt=prompt, batch_size_s=0,)
print(res)
```

#### Environment
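
#### Possible workaround

Until this is fixed in FunASR, one interim approach is to run the VAD model separately, write each detected segment to its own WAV file, and hand Qwen-Audio file paths, which its `load_audio()` can consume. The sketch below is an assumption-laden workaround, not an official fix: it assumes `soundfile` is installed and that FSMN-VAD returns segments as `[[start_ms, end_ms], ...]` under the `value` key.

```python
# Workaround sketch (unofficial): feed Qwen-Audio file paths instead of tensors.
import soundfile as sf

from funasr import AutoModel

vad = AutoModel(model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch")
qwen = AutoModel(model="Qwen-Audio")

wav_path = "asr_example_zh.wav"
prompt = "<|startoftranscription|><|zh|><|transcribe|><|zh|><|notimestamps|><|wo_itn|>"

audio, sr = sf.read(wav_path)
segments = vad.generate(input=wav_path)[0]["value"]  # assumed: [[start_ms, end_ms], ...]

results = []
for i, (start_ms, end_ms) in enumerate(segments):
    clip = audio[int(start_ms / 1000 * sr): int(end_ms / 1000 * sr)]
    clip_path = f"segment_{i}.wav"  # hypothetical temp-file naming
    sf.write(clip_path, clip, sr)
    # A file path survives Qwen-Audio's ffmpeg invocation, unlike a tensor.
    results.append(qwen.generate(input=clip_path, prompt=prompt))

print(results)
```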

LauraGPT commented 1 month ago

Ongoing.