Real-time speech recognition and VAD perform poorly

liurongjie174 commented 5 months ago

Notice: In order to resolve issues more efficiently, please raise issues following the template and fill in the relevant details.

❓ Questions and Help

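I am using FunASR to recognize real-time speech. The upstream side pushes the stream as PCM over WebSocket (WS), and each packet has a size of 234. I run recognition with the example funasr_wss_server.py, but both VAD and online recognition perform poorly. First, the VAD often returns [] as its result, which makes fun_asr_online slow. fun_asr also runs very slowly, so the real-time recognition results are pushed out with a long delay.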

Before asking:

search the issues.
search the docs.

What is your question?

Code

What have you tried?

What's your environment?

OS (e.g., Linux):
FunASR Version (e.g., 1.0.0):
ModelScope Version (e.g., 1.11.0):
PyTorch Version (e.g., 2.0.0):
How you installed funasr (pip, source):
Python version:
GPU (e.g., V100M32)
CUDA/cuDNN version (e.g., cuda11.7):
Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
Any other relevant information:

LauraGPT commented 5 months ago

You could make the packet size larger, for example, send one packet every 100 ms.
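
A minimal sketch of the re-chunking suggested above, assuming 16 kHz, 16-bit mono PCM (so 100 ms is 3,200 bytes) and assuming the packet size of 234 in the question is in bytes. Neither is stated in the issue. The `PcmRechunker` class is hypothetical and not part of FunASR; it only accumulates small incoming packets and hands back fixed 100 ms chunks that could then be passed to the VAD.

```python
class PcmRechunker:
    """Accumulate small PCM packets and emit fixed-duration chunks.

    Defaults assume 16 kHz, 16-bit mono PCM (an assumption, not stated in the issue).
    """

    def __init__(self, sample_rate=16000, sample_width=2, chunk_ms=100):
        # Bytes per chunk = samples per second * bytes per sample * duration
        self.chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
        self._buffer = bytearray()

    def feed(self, packet: bytes) -> list:
        """Add one incoming packet and return every complete chunk now available."""
        self._buffer.extend(packet)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[:self.chunk_bytes]))
            del self._buffer[:self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return the remaining partial chunk (call once when the stream ends)."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

In a WebSocket handler, each received binary message would go through `feed()`, and only the returned chunks would be forwarded to the recognizer, so the model is called once per 100 ms of audio rather than once per small packet.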

liurongjie174 commented 4 months ago

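Thank you for your answer. I now receive about 3 s of audio at a time, a binary payload of a little over 30,000, and then run async_vad on it, but the result still comes back very slowly. I looked at the async_vad source code: it calls generate on model_vad to obtain segments_result, and only when that return value has length 1 and its start and end are not 0 does it go on to the subsequent online or offline recognition. In repeated tests, however, I found that this condition is hard to satisfy, or takes a long wait. For example, when a whole passage is continuous speech, it can take around 1-2 minutes. I also do not understand what the segments_result return value means. Could you please clear up my confusion?
1. What does the segments_result return value mean?
2. Why does it take so long for segments_result to have length 1 with start and end not 0?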
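
The check described above can be sketched as follows. This is a simplified illustration, not the code from funasr_wss_server.py. `parse_vad_segments` is a hypothetical helper, and it assumes that the streaming VAD marks a boundary it has not detected yet with -1 (that is, `[[beg, -1]]` when speech starts, `[[-1, end]]` when it ends, `[[beg, end]]` when both fall in the same chunk, times in milliseconds), rather than 0 as written above. Under that assumption, an empty list would simply mean that no speech boundary fell inside the current chunk, which could explain why [] is returned for most chunks during long continuous speech.

```python
NOT_DETECTED = -1  # assumed sentinel for "boundary not seen in this chunk"


def parse_vad_segments(segments):
    """Map one streaming-VAD result to (speech_start, speech_end).

    `segments` is assumed to be a list of [beg, end] pairs.
    Returns NOT_DETECTED for any boundary that was not reported.
    """
    speech_start = NOT_DETECTED
    speech_end = NOT_DETECTED

    # Only a single segment is interpreted; empty or multi-segment
    # results are treated as "no boundary" in this sketch.
    if len(segments) != 1:
        return speech_start, speech_end

    beg, end = segments[0]
    if beg != NOT_DETECTED:
        speech_start = beg
    if end != NOT_DETECTED:
        speech_end = end
    return speech_start, speech_end
```

With this reading, a caller would start online decoding once `speech_start` is set and run the offline pass once `speech_end` is set, instead of waiting for both values to appear in the same result.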

AMAG-AB commented 4 months ago

Real-time mode is indeed hard to use.
