modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Streaming model carries over history from a previous stream #1517

Closed FirstDiscoverer closed 6 months ago

FirstDiscoverer commented 7 months ago

🐛 Bug

With the streaming model, when the first audio stream finishes with is_final=False on its last chunk, predictions on a second audio stream are prefixed with text from the first stream that had not yet been emitted. In theory the second stream uses a fresh cache containing nothing from the first stream's cache, so even though the first stream ended with is_final=False, its leftover output should not carry over into the second stream.

To Reproduce

Code sample

import os
from unittest import TestCase

import soundfile
from funasr import AutoModel

class StreamModelTest(TestCase):

    def test(self):
        chunk_size = [0, 10, 5]  # [0, 10, 5] 600ms, [0, 8, 4] 480ms
        encoder_chunk_look_back = 4  # number of chunks to lookback for encoder self-attention
        decoder_chunk_look_back = 1  # number of encoder chunks to lookback for decoder cross-attention

        model = AutoModel(model="paraformer-zh-streaming")

        wav_file = os.path.join(model.model_path, "example/asr_example.wav")
        speech, sample_rate = soundfile.read(wav_file)
        chunk_stride = chunk_size[1] * 960  # 600ms
        speech = speech[: chunk_stride * 4]

        cache = {}
        final_res_1 = ''
        total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
        for i in range(total_chunk_num):
            speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
            # is_final = i == total_chunk_num - 1
            is_final = False
            res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size,
                                 encoder_chunk_look_back=encoder_chunk_look_back,
                                 decoder_chunk_look_back=decoder_chunk_look_back)
            assert len(res) == 1
            final_res_1 += res[0]['text']
            print(final_res_1)

        print(f"final_res_1: {final_res_1}")

        cache = {}
        final_res_2 = ''
        for i in range(total_chunk_num):
            speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
            # is_final = i == total_chunk_num - 1
            is_final = False
            res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size,
                                 encoder_chunk_look_back=encoder_chunk_look_back,
                                 decoder_chunk_look_back=decoder_chunk_look_back)
            assert len(res) == 1
            final_res_2 += res[0]['text']
            print(final_res_2)
        print(f"final_res_2: {final_res_2}")
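For reference, the chunk arithmetic in the reproduction can be checked standalone on a dummy signal (a minimal sketch, assuming 16 kHz audio as in the model's bundled example file, so each `chunk_size[1]` unit of 960 samples corresponds to 60 ms and a full chunk to 600 ms):

```python
# Chunk arithmetic from the reproduction above, checked on a dummy signal.
chunk_size = [0, 10, 5]
chunk_stride = chunk_size[1] * 960  # 9600 samples = 600 ms at 16 kHz

speech = list(range(chunk_stride * 4))  # dummy signal covering exactly 4 chunks

# Number of chunks needed to cover the signal. Note the parenthesization:
# (len(speech) - 1), not len(speech - 1).
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
print(total_chunk_num)  # → 4

# Slicing by stride covers the whole signal with no overlap and no gaps.
chunks = [speech[i * chunk_stride:(i + 1) * chunk_stride]
          for i in range(total_chunk_num)]
assert sum(len(c) for c in chunks) == len(speech)
```

This is only the slicing logic; whether leftover text is flushed at the end of a stream is governed by the `is_final` flag passed to `model.generate`, as in the commented-out `is_final = i == total_chunk_num - 1` line above.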

Expected behavior

Expected output

final_res_1: 欢迎大家来
final_res_2: 欢迎大家来

Actual output

final_res_1: 欢迎大家来
final_res_2: 体验欢迎大家来

Environment

OS: Ubuntu 22.04.3 LTS x86_64
Python 3.10.13
pip install torch==2.2.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118

funasr==1.0.17
modelscope==1.13.1

model_id = 'iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online'
model_revision = "v2.0.4"
LauraGPT commented 6 months ago

The bug has been fixed. See https://github.com/alibaba-damo-academy/FunASR/issues/1622