tobiashuttinger / openai-whisper-realtime

A quick experiment to achieve almost realtime transcription using Whisper.
MIT License

its very very slow.. #3

Open yslion opened 1 year ago

yslion commented 1 year ago

RTX 3060, 4G. After I finish speaking, the text shows up 3-5 minutes later. What's wrong?

tobiashuttinger commented 1 year ago

This could indicate that the very basic voice activity detection isn't working correctly. Try playing with the SILENCE_THRESHOLD and SILENCE_RATIO variables.

Alternatively, there are more sophisticated implementations of this idea now, for example https://github.com/shirayu/whispering
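To try out values for SILENCE_THRESHOLD and SILENCE_RATIO without a microphone, here is a rough sketch of the first check the script applies to each block (a block is dropped when too few samples are louder than the threshold). The helper names, the `margin` factor, and the synthetic sample values are all assumptions made for illustration; they are not part of the repository.

```python
# Hypothetical helpers, not part of the repository. They mimic the
# "mostly silence" check so candidate values can be tried offline.

def count_loud_samples(samples, threshold):
    """Number of samples whose absolute amplitude exceeds the threshold."""
    return sum(1 for s in samples if abs(s) > threshold)


def is_mostly_silence(samples, threshold, ratio):
    """True when fewer than `ratio` samples are louder than `threshold`."""
    return count_loud_samples(samples, threshold) < ratio


def suggest_threshold(noise_samples, margin=1.5):
    """Assumed heuristic: peak ambient amplitude times a safety margin."""
    return int(max(abs(s) for s in noise_samples) * margin)


# Synthetic blocks standing in for int16 microphone data (made-up values).
room_noise = [120, -340, 15, -600, 280] * 200    # quiet room, peaks at 600
speech = [4000, -5200, 3100, -2500, 6100] * 200  # clearly louder than the noise

threshold = suggest_threshold(room_noise)
print(threshold)
print(is_mostly_silence(room_noise, threshold, ratio=50))
print(is_mostly_silence(speech, threshold, ratio=50))
```

With real audio, the noise block would come from a second or so of recording while nobody speaks; a threshold that sits just above the room noise should let the script notice when speech has ended.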

LikeGiver commented 1 year ago

The default params don't work for me either, so I debugged and changed them. I found that SILENCE_THRESHOLD is really important; whether it works depends on how noisy your environment and your microphone are. In my case I set SILENCE_THRESHOLD to 1000, so that it doesn't keep concatenating the arrays because it thinks I am still speaking when I actually am not. SILENCE_RATIO is also important, and I cut it in half to suit me. I also imported PyTorch, because I don't know whether the model recognizes the input type and processes the data differently depending on it, so I changed the data type from numpy.array to torch.tensor in the hope of improving processing speed.

import sounddevice as sd
import numpy as np
import torch

import whisper

import asyncio
import queue
import sys

# SETTINGS
MODEL_TYPE="base"
# the model used for transcription. https://github.com/openai/whisper#available-models-and-languages
LANGUAGE="English"
# pre-set the language to avoid autodetection
BLOCKSIZE=16000
# this is the base chunk size the audio is split into in samples. blocksize / 16000 = chunk length in seconds. 
SILENCE_THRESHOLD=1000
# should be set to the lowest sample amplitude that the speech in the audio material has
SILENCE_RATIO=50
# a buffer is discarded as silence unless at least this many samples exceed the threshold

global_ndarray = None
model = whisper.load_model(MODEL_TYPE)

async def inputstream_generator():
    """Generator that yields blocks of input data as NumPy arrays."""
    q_in = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def callback(indata, frame_count, time_info, status):
        loop.call_soon_threadsafe(q_in.put_nowait, (indata.copy(), status))

    stream = sd.InputStream(samplerate=16000, channels=1, dtype='int16', blocksize=BLOCKSIZE, callback=callback)
    with stream:
        while True:
            indata, status = await q_in.get()
            yield indata, status

async def process_audio_buffer():
    global global_ndarray
    async for indata, status in inputstream_generator():

        # widen to int32 first: abs(-32768) overflows in int16
        indata_flattened = np.abs(indata.flatten().astype(np.int32))

        # discard buffers that contain mostly silence
        if np.count_nonzero(indata_flattened > SILENCE_THRESHOLD) < SILENCE_RATIO:
            continue

        if (global_ndarray is not None):
            global_ndarray = np.concatenate((global_ndarray, indata), dtype='int16')
        else:
            global_ndarray = indata

        # concatenate buffers if the end of the current buffer is not silent
        if (np.average((indata_flattened[-100:-1])) > SILENCE_THRESHOLD):
            continue
        else:
            local_ndarray = global_ndarray.copy()
            global_ndarray = None
            indata_transformed = local_ndarray.flatten().astype(np.float32) / 32768.0
            indata_transformed_tensor = torch.tensor(indata_transformed)
            result = model.transcribe(indata_transformed_tensor, language=LANGUAGE)
            if result["text"] != "":
                print(result["text"])

        del local_ndarray
        del indata_flattened

async def main():
    print('\nActivating wire ...\n')
    audio_task = asyncio.create_task(process_audio_buffer())
    # Await the task so exceptions raised inside it surface here;
    # Ctrl+C is handled by the KeyboardInterrupt handler below.
    await audio_task

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        sys.exit('\nInterrupted by user')

I think the result is okay for me, with one small issue: when I am typing, it outputs many "Thank you" lines because it thinks I am speaking.

(minigpt4) likegiver@likegiver-OMEN-by-HP-Laptop-17-ck1xxx:~/Desktop/codes/2023_10/whisper$ /home/likegiver/anaconda3/envs/minigpt4/bin/python /home/likegiver/Desktop/codes/2023_10/whisper/openai-whisper-realtime/openai-whisper-realtime.py

Activating wire ...

 Hello, I'm going to give a speech.
 Well, today's Wednesday is...
 really good you know.
 You might
 you are looking for a new place to live near your university.
 You have two choice.
 The first place is in the house, which is
 you with Xiao Wei several others.
 students.
 The second place is the sport apartment.
 which you would not have to see always.
 He's сейчас.
 which you would not have to share with others.
 It's good.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
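
The repeated "Thank you." lines above are likely Whisper producing text for non-speech audio such as keyboard noise. One possible mitigation, sketched here under assumptions, is to drop segments that the model itself rates as probably not speech: the dict returned by `model.transcribe` contains a `segments` list, and each segment carries a `no_speech_prob` value. The helper name `speech_text`, the 0.6 cutoff, and the stand-in result below are made up for illustration, and whether this removes the typing artifacts in a given setup is untested.

```python
def speech_text(result, max_no_speech_prob=0.6):
    """Join the text of segments whose no_speech_prob is at or below the cutoff."""
    kept = [
        seg["text"].strip()
        for seg in result.get("segments", [])
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]
    return " ".join(kept)


# Hand-written stand-in for a transcribe() result; the values are made up.
fake_result = {
    "text": " Hello, I'm going to give a speech. Thank you.",
    "segments": [
        {"text": " Hello, I'm going to give a speech.", "no_speech_prob": 0.02},
        {"text": " Thank you.", "no_speech_prob": 0.91},
    ],
}

print(speech_text(fake_result))
```

In the script, this would mean printing `speech_text(result)` when it is non-empty instead of printing `result["text"]` directly. Raising SILENCE_THRESHOLD so that key clicks never reach the model is the other obvious lever.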