Open yslion opened 1 year ago
This could indicate that the very basic voice activity detection isn't working correctly. Try playing with the SILENCE_THRESHOLD and SILENCE_RATIO variables.
Alternatively, there are more sophisticated implementations of this idea now, for example https://github.com/shirayu/whispering
The default params don't work for me either, so I debugged and changed them. I found that SILENCE_THRESHOLD is really important: whether it works or not depends on how noisy your environment and your microphone are. In my situation I set SILENCE_THRESHOLD to 1000, so that it doesn't keep concatenating the arrays because it thinks I am still speaking when I actually am not. SILENCE_RATIO is also important, and I cut it in half to suit my setup. I also imported PyTorch: I don't know whether the model recognizes the input type and processes the data differently depending on it, so I changed the data type from numpy.array to torch.tensor in the hope of improving processing speed.
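To make the two settings concrete: the script drops a block when fewer than SILENCE_RATIO samples exceed SILENCE_THRESHOLD in absolute value. Below is a minimal sketch of that gate in plain Python, together with a hypothetical helper (`suggest_silence_threshold` is not part of the script) that picks the smallest threshold that would still discard a block of pure background noise. The sample values are made up for illustration.

```python
def is_mostly_silence(samples, threshold, ratio):
    """Mirror of the script's gate: True when fewer than `ratio` samples
    have an absolute amplitude above `threshold`."""
    loud = sum(1 for s in samples if abs(s) > threshold)
    return loud < ratio


def suggest_silence_threshold(noise_samples, ratio):
    """Hypothetical helper: smallest threshold for which a noise-only block
    is still classified as silence by is_mostly_silence()."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    magnitudes = sorted((abs(s) for s in noise_samples), reverse=True)
    if len(magnitudes) < ratio:
        return 0  # too few samples to ever reach `ratio` loud ones
    # The ratio-th largest magnitude: at most ratio - 1 samples exceed it.
    return magnitudes[ratio - 1]


# Made-up int16-style values standing in for a block of keyboard or fan noise.
noise = [3, -7, 2, 10, -1, 0, 5]
threshold = suggest_silence_threshold(noise, ratio=2)
print(threshold, is_mostly_silence(noise, threshold, ratio=2))
```

In practice you would probably add some headroom on top of the suggested value, since background noise varies from block to block.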
import sounddevice as sd
import numpy as np
import torch
import whisper
import asyncio
import queue
import sys

# SETTINGS
MODEL_TYPE="base"
# the model used for transcription. https://github.com/openai/whisper#available-models-and-languages
LANGUAGE="English"
# pre-set the language to avoid autodetection
BLOCKSIZE=16000
# this is the base chunk size the audio is split into in samples. blocksize / 16000 = chunk length in seconds.
SILENCE_THRESHOLD=1000
# should be set to the lowest sample amplitude that the speech in the audio material has
SILENCE_RATIO=50
# number of samples in one buffer that are allowed to be higher than threshold

global_ndarray = None
model = whisper.load_model(MODEL_TYPE)

async def inputstream_generator():
    """Generator that yields blocks of input data as NumPy arrays."""
    q_in = asyncio.Queue()
    loop = asyncio.get_event_loop()

    def callback(indata, frame_count, time_info, status):
        loop.call_soon_threadsafe(q_in.put_nowait, (indata.copy(), status))

    stream = sd.InputStream(samplerate=16000, channels=1, dtype='int16', blocksize=BLOCKSIZE, callback=callback)
    with stream:
        while True:
            indata, status = await q_in.get()
            yield indata, status

async def process_audio_buffer():
    global global_ndarray
    async for indata, status in inputstream_generator():
        indata_flattened = abs(indata.flatten())

        # discard buffers that contain mostly silence
        if(np.asarray(np.where(indata_flattened > SILENCE_THRESHOLD)).size < SILENCE_RATIO):
            continue

        if (global_ndarray is not None):
            global_ndarray = np.concatenate((global_ndarray, indata), dtype='int16')
        else:
            global_ndarray = indata

        # concatenate buffers if the end of the current buffer is not silent
        if (np.average((indata_flattened[-100:-1])) > SILENCE_THRESHOLD):
            continue
        else:
            local_ndarray = global_ndarray.copy()
            global_ndarray = None
            indata_transformed = local_ndarray.flatten().astype(np.float32) / 32768.0
            indata_transformed_tensor = torch.tensor(indata_transformed)
            result = model.transcribe(indata_transformed_tensor, language=LANGUAGE)
            if result["text"] != "":
                print(result["text"])

        del local_ndarray
        del indata_flattened

async def main():
    print('\nActivating wire ...\n')
    audio_task = asyncio.create_task(process_audio_buffer())
    while True:
        await asyncio.sleep(1)
    audio_task.cancel()
    try:
        await audio_task
    except asyncio.CancelledError:
        print('\nwire was cancelled')

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        sys.exit('\nInterrupted by user')
I think the result is okay for me, with one small issue: when I am typing, it outputs many "Thank you" lines because it thinks I am speaking.
(minigpt4) likegiver@likegiver-OMEN-by-HP-Laptop-17-ck1xxx:~/Desktop/codes/2023_10/whisper$ /home/likegiver/anaconda3/envs/minigpt4/bin/python /home/likegiver/Desktop/codes/2023_10/whisper/openai-whisper-realtime/openai-whisper-realtime.py
Activating wire ...
Hello, I'm going to give a speech.
Well, today's Wednesday is...
really good you know.
You might
you are looking for a new place to live near your university.
You have two choice.
The first place is in the house, which is
you with Xiao Wei several others.
students.
The second place is the sport apartment.
which you would not have to see always.
He's сейчас.
which you would not have to share with others.
It's good.
Thank you.
Thank you.
Thank you.
Thank you.
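The repeated "Thank you." lines are typical of Whisper being handed a buffer that contains noise but no speech. Besides raising the thresholds, one option is to look at the per-segment fields that `model.transcribe` returns (`no_speech_prob` and `avg_logprob` in each entry of `result["segments"]`) and drop segments that look like non-speech. A minimal sketch, assuming that result layout; the helper name and the cutoff values are my own choices and would need tuning:

```python
def confident_text(result, max_no_speech_prob=0.6, min_avg_logprob=-1.0):
    """Join the text of segments that look like real speech.

    `result` is the dict returned by model.transcribe(). The cutoffs are
    illustrative assumptions, not values taken from the script above.
    """
    kept = []
    for segment in result.get("segments", []):
        if segment["no_speech_prob"] > max_no_speech_prob:
            continue  # the model itself suspects there is no speech here
        if segment["avg_logprob"] < min_avg_logprob:
            continue  # low-confidence decoding, often a hallucination
        kept.append(segment["text"].strip())
    return " ".join(kept)


# Hand-written stand-in for a transcribe() result, for illustration only.
fake_result = {
    "text": " It's good. Thank you.",
    "segments": [
        {"text": " It's good.", "no_speech_prob": 0.05, "avg_logprob": -0.3},
        {"text": " Thank you.", "no_speech_prob": 0.85, "avg_logprob": -0.9},
    ],
}
print(confident_text(fake_result))
```

In the script, the `print(result["text"])` call could then print `confident_text(result)` instead, whenever it is non-empty.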
RTX 3060 4G: after I finish speaking, the text only shows up 3-5 minutes later. What's wrong?
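One possible cause (a guess, not something confirmed in this thread): the script only transcribes once the tail of a block is quieter than SILENCE_THRESHOLD, so with steady background noise the buffer keeps growing and Whisper is eventually handed a very long recording at once. It is also worth checking that `torch.cuda.is_available()` returns True, since otherwise the model runs on the CPU. Below is a minimal sketch of capping the buffer, in plain Python with lists standing in for the NumPy blocks; `UtteranceBuffer` and `max_blocks` are hypothetical names, not part of the script.

```python
class UtteranceBuffer:
    """Collects audio blocks and decides when to hand them to the model."""

    def __init__(self, silence_threshold, max_blocks, tail=100):
        self.silence_threshold = silence_threshold
        self.max_blocks = max_blocks  # hard cap on buffered blocks
        self.tail = tail              # samples inspected at the block's end
        self.blocks = []

    def add(self, block):
        """Append a block; return the buffered samples when it is time to
        transcribe, otherwise None."""
        if len(block) == 0:
            return None
        self.blocks.append(list(block))
        tail = [abs(s) for s in block[-self.tail:]]
        tail_is_quiet = sum(tail) / len(tail) <= self.silence_threshold
        if tail_is_quiet or len(self.blocks) >= self.max_blocks:
            samples = [s for b in self.blocks for s in b]
            self.blocks = []
            return samples
        return None


buf = UtteranceBuffer(silence_threshold=1000, max_blocks=3, tail=2)
loud = [5000, -5000, 4000, -4000]
print(buf.add(loud))        # None: the tail is loud, keep buffering
print(buf.add(loud))        # None
print(len(buf.add(loud)))   # 12: cap reached, flushed anyway
```

With BLOCKSIZE=16000 at a 16 kHz sample rate each block is one second long, so `max_blocks` is roughly the maximum number of seconds buffered before transcription is forced.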