mwv / vad

Voice Activity Detector

Convert audio signal buffer to ndarray #5

Closed jhoelzl closed 7 years ago

jhoelzl commented 7 years ago

I want to integrate this VAD into the SpeechRecognition Module.

I have this buffer variable and want to convert it to an ndarray so that I can use it as input for the VAD:

from vad import VAD
import numpy as np

FS_RATE = 16000

detector = VAD(fs=FS_RATE)
....
buffer = source.stream.read(source.CHUNK)
...

signal = np.fromstring(buffer, dtype=np.int16)
signal = signal.astype(np.float32)

speech = detector.detect_speech(signal, FS_RATE)
....

However, I get an error when calling the detect_speech() function:

('Unexpected error:', <type 'exceptions.TypeError'>)

Any suggestions? Thanks!

mwv commented 7 years ago

Can you provide an example array for which this fails?

Also note that detect_speech does not take the sampling rate as a parameter.
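
For reference, the conversion step in the question can be sketched on its own, without the VAD. This is only a minimal sketch: it assumes the buffer holds mono, signed 16-bit samples in native byte order, and the helper name pcm16_to_float32 is hypothetical. It uses np.frombuffer because np.fromstring is deprecated for binary input in newer NumPy releases.

```python
import numpy as np


def pcm16_to_float32(buffer):
    """Interpret raw bytes as signed 16-bit samples and return a float32 array.

    Hypothetical helper; assumes mono audio in native byte order.
    """
    # frombuffer returns a read-only view onto the bytes, without copying.
    samples = np.frombuffer(buffer, dtype=np.int16)
    # astype makes a copy, so the returned array is writable.
    return samples.astype(np.float32)


# Stand-in for source.stream.read(source.CHUNK): four 16-bit samples as bytes.
raw = np.array([0, -105, -45, 26], dtype=np.int16).tobytes()
signal = pcm16_to_float32(raw)
print(signal.dtype, signal.shape)
```

The resulting array would then be passed to detect_speech without the sampling rate, for example detector.detect_speech(signal).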

jhoelzl commented 7 years ago

Hello @mwv,

"Also note that detect_speech does not take the sampling rate as a parameter."

Okay, then the documentation in your README is not correct.

Here is my example code. I basically extended the listen() function in the SpeechRecognition Module with your VAD Code.


from vad import VAD
import numpy as np

FS_RATE = 16000

mwv_vad = VAD(fs=FS_RATE)
....
buffer = source.stream.read(source.CHUNK)
...

signal = np.fromstring(buffer, dtype=np.int16)
signal = signal.astype(np.float32)

print(signal)
print(type(signal))
print(signal.shape)

result = mwv_vad.detect_speech(signal)
...

I get an error when calling the function mwv_vad.detect_speech():

('Unexpected error:', <type 'exceptions.TypeError'>)

The print statements return this information:

print(signal):

[ 0. 0. 0. ..., -105. -45. 26.]

print(type(signal)):

<type 'numpy.ndarray'>

print(signal.shape):

(1024,)

Regards, Josef

mwv commented 7 years ago

I updated the README. Thanks for letting me know.

About the signal array: what is the dtype? Can you post the full stacktrace? Could you also add a link to a minimal array for which you get this error? I have not seen this before.

jhoelzl commented 7 years ago

Hello,

When I try the script again with my pull request that you merged, I get this error:

Traceback (most recent call last):
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/my_speechcore/speech_recognition/__init__.py", line 599, in listen
    result = mwv_vad.detect_speech(signal,threshold=0.5)
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/vad/_vad.py", line 97, in detect_speech
    return self.activations(sig, n_noise_frames) > threshold
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/vad/_vad.py", line 139, in activations
    frame = frames[n]
IndexError: index 3 is out of bounds for axis 0 with size 3

The signal dtype is np.float32 after the conversion:

signal = np.fromstring(buffer, dtype=np.int16)
signal = signal.astype(np.float32)

jhoelzl commented 7 years ago

In the function activations(), I added some print statements to inspect the data shapes:

...
frames = self.stft(sig)
n_frames = frames.shape[0]

print(frames.shape)
print(n_noise_frames)

noise_var_tmp = zeros(self.NFFT//2+1)

for n in sm.range(n_noise_frames):
    print(n)
    frame = frames[n]
    noise_var_tmp = noise_var_tmp + (conj(frame) * frame).real
...

Then I got this in the terminal:

(3, 1025)
20
0
1
2
3
Traceback (most recent call last):
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/my_speechcore/speech_recognition/__init__.py", line 599, in listen
    result = mwv_vad.detect_speech(signal,threshold=0.5)
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/vad/_vad.py", line 97, in detect_speech
    return self.activations(sig, n_noise_frames) > threshold
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/vad/_vad.py", line 141, in activations
    frame = frames[n]
IndexError: index 3 is out of bounds for axis 0 with size 3

So it seems to me that the size of the frames variable is not correct?

I pass the signal with shape (1024,) into the function stft() and it returns the Fourier transform with shape (3, 1025).
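
The failure can be reproduced without the VAD. The loop runs n_noise_frames = 20 times, but frames has only 3 rows, so frames[3] is out of bounds. Below is a rough sketch of the same accumulation with an explicit length check. The helper name summed_noise_power is hypothetical and this is not the library's actual code; it only mirrors the loop shown above.

```python
import numpy as np


def summed_noise_power(frames, n_noise_frames):
    """Sum the power spectra of the first n_noise_frames rows of frames.

    Hypothetical helper that mirrors the loop in activations(), but checks
    the number of available frames before indexing into them.
    """
    n_frames = frames.shape[0]
    if n_frames < n_noise_frames:
        raise ValueError(
            "need at least %d frames for the noise estimate, got %d"
            % (n_noise_frames, n_frames)
        )
    head = frames[:n_noise_frames]
    # conj(frame) * frame is the squared magnitude of each frequency bin.
    return (np.conj(head) * head).real.sum(axis=0)


# Same shape as reported in the thread: 3 frames, 1025 frequency bins.
frames = np.ones((3, 1025), dtype=complex)

try:
    summed_noise_power(frames, 20)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

With n_noise_frames=3 the same call succeeds, because it no longer asks for more frames than the short signal provides.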

mwv commented 7 years ago

Thanks for clarifying. The signal you supply is too short to take a noise imprint. By default, the first 20 frames of the signal are used to get an initial estimate of the noise. With the standard settings for window size, hop and sampling rate, this corresponds approximately to the first second of the signal. You can decrease the number of frames used for the noise sample, but you probably just want to pass in a longer signal. Voice activity detection doesn't mean much on a signal of 0.064 seconds. I have added an exception that should make the error you got clearer.
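
The arithmetic behind the 0.064 seconds, and one way to collect a longer signal from a stream before calling the detector, can be sketched as follows. FS_RATE and the chunk size come from the thread. MIN_SECONDS is an arbitrary assumption chosen to sit comfortably above the roughly one second needed for the noise estimate, and the helper names are hypothetical.

```python
import math

import numpy as np

FS_RATE = 16000     # sampling rate used in the thread
CHUNK = 1024        # samples per read, matching the reported shape (1024,)
MIN_SECONDS = 2.0   # assumption: arbitrary target length, not a library default


def chunk_duration(n_samples, fs):
    """Duration in seconds of n_samples at sampling rate fs."""
    return n_samples / float(fs)


def chunks_needed(min_seconds, chunk, fs):
    """Number of chunk-sized reads needed to cover at least min_seconds."""
    return int(math.ceil(min_seconds * fs / float(chunk)))


def accumulate(raw_chunks):
    """Join raw 16-bit PCM byte chunks into one float32 signal."""
    parts = [np.frombuffer(c, dtype=np.int16) for c in raw_chunks]
    return np.concatenate(parts).astype(np.float32)


n_reads = chunks_needed(MIN_SECONDS, CHUNK, FS_RATE)
# Stand-in for repeated source.stream.read(source.CHUNK) calls: silent chunks.
raw_chunks = [b"\x00\x00" * CHUNK for _ in range(n_reads)]
signal = accumulate(raw_chunks)

print(chunk_duration(CHUNK, FS_RATE))  # one chunk: 1024 / 16000 seconds
print(n_reads, signal.shape)
```

A single chunk of 1024 samples at 16000 Hz lasts 1024 / 16000 = 0.064 seconds, which is why one read is far too short. The accumulated signal would then be passed to detect_speech in place of the single chunk.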