Closed jhoelzl closed 7 years ago
Can you provide an example array for which this fails?
Also note that detect_speech does not take the sampling rate as a parameter.
Hello @mwv ,
Also note that detect_speech does not take the sampling rate as a parameter.
Okay, then the documentation in your README is not correct.
Here is my example code. I basically extended the listen() function in the SpeechRecognition module with your VAD code.
from vad import VAD
import numpy as np
FS_RATE = 16000
mwv_vad = VAD(fs=FS_RATE)
....
buffer = source.stream.read(source.CHUNK)
...
signal = np.fromstring(buffer, dtype=np.int16)
signal = signal.astype(np.float32)
print(signal)
print(type(signal))
print(signal.shape)
result = mwv_vad.detect_speech(signal)
...
I get an error when calling the function mwv_vad.detect_speech():
('Unexpected error:', <type 'exceptions.TypeError'>)
The print statements return this information:
print(signal): [ 0. 0. 0. ..., -105. -45. 26.]
print(type(signal)): <type 'numpy.ndarray'>
print(signal.shape): (1024,)
Regards, Josef
I updated the README. Thanks for letting me know.
About the signal array: what is the dtype? Can you post the full stacktrace? Could you also add a link to a minimal array for which you get this error? I have not seen this before.
Hello,
When I try the script again with my pull request merged, I get this error:
Traceback (most recent call last):
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/my_speechcore/speech_recognition/__init__.py", line 599, in listen
    result = mwv_vad.detect_speech(signal,threshold=0.5)
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/vad/_vad.py", line 97, in detect_speech
    return self.activations(sig, n_noise_frames) > threshold
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/vad/_vad.py", line 139, in activations
    frame = frames[n]
IndexError: index 3 is out of bounds for axis 0 with size 3
The signal dtype is np.float32 after conversion:
signal = np.fromstring(buffer, dtype=np.int16)
signal = signal.astype(np.float32)
In the function activations() I added some print statements to inspect the data shapes:
...
frames = self.stft(sig)
n_frames = frames.shape[0]
print(frames.shape)
print(n_noise_frames)
noise_var_tmp = zeros(self.NFFT//2+1)
for n in sm.range(n_noise_frames):
    print(n)
    frame = frames[n]
    noise_var_tmp = noise_var_tmp + (conj(frame) * frame).real
...
Then I got this in the terminal:
(3, 1025)
20
0
1
2
3
Traceback (most recent call last):
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/my_speechcore/speech_recognition/__init__.py", line 599, in listen
    result = mwv_vad.detect_speech(signal,threshold=0.5)
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/vad/_vad.py", line 97, in detect_speech
    return self.activations(sig, n_noise_frames) > threshold
  File "/media/myuser/data/slt/venv/local/lib/python2.7/site-packages/vad/_vad.py", line 141, in activations
    frame = frames[n]
IndexError: index 3 is out of bounds for axis 0 with size 3
So it seems to me that the size of the frames variable is not correct?
I pass the signal with shape (1024,) into the function stft() and it returns the Fourier transform with shape (3, 1025).
Thanks for clarifying. The signal you supply is too short to take a noise imprint. By default, the first 20 frames of the signal are used to get an initial estimate of the noise. With the standard settings for window size, hop and sampling rate, this corresponds approximately to the first second of the signal. You can decrease the number of frames used for the noise sample, but you probably just want to pass in a longer signal. Voice activity detection doesn't mean much on a signal of 0.064 seconds (1024 samples at 16 kHz). I have added an exception that should make the error you got clearer.
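One way to follow the "pass in a longer signal" advice is to buffer several chunks before calling the detector. The following is only a rough sketch, not part of the vad package: collect_signal, read_chunk, fake_read and the 2-second target are hypothetical names and values chosen for illustration, with read_chunk standing in for source.stream.read(source.CHUNK). It deliberately stops short of calling detect_speech.

```python
import numpy as np

FS_RATE = 16000      # sampling rate from the example code in this thread
CHUNK = 1024         # samples per read, matching the reported shape (1024,)
MIN_SECONDS = 2.0    # assumed target length; not a value required by the library


def collect_signal(read_chunk, fs=FS_RATE, min_seconds=MIN_SECONDS):
    """Call read_chunk() until at least min_seconds of audio is buffered.

    read_chunk is a hypothetical callable returning raw 16-bit PCM bytes.
    """
    needed = int(fs * min_seconds)
    parts = []
    total = 0
    while total < needed:
        chunk = np.frombuffer(read_chunk(), dtype=np.int16)
        if chunk.size == 0:
            break  # stream ended early; return what was collected so far
        parts.append(chunk)
        total += chunk.size
    if not parts:
        return np.zeros(0, dtype=np.float32)
    # Join all chunks into one array and convert to float32 as in the thread.
    return np.concatenate(parts).astype(np.float32)


def fake_read():
    """Stand-in for a microphone stream: CHUNK samples of silence per call."""
    return np.zeros(CHUNK, dtype=np.int16).tobytes()


signal = collect_signal(fake_read)
print(CHUNK / FS_RATE)             # duration of a single chunk in seconds
print(signal.shape[0] / FS_RATE)   # duration of the buffered signal in seconds
```

The resulting array can then be passed to the detector in place of the single 1024-sample chunk. How much audio is actually needed depends on the window, hop and n_noise_frames settings in use.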
I want to integrate this VAD into the SpeechRecognition module.
So I have this buffer variable and want to convert it to an ndarray in order to use it as an input for the VAD.
However, I get an error when using the detect_speech() function. Any suggestions? Thanks!
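On the conversion step itself, here is a minimal sketch of turning raw 16-bit PCM bytes into a float32 ndarray. It assumes the buffer holds native-endian int16 samples. The byte string is synthetic and only stands in for what source.stream.read(source.CHUNK) would return. It uses np.frombuffer, which newer NumPy releases recommend over np.fromstring for binary data.

```python
import numpy as np

# Synthetic stand-in for source.stream.read(source.CHUNK): 4 samples of 16-bit PCM.
buffer = np.array([0, -105, -45, 26], dtype=np.int16).tobytes()

# np.frombuffer interprets the bytes in place; the resulting array is read-only.
samples = np.frombuffer(buffer, dtype=np.int16)

# astype makes a writable float32 copy, matching the dtype used in this thread.
signal = samples.astype(np.float32)

print(signal.dtype, signal.shape)
```

The conversion alone does not change the length of the signal, so a single short chunk converted this way is still only one chunk long.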