lkuza2 / java-speech-api

The J.A.R.V.I.S. Speech API is designed to be simple and efficient, using the speech engines created by Google to provide functionality for parts of the API. Essentially, it is an API written in Java, including a recognizer, synthesizer, and a microphone capture utility. The project uses Google services for the synthesizer and recognizer. While this requires an Internet connection, it provides a complete, modern, and fully functional speech API in Java.
GNU General Public License v3.0

bytesToDoubleArray() sizing & FFT #100

Open nalbion opened 6 years ago

nalbion commented 6 years ago

As per the recommendations of Moattar and Homayounpour I'm trying to detect voice activity using a 10ms sliding window.

For 10ms of 16kHz 16bit mono audio, getNumOfBytes(.01) returns 320 (the expression evaluates to 320.5, but the cast to int truncates it).

...why add the .5?

    public int getNumOfBytes(double seconds) {
        AudioFormat format = getAudioFormat();
        return (int)(seconds * format.getSampleRate() * format.getFrameSize() + .5);
    }
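On the ".5" question: adding 0.5 before an int cast is a common way to round a non-negative value to the nearest integer instead of truncating it, which guards against floating-point products that land just below a whole number. The sketch below compares the two; the class and method names are hypothetical and not part of the library.

```java
// Hypothetical helper, not part of java-speech-api.
class ByteCountSketch {
    // Same expression as getNumOfBytes(): the +.5 makes the cast round to nearest
    // (for non-negative values) instead of truncating.
    static int rounded(double seconds, float sampleRate, int frameSize) {
        return (int) (seconds * sampleRate * frameSize + .5);
    }

    // Plain truncation: a product such as 319.99999999999994 would become 319.
    static int truncated(double seconds, float sampleRate, int frameSize) {
        return (int) (seconds * sampleRate * frameSize);
    }

    public static void main(String[] args) {
        // 10 ms of 16 kHz, 16-bit mono audio (frame size of 2 bytes).
        System.out.println(rounded(0.01, 16000f, 2));
        System.out.println(truncated(0.01, 16000f, 2));
    }
}
```

Note that neither version guarantees a result that is a multiple of the frame size; that would need an extra step if whole frames are required.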

Then getFrequency() calls bytesToDoubleArray(), passing the 320 bytes. Another point of confusion is the calculation of the size of micBufferData:

    double[] micBufferData = new double[bytesRecorded - bytesPerSample +1];
    for (int index = 0, floatIndex = 0; index < bytesRecorded - bytesPerSample + 1; index += bytesPerSample, floatIndex++) {
        ...
        micBufferData[floatIndex] = sample32;
    }

With 2 bytesPerSample, the code has allocated space for 319 doubles, but when it's done everything after micBufferData[159] is 0.0.

Back in getFrequency() I end up with an array of 319 Complex values, but again, everything after 159 is 0.0, 0.0.
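For comparison, here is a minimal sketch of a conversion that sizes the output by sample count (bytesRecorded / bytesPerSample), so 320 bytes yield exactly 160 doubles with no trailing zeros. It assumes 16-bit signed little-endian mono PCM, which may not match the library's AudioFormat, and the class and method names are hypothetical.

```java
// Hypothetical sketch, not the library's bytesToDoubleArray().
class SampleConversionSketch {
    // Assumption: 16-bit signed little-endian mono PCM.
    static double[] bytesToDoubles(byte[] bytes, int bytesRecorded) {
        final int bytesPerSample = 2;
        // One output slot per complete sample: 320 bytes -> 160 samples.
        int sampleCount = bytesRecorded / bytesPerSample;
        double[] samples = new double[sampleCount];
        for (int i = 0; i < sampleCount; i++) {
            int lo = bytes[2 * i] & 0xFF; // low byte, treated as unsigned
            int hi = bytes[2 * i + 1];    // high byte, sign-extended
            samples[i] = (hi << 8) | lo;  // signed 16-bit sample value
        }
        return samples;
    }
}
```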

In FFT() you check:

        // radix 2 Cooley-Tukey FFT
        if (N % 2 != 0) { throw new RuntimeException("N is not a power of 2"); }

...At first I thought "that's not checking if it is a power of 2", but since you call it recursively, this would eventually be a valid test. As it happens, the exception is thrown the first time through because I've got 160 values in an array with capacity for 319.
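A radix-2 FFT needs a power-of-two length, and neither 319 nor 160 qualifies. A rough sketch of an up-front check, plus zero-padding to the next power of two (160 samples would be padded to 256), might look like this. The helper names are hypothetical, and zero-padding is only one option; choosing a window that is already a power of two is another.

```java
import java.util.Arrays;

// Hypothetical helpers, not part of java-speech-api.
class PowerOfTwoSketch {
    // A positive power of two has exactly one bit set.
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    // Smallest power of two >= n (overflow for n > 2^30 is not handled here).
    static int nextPowerOfTwo(int n) {
        if (n < 1) {
            return 1;
        }
        return isPowerOfTwo(n) ? n : Integer.highestOneBit(n) << 1;
    }

    // Arrays.copyOf fills the added tail with 0.0.
    static double[] zeroPad(double[] samples) {
        return Arrays.copyOf(samples, nextPowerOfTwo(samples.length));
    }
}
```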

nalbion commented 6 years ago

I've changed my window size to 8ms and removed the "+1" mentioned above, but now when FFT returns, the first element always has a 0.0 imaginary component, and as a result findMaxMagnitude() finds a huge value at index 0 and votes it as the top result, so the frequency is always 0 and my VAD never detects any speech.
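One plausible explanation: for real-valued input, bin 0 of a DFT is simply the sum of the samples (the DC component), so its imaginary part is zero, and any constant offset in the audio makes its magnitude large. The sketch below shows two common workarounds, subtracting the mean before the transform and skipping bin 0 in the peak search. It uses a naive DFT for illustration rather than the library's FFT, and all names are hypothetical.

```java
// Hypothetical sketch, not the library's getFrequency()/findMaxMagnitude().
class PeakBinSketch {
    // Subtract the average so the DC component (bin 0) is close to zero.
    static double[] removeMean(double[] x) {
        double sum = 0;
        for (double v : x) {
            sum += v;
        }
        double mean = sum / x.length;
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = x[i] - mean;
        }
        return out;
    }

    // Naive O(n^2) DFT magnitudes, for illustration only.
    static double[] dftMagnitudes(double[] x) {
        int n = x.length;
        double[] mag = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = -2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im += x[t] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    // Largest bin in 1..n/2, ignoring bin 0 (DC) and the mirrored upper half.
    // Assumes at least 2 magnitudes.
    static int peakBin(double[] magnitudes) {
        int best = 1;
        for (int k = 2; k <= magnitudes.length / 2; k++) {
            if (magnitudes[k] > magnitudes[best]) {
                best = k;
            }
        }
        return best;
    }

    // Convert a bin index to a frequency in Hz.
    static double binToHz(int bin, double sampleRate, int n) {
        return bin * sampleRate / n;
    }
}
```

With an 8ms window at 16kHz (128 samples), each bin spans 125 Hz, so bin 8 corresponds to 1000 Hz.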

goxr3plus commented 5 years ago

Open issues here https://github.com/goxr3plus/java-google-speech-api