nyumaya / nyumaya_audio_recognition

Classify audio with neural nets on embedded systems like the Raspberry Pi
https://nyumaya.com
Apache License 2.0

Complete far field processing chain #11

Closed jonsmirl closed 5 years ago

jonsmirl commented 5 years ago

What is a good sequence of processing for far-field voice algorithms? When should you do VAD, AEC, DOA, beamforming, etc.?

Here is example code that uses the WebRTC audio processing module to do AEC, AGC, and ANC. It can also do beamforming if you give it the direction of arrival. https://github.com/shichaog/WebRTC-audio-processing

WebRTC audio also supports VAD; libfvad packages it as a standalone C library. https://github.com/dpirch/libfvad
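For a quick illustration of how the VAD would gate the rest of the chain, here is a minimal Python sketch using the py-webrtcvad bindings to the same WebRTC VAD. The `pcm_source` object and the `run_kwd` keyword-detector callback are hypothetical placeholders, not part of any of the libraries above:

```python
# Gate an (expensive) keyword detector with the WebRTC VAD.
# Requires: pip install webrtcvad.  Audio must be 16-bit mono PCM at
# 8/16/32/48 kHz, delivered in 10/20/30 ms frames.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples -> 2 bytes each

vad = webrtcvad.Vad(2)   # aggressiveness 0 (least) .. 3 (most aggressive)

def process(pcm_source, run_kwd):
    """pcm_source: file-like object yielding raw PCM bytes; run_kwd: KWD callback."""
    while True:
        frame = pcm_source.read(FRAME_BYTES)
        if len(frame) < FRAME_BYTES:
            break
        # Only wake the keyword detector when the VAD reports speech.
        if vad.is_speech(frame, SAMPLE_RATE):
            run_kwd(frame)
```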

I haven't located a good library for DOA. Everyone seems to be using GCC-PHAT.
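GCC-PHAT itself is only a few lines of NumPy. Here is a minimal sketch that estimates the delay between two mic channels and converts it to an angle; the mic spacing, sample rate, and far-field geometry are assumptions for illustration only:

```python
# Minimal GCC-PHAT sketch: estimate the time difference of arrival (TDOA)
# between two microphone signals, then convert it to a bearing angle.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Return the estimated delay of `sig` relative to `ref` in seconds."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # Phase transform: keep only phase information, discard magnitude.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=interp * n)
    max_shift = interp * n // 2
    if max_tau:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

def doa_angle(sig, ref, fs, mic_distance=0.05, c=343.0):
    """Far-field assumption: tau = mic_distance * sin(theta) / c."""
    tau = gcc_phat(sig, ref, fs, max_tau=mic_distance / c)
    return np.degrees(np.arcsin(np.clip(tau * c / mic_distance, -1.0, 1.0)))
```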

So my working theory is this sequence (sketched in code below):

1) VAD - one channel (or should KWD run continuously?)
2) KWD - one channel
3) DOA - in parallel with KWD?
4) ...or DOA after KWD?
5) Beamforming using DOA, AEC, AGC, ANC - all channels
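To make the proposed data flow concrete, here is a rough skeleton of that ordering. Every processing callable (`vad`, `kwd`, `estimate_doa`, `beamform`, `enhance`) is a hypothetical placeholder for a real library, and whether the enhancement stage should instead run before KWD is exactly the open question below:

```python
# Skeleton of the proposed far-field chain.  All callables are placeholders.
def far_field_chain(frames, vad, kwd, estimate_doa, beamform, enhance):
    """frames: iterable of (samples, channels) float arrays."""
    for frame in frames:
        mono = frame[:, 0]                    # 1)/2) VAD and KWD on one channel
        if not vad(mono):
            continue
        if not kwd(mono):
            continue
        angle = estimate_doa(frame)           # 3)/4) DOA once the keyword fired
        steered = beamform(frame, angle)      # 5) beamform toward the talker...
        yield enhance(steered)                # ...then AEC/AGC/ANC on the result
```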

But can KWD be done in the presence of audio activity (music playing)? Does AEC need to happen before KWD?

yodakohl commented 5 years ago

First of all, a microphone with a good SNR helps most for far-field applications.

AEC would definitely be a big improvement, especially when playing music on a smart speaker, but it's pretty heavy on the CPU. Getting AEC to work can be a bit difficult. One can let PulseAudio handle AEC, but this is not as simple as using ALSA and comes with its own problems. When using ALSA, the AEC has to be contained within your application, so the app has to handle all audio output. It won't work if you detect a keyword in a Python app and play a music file from the command line. I think a hardware AEC will work best without too much hassle.

Does AEC need to happen before KWD => Yes

VAD could reduce energy consumption and should be easy to implement. But you will need enough spare CPU to handle keyword detection anyway.

But can KWD be done in the presence of audio activity (music playing)? It depends on the distance between the music source and the microphone. In the smart speaker case, where the mic sits within centimeters of the speaker, the recognition rate will be dramatically reduced. If it's possible, locating the mic and speaker far apart will be an easy gain in performance.

I think implementing DOA, beamforming, AEC, AGC, and ANC will only give a small improvement each. This comes at the cost of increased complexity and development and maintenance effort.

The current trend in machine learning seems to be moving toward end-to-end architectures. I think the future voice recognition system will be a neural net with a multichannel microphone input and one input for the current audio output. This would give the network all the information it needs to learn beamforming, AEC, AGC, and ANC on its own.
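Purely as an illustration of that idea (the framework, channel counts, and layer sizes are all assumptions, not anything nyumaya ships), such a network could simply take the raw mic channels stacked with the playback reference:

```python
# Toy end-to-end sketch: keyword classification straight from the
# multichannel mic signal plus the current playback (loopback) signal.
import torch
import torch.nn as nn

class MultiChannelKWS(nn.Module):
    def __init__(self, num_mics=6, num_keywords=2):
        super().__init__()
        # num_mics microphone channels + 1 playback reference channel.
        self.net = nn.Sequential(
            nn.Conv1d(num_mics + 1, 32, kernel_size=160, stride=80),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, num_keywords + 1),   # keywords + "no keyword"
        )

    def forward(self, mics, playback):
        # mics: (batch, num_mics, samples), playback: (batch, 1, samples)
        return self.net(torch.cat([mics, playback], dim=1))

# One second of 16 kHz audio from 6 mics plus the playback reference.
model = MultiChannelKWS()
logits = model(torch.randn(1, 6, 16000), torch.randn(1, 1, 16000))
```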

jonsmirl commented 5 years ago

The AEC code in WebRTC is supposed to be quite good. I will check it out in the next few days. PulseAudio uses the WebRTC audio processing internally now (module-echo-cancel).

Another of the ReSpeaker boards is also interesting. The 6-mic board has an 8-channel input; the extra two channels are connected to the Raspberry Pi audio output. No DSP, but the audio out is fed back via hardware. http://wiki.seeedstudio.com/ReSpeaker_6-Mic_Circular_Array_kit_for_Raspberry_Pi/
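A capture sketch for that board, assuming the `sounddevice` package and the usual seeed channel layout; the device name and channel ordering here are assumptions, so check the seeed wiki and `python -m sounddevice` for your setup:

```python
# Read all 8 channels from the ReSpeaker 6-Mic Circular Array and split the
# microphone channels from the hardware loopback of the Pi's output, which is
# exactly the reference signal a software AEC needs.
import sounddevice as sd

SAMPLE_RATE = 16000
FRAME = 1600   # 100 ms blocks

def callback(indata, frames, time, status):
    mics = indata[:, 0:6]          # channels 0-5: the six microphones
    playback_ref = indata[:, 6:8]  # channels 6-7: loopback of the audio output
    # ... feed `mics` and `playback_ref` into the echo canceller here ...

with sd.InputStream(device="seeed-8mic-voicecard", channels=8,
                    samplerate=SAMPLE_RATE, blocksize=FRAME,
                    callback=callback):
    sd.sleep(10_000)   # capture for 10 seconds
```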

https://arunraghavan.net/2016/05/improvements-to-pulseaudios-echo-cancellation/ https://arunraghavan.net/2016/06/beamforming-in-pulseaudio/

jonsmirl commented 5 years ago

How does DOA work in a noisy environment? For example, a TV is playing in the background and someone says the keyword. How do you process the audio to make sure you get the DOA of the keyword and not the DOA of the TV noise?

yodakohl commented 5 years ago

I'm closing this since this kind of discussion is not an issue. I created a Gitter chat for discussions that are not issues.