pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Any new 1.60 Keyword spotting examples out there now we have torchaudio? #901

Closed StuartIanNaylor closed 3 years ago

StuartIanNaylor commented 4 years ago

I posted on the forum and @ptrblck on the pytorch forum suggested I post here and shout out for @vincentqb

https://discuss.pytorch.org/t/any-new-1-60-keyword-spotting-examples-out-there-now-we-have-torchaudio/95146

It's sort of covered in the link above, but it would be amazing if we could get some examples using torchaudio as a streaming KWS, now that libraries like librosa are no longer needed.

I posted about the Linto HMG tool because there is a big shortage of truly open-source keyword/hotword systems. There are various freeware and open-source options, but the models are often black boxes, or producing them is far from easy.

It would be really great to get something like the Linto HMG on PyTorch, so that not just the dev community but a whole array of people could see how far they can push accuracy with certain model types. Keyword/hotword models are often GRU-based, but CRNN and DS-CNN architectures seem to be getting better results nowadays.

If it could be a central project it might serve as a base for other additions, as the only thing slightly underwhelming in torchaudio is the SoX-style VAD, which trims wavs, rather than the WebRTC-style VAD that many use for silence and voice activation.

Also, while on the topic of VAD and MFCC: when computing MFCCs, do we not already have the frequency bins a VAD needs? Would it be possible to combine the two and cut out much of the duplication of running VAD and MFCC at the same time?

Another PyTorch use case could be a model-based VAD that uses ML to identify voice more accurately against possible media noise.
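A learned VAD of the kind suggested here is usually just a tiny recurrent classifier over spectral frames. The sketch below is hypothetical and untrained; it only shows the shape of such a model (a small GRU mapping log-mel frames to per-frame speech probabilities), not a working detector, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class GRUVad(nn.Module):
    """Toy per-frame voice-activity classifier (illustrative only)."""

    def __init__(self, n_mels: int = 40, hidden: int = 32):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, frames):          # frames: (batch, time, n_mels)
        h, _ = self.gru(frames)         # (batch, time, hidden)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, time)

vad = GRUVad()
probs = vad(torch.randn(2, 101, 40))    # 2 utterances of 101 log-mel frames
```

Trained on labelled speech/noise data, thresholding `probs` would give the silence/voice activation the WebRTC VAD is used for, but as a model rather than a fixed heuristic.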

Anyway, if the above could be done it would be great, as I have seriously been searching for a good open-source KWS with a good open-source model creator for quite a while, and I don't think I am alone. Also, having torchaudio on a Raspberry Pi means we can drop librosa, since Numba (its JIT dependency) is not the easiest to install on the Pi.

Also, once more :) now that the 64-bit OS is available for the Pi in both Lite and Desktop editions, is there any chance of an official wheel? https://downloads.raspberrypi.org/raspios_lite_arm64/images/raspios_lite_arm64-2020-08-24/

Tom on the forum has kindly posted builds: https://discuss.pytorch.org/t/raspberry-arm64-binaries/94751/6?u=rolyan_trauts http://mathinf.com/pytorch/arm64/

I will gladly use Tom's, but an official wheel would be excellent, as otherwise you have to know about that forum post to find them.

vincentqb commented 4 years ago

Hi, thanks for reaching out! Let me know if I missed something in your comments. :)

StuartIanNaylor commented 4 years ago

Hi,

In a lot of open-source voice-AI projects a really good lightweight KWS is still missing, since on likely targets such as the Raspberry Pi (Cortex-A53) a full ASR such as wav2letter would probably mean a high processing load. A keyword-spotting model is generally much simpler than a phonetic ASR, which also makes creating more accurate custom models much easier.

Something generally along the lines of the TensorFlow example https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/speech_commands But KWS can run in extremely low-load conditions: https://github.com/ARM-software/ML-KWS-for-MCU Interestingly, compared to the often-used GRU, CRNN and DS-CNN architectures seem to be more efficient with limited resources.
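For a sense of scale, a DS-CNN-style classifier of the kind the ARM repo benchmarks can be sketched in plain PyTorch in a few lines: a convolutional stem followed by depthwise-separable convolutions over an MFCC "image". The layer sizes and the class count (12, as in the speech_commands example) are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TinyDSCNN(nn.Module):
    """Toy DS-CNN-style keyword-spotting classifier (illustrative sizes)."""

    def __init__(self, n_classes: int = 12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),              # conv stem
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32),  # depthwise
            nn.Conv2d(32, 64, kernel_size=1),                        # pointwise
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                                 # global pool
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                 # x: (batch, 1, n_mfcc, frames)
        h = self.features(x).flatten(1)   # (batch, 64)
        return self.classifier(h)

model = TinyDSCNN()
logits = model(torch.randn(4, 1, 13, 101))   # batch of 4 MFCC windows
```

The depthwise + pointwise pair is what keeps the parameter count low enough for MCU-class targets, which is presumably why DS-CNN does well in the ARM comparison.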

To bang on about Arm again: librosa is a total pain to install because of Numba (JIT), which it effectively requires. One of the reasons I was so impressed by torchaudio is that it provides optimised audio tools without needing librosa. So a VAD that depends on librosa isn't that attractive, and even with the Numba JIT it is still not hugely optimised or light.

But one thing I have been wondering for a while: MFCC and VAD both analyse spectral frames via FFT routines to get the frequency bins, twice in separate routines, when a single routine over that frame spectrum could provide both VAD and MFCC for little extra overhead.

The librosa we mention is just the Python implementation of https://labrosa.ee.columbia.edu/matlab/rastamat/ which is far beyond me :) There is also a really great Julia version, https://github.com/JuliaDSP/MFCC.jl, where feacalc() is used for speaker recognition and diarisation. I'm not sure about standard neural networks, but there are a few neural-network speaker recognition/diarisation repos on GitHub in varying states of completion.

So yeah, far too many comments, branching all over the place :) apols.

Just wondering: since it is quite a common want nowadays, would PyTorch provide some examples of KWS?

That then sort of spiralled off, but really, with torchaudio available it would be extremely convenient to use only torchaudio, without needing further audio libraries. It is really just a WebRTC-style VAD, rather than the SoX one, that is missing. Whether extra functionality such as a feature-based VAD or feacalc() should also be included is another question.

stonelazy commented 3 years ago

@StuartIanNaylor I just wanted to check whether you have solved this streaming featurizer problem, or whether you have found any reference implementation for it?

StuartIanNaylor commented 3 years ago

It's all there in torchaudio if you are not using Arm; I still have dependency problems where Intel MKL is enforced, which is sort of annoying.