synesthesiam / voice2json

Command-line tools for speech and intent recognition on Linux
MIT License

Ideas #13

Open StuartIanNaylor opened 4 years ago

StuartIanNaylor commented 4 years ago

@synesthesiam Once again your simple genius is on display.

You posted in http://voice2json.org/#ideas

It's basically about datasets being formatted and made available for KWS, since there are already huge datasets for ASR.

Linto is probably a good place to look for KWS, as they are doing some interesting things and showing much insight in providing solutions for existing technology platforms. It's some of the smaller repos, such as https://github.com/linto-ai/gpu-ne10-mfcc, where they have looked at load-heavy processes and optimised them for NEON.

Dataset-wise, https://github.com/linto-ai/linto-desktoptools-hmg is an exceptionally good idea, and they have even open-sourced their website for dataset capture: https://github.com/linto-ai/linto-wakemeup https://github.com/linto-ai/linto-wuw-booth

https://github.com/linto-ai/linto-desktoptools-voxharvest is another example of how they recognise that the dataset is effectively the code of the model, and they seem very determined to make datasets open source, available and easily accessible.

I think for KWS https://github.com/linto-ai/linto-command-module is definitely worth a look, because it's open source, obviously has talented developers like yourself, and is likely to stay that way.

But maybe just use PyRTSTools (custom tools for real-time speech processing) if their choice of ancillary tools is not to taste.

Tenacity - general-purpose retrying library
paho-mqtt - MQTT client library

I presume their KWS will get (or already has) much of the Ne10 optimisation they gave their MFCC module, but the big draw for me is their focus on homesteading the dataset.

I am going to bang on about my usual project of audio processing and how it has a natural, cost-effective distribution with extremely common and available SoCs. Multiple satellites can feed a server that feeds a mixer, selecting on the peak VAD level over the period of the current KW. I really like the idea of snapcast for this, as it is so simple but latency-compensated via time code. Only the satellites that recognised the KW above the VAD threshold would forward audio to the mixer, and it is extremely lightweight.
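
Purely as a sketch of that selection step (nothing from snapcast or any existing mixer; the names and message fields are made up), the server-side "best satellite wins" logic could be as simple as:

```python
# Minimal sketch: each satellite reports the peak VAD/energy level it saw over
# the keyword window, and the server only forwards audio from the winner.
def pick_best_satellite(reports):
    """reports: list of dicts like {"satellite": "kitchen", "peak_vad": 0.83}."""
    if not reports:
        return None
    best = max(reports, key=lambda r: r["peak_vad"])
    return best["satellite"]

# Example: three satellites heard the keyword at different levels.
reports = [
    {"satellite": "kitchen", "peak_vad": 0.83},
    {"satellite": "hallway", "peak_vad": 0.41},
    {"satellite": "lounge", "peak_vad": 0.77},
]
print(pick_best_satellite(reports))  # -> "kitchen"
```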

I always mention this repo https://github.com/voice-engine/ec because, for whatever reason, I cannot get the alsa-plugins speexdsp and pulseaudio-webrtc AEC to work as effectively. Much of the confusion comes from badly recommended hardware that has playback and capture on different clocks, when 2x I2S mics of extremely low cost, wired with 5 wires to the Pi GPIO, share the 3.5mm clock and are synced, as graciously documented by Adafruit: https://learn.adafruit.com/adafruit-i2s-mems-microphone-breakout/raspberry-pi-wiring-test. Software AEC is effective and possible from approximately Cortex-A53/A35 level or above.

There is a huge opportunity to add further audio processing while the wav is processed into an MFCC image. For example, raising the log-mel amplitudes to a suitable power (around 2 or 3) before taking the DCT (Discrete Cosine Transform) reduces the influence of low-energy components, commonly known as noise reduction, and its load is negligible as it is part of an existing process. Likewise, some really simple audio processing such as band-pass or high-pass filtering can be included as part of the MFCC creation, since it is just a matter of which bins of that process you select. There is already audio processing such as VAD that processes a subsection and then passes the wav on to be processed again into the final MFCC. I am thinking much of this can be amalgamated, with optional, selectable additional filtering routines.
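
As a rough illustration of that MFCC-stage processing (a sketch only, not how voice2json or Linto do it): compute the mel energies, raise the log energies to a power of around 2-3, then take the DCT. Here librosa/scipy are assumed, log1p is used just to keep the log energies non-negative, and the band limits and power value are illustrative:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_with_power(wav_path, n_mels=40, n_mfcc=13, power=2.0,
                    fmin=100.0, fmax=7000.0):
    # Load and resample to 16 kHz mono.
    y, sr = librosa.load(wav_path, sr=16000)
    # Mel filterbank energies; fmin/fmax act as the band-pass mentioned above,
    # since band limiting is just a choice of which bins you keep.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         fmin=fmin, fmax=fmax)
    log_mel = np.log1p(mel)        # non-negative log energies
    boosted = log_mel ** power     # power > 1 shrinks weak (noisy) bins relative to strong ones
    # DCT over the mel axis, keeping only the first n_mfcc coefficients.
    return dct(boosted, type=2, axis=0, norm='ortho')[:n_mfcc]
```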

I also feel that the introduction of low-cost Cloud TPUs brings, in terms of models, really interesting possibilities for an initial training-capture period and custom own-voice model creation, for everyone who does not have access to CPU power, accelerators, the skill set or even the will. It's just an intent and a minimum donation.

Acoustic Models From Audiobooks

There are many transcribed datasets that are relatively huge; the only problem for KWS is that they are ASR datasets of sentences, not KWS datasets where the focus is the word.

You can grab any dataset, run it through an ASR, and it will give you the start position and duration of each word, which can be validated against the original transcript. As in deepspeech --json, it's that simple; sure, Kaldi does something similar, but Kaldi generally confuses the hell out of me. If you run deepspeech on a dataset's wavs with --json, it will output a JSON of word occurrences for each wav.
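
A sketch of how those word timings could be turned into KWS clips. The model/scorer filenames are placeholders, and the JSON layout (transcripts/words with word, start_time, duration) is an assumption that differs between DeepSpeech releases, so treat the parsing as something to adapt:

```python
import json
import subprocess
import soundfile as sf

def words_from_deepspeech(wav_path, model="deepspeech.pbmm", scorer="deepspeech.scorer"):
    # Run the deepspeech CLI with --json and parse its word timings.
    out = subprocess.check_output(
        ["deepspeech", "--model", model, "--scorer", scorer,
         "--audio", wav_path, "--json"])
    meta = json.loads(out)
    return meta["transcripts"][0]["words"]          # assumed layout

def cut_keyword(wav_path, keyword, out_path, pad=0.1):
    # Slice the first occurrence of `keyword` out of the wav, with a little padding.
    audio, sr = sf.read(wav_path)
    for w in words_from_deepspeech(wav_path):
        if w["word"].lower() == keyword:
            start = max(0.0, w["start_time"] - pad)
            end = w["start_time"] + w["duration"] + pad
            sf.write(out_path, audio[int(start * sr):int(end * sr)], sr)
            return True
    return False
```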

There are many datasets available that just need formatting and presenting correctly for a queryable, parameterised dataset return: a collection of server-based folders and custom user submissions glued together as one dataset via overlayfs. https://openslr.org/83/

Nation-Gender-Region-Volume is the top-down hierarchy by which the stored global dataset can be queried. The hierarchy is pretty obvious apart from volume, which is just a method to limit the number of data items returned, probably based on word-usage commonality and the preferred model size.
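
To make that concrete, a toy sketch of a parameterised query over such a hierarchy; the folder layout and the way volume is applied are hypothetical, just to show the idea:

```python
from pathlib import Path

def query_dataset(root, nation="*", gender="*", region="*", volume=None):
    # Hypothetical layout: <root>/<nation>/<gender>/<region>/<keyword>/*.wav
    wavs = sorted(Path(root).glob(f"{nation}/{gender}/{region}/*/*.wav"))
    # "volume" is just a cap on how many items come back.
    return wavs[:volume] if volume else wavs

# e.g. up to 5000 female UK samples from any region:
# clips = query_dataset("/srv/kws-dataset", nation="uk", gender="female", volume=5000)
```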

Basically Linto are doing a Mozilla Common Voice, but without being so batshit crazy as to provide world-class dataset capture solutions and then zero dataset creation tools apart from the whole dataset in one huge ugly tar. Mozilla do ASR datasets; we need KWS, and the difference is just sentence versus word.

https://openslr.org/contributions.html

https://voice.mozilla.org/en/datasets
http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
https://catalog.ldc.upenn.edu/LDC2002T43
https://catalog.ldc.upenn.edu/LDC97S42
https://www.openslr.org/12
https://www.openslr.org/51
https://github.com/Jakobovski/free-spoken-digit-dataset
https://www.idiap.ch/dataset/ted
https://www.kaggle.com/primaryobjects/voicegender
https://fluent.ai/fluent-speech-commands-a-dataset-for-spoken-language-understanding-research/
http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME5/download.html
https://arxiv.org/abs/1903.11269
http://projects.ael.uni-tuebingen.de/backbone/moodle/
http://en.arabicspeechcorpus.com/
https://mirjamernestus.nl/Ernestus/NCCFr/index.php
https://nats.gitlab.io/swc/

Sentence splitting, resampling and wav window padding are not that hard, and the available data is pretty huge. With a database-driven overlayFS it should be possible to create solutions for everyone that are extremely comprehensive and query-specific.
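
A minimal sketch of that clip-normalisation step, assuming librosa and soundfile; the 1-second window and 16 kHz rate are common KWS choices rather than anything mandated:

```python
import numpy as np
import librosa
import soundfile as sf

def normalise_clip(in_path, out_path, sr=16000, window_s=1.0):
    y, _ = librosa.load(in_path, sr=sr)           # resample to the target rate
    target = int(sr * window_s)
    if len(y) >= target:                          # centre-crop long clips
        start = (len(y) - target) // 2
        y = y[start:start + target]
    else:                                         # centre-pad short clips
        pad = target - len(y)
        y = np.pad(y, (pad // 2, pad - pad // 2))
    sf.write(out_path, y, sr)
```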

StuartIanNaylor commented 4 years ago

@synesthesiam

Michael, have you had a look at https://github.com/JuliaDSP/MFCC.jl?

I am going to copy and paste its README, because seriously, you have to read it.

MFCC

A package to compute Mel Frequency Cepstral Coefficients.

The essential routine is re-coded from Dan Ellis's rastamat package, and parameters are named similarly.

Please note that the feature-vector array consists of a vertical stacking of row-vector features. This is consistent with the sense of direction of, e.g., Base.cov(), but inconsistent with, e.g., DSP.spectrogram() or Clustering.kmeans().

mfcc() has many parameters, but most of these are set to defaults that should mimic HTK default parameters (not thoroughly tested).

Feature extraction

Main routine:

mfcc(x::Vector, sr=16000.0, defaults::Symbol; args...)

Extract MFCC features from the audio data in x, using parameter settings characterized by defaults:

:rasta: defaults according to Dan Ellis' Rastamat package
:htk: defaults mimicking the defaults of HTK (unverified)
:nbspeaker: narrow-band speaker recognition
:wbspeaker: wide-band speaker recognition

The actual routine for MFCC computation has many parameters; these are basically the same parameters as in Dan Ellis's rastamat package.

mfcc(x::Vector, sr=16000.0; wintime=0.025, steptime=0.01, numcep=13, lifterexp=-22, sumpower=false, preemph=0.97, dither=false, minfreq=0.0, maxfreq=sr/2, nbands=20, bwidth=1.0, dcttype=3, fbtype=:htkmel, usecmp=false, modelorder=0)

This is the main routine computing MFCCs. x should be a 1D vector of FloatingPoint samples of speech, sampled at a frequency of sr. Every steptime seconds, a frame of duration wintime is analysed. The log energy in a filterbank of nbands bins is computed, and a cepstral (discrete cosine transform) representation is made, keeping only the first numcep coefficients (including log energy). The result is a tuple of three values:

a matrix of numcep columns with, for each speech frame, a row of MFCC coefficients
the power spectrum computed with DSP.spectrogram() from which the MFCCs are computed
a dictionary containing information about the parameters used for extracting the features

Pre-set feature extraction applications

We have defined a couple of standard sets of parameters that should function well for particular applications in speech technology. They are accessible through the higher level function feacalc(). The top-level interface for calculating features is:

feacalc(wavfile::AbstractString, application::Symbol; kwargs...)

This will compute speech features suitable for a specific application, which currently can be one of:

:nbspeaker: narrowband (telephone speech) speaker recognition: 19 MFCCs + log energy, deltas, energy-based speech activity detection, feature warping (399 samples)
:wbspeaker: wideband speaker recognition: same as above but with wideband MFCC extraction parameters
:language: narrowband language recognition: Shifted Delta Cepstra, energy-based speech activity detection, feature warping (299 samples)
:diarization: 13 MFCCs, utterance mean and variance normalization

The kwargs... parameters allow for various options in file format, feature augmentation, speech activity detection and MFCC parameter settings. They trickle down to versions of feacalc() and mfcc() that allow for more detailed specification of these parameters.

feacalc() returns a tuple of three structures:

an Array of features, one row per frame
a Dict with metadata about the speech (length, SAD-selected frames, etc.)
a Dict with the MFCC parameters used for feature extraction

More fine-grained control of feacalc()

feacalc(wavfile::AbstractString; method=:wav, kwargs...)

This function reads an audio file from disk and represents the audio as an Array, and then runs the feature extraction.

The method parameter determines what method is used for reading in the audio file:

:wav: use Julia's native WAV library to read RIFF/WAVE .wav files
:sox: use the external sox program for figuring out the audio file format and converting to the native representation
:sph: use the external w_decode program to deal with (compressed) NIST sphere files

feacalc(x::Array; chan=:mono, augtype=:ddelta, normtype=:warp, sadtype=:energy, dynrange::Real=30., nwarp::Int=399, sr::AbstractFloat=8000.0, source=":array", defaults=:nbspeaker, mfccargs...)

The chan parameter specifies for which channel in the audio file you want features. Possible values are:

:mono: average all channels
:a, :b, ...: use the first (left), second (right), ... channel
c::Int: use the c-th channel

The augtype parameter specifies how the speech features are augmented. This can be:

:none for no additional features
:delta for 1st order derivatives
:ddelta for first and second order derivatives
:sdc for replacement of the MFCCs with shifted delta cepstra with the default parameters n, d, p, k = 7, 1, 3, 7

The normtype parameter specifies how the features are normalized after extraction:

:none for no normalization
:warp for short-time Gaussianization using nwarp frames, see warp() below
:mvn for mean and variance normalization, see znorm() below

The sad parameter controls if Speech Activity Detection is carried out on the features, filtering out frames that do not contain speech:

:none: apply no SAD
:energy: apply energy-based SAD, omitting frames with an energy less than dynrange below the maximum energy of the file

The various applications actually have somewhat different parameter settings for the basic MFCC feature extraction; see the defaults parameter of mfcc() below.

Feature warping, or short-time Gaussianization (Jason Pelecanos)

warp(x::Matrix, w=399)

This transforms the columns of x by short-time Gaussianization. Each value in the middle of w rows is replaced with its normal deviate (the quantile function of the normal distribution) based on its rank within the w values. The result has the same dimensions as x, but the values are chosen from a discrete set of w normal deviates.

znorm(x::Matrix)
znorm!(x::Matrix)

This normalizes the data x on a column-by-column basis by an affine transformation, making the per-column mean 0 and variance 1.

Short-term mean and variance normalization

As an alternative to short-time Gaussianization, and similar to znorm(), you can compute the znorm for a sample in the centre of a sliding window of width w, where mean and variance are computed just over that window, using:

stmvn(x::Matrix, w=399)

Derivatives

Derivative of features, fitted over width consecutive frames:

deltas(x::Matrix, width::Int)

The derivatives are computed over columns individually, and before the derivatives are computed the data is padded with repeats of the first and last rows. The resulting matrix has the same size as x. deltas() can be applied multiple times in order to get higher order derivatives.

Shifted Delta Cepstra

SDCs are features used for spoken language recognition, typically derived from MFCCs.

sdc(x::Matrix, n::Int=7, d::Int=1, p::Int=3, k::Int=7)

This function expands (MFCC) features in x by computing derivatives over 2d+1 consecutive frames for the first n columns of x, stacking derivatives shifted over p frames k times. Before the calculation, zero padding is added so that the number of rows of the result is the same as for x.
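
(Not part of the quoted README.) To make the feature-warping idea concrete, here is roughly what short-time Gaussianization does, re-expressed in numpy as described above; this is a sketch of the algorithm, not MFCC.jl's actual implementation, and the edge handling at the start and end of the file is my own choice:

```python
import numpy as np
from scipy.stats import norm

def warp(x, w=399):
    """Short-time Gaussianization of a feature matrix x (frames x features)."""
    n, d = x.shape
    half = w // 2
    out = np.empty_like(x, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        win = x[lo:hi]
        # Rank of the centre frame within the window, per feature column,
        # mapped to the corresponding normal deviate (quantile function).
        rank = (win < x[i]).sum(axis=0) + 1           # 1-based rank
        out[i] = norm.ppf(rank / (win.shape[0] + 1))
    return out
```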

Am I reading it right that feacalc() actually has language detection?

Even Linto are using the Python-based Mycroft repo for MFCC; this is Julia, which is really interesting anyway compared to the "speed snake" of Python, well, sort of. But it's based on https://labrosa.ee.columbia.edu/matlab/rastamat/. The more I read about it, the more my eyes bulged and my jaw dropped: this is without doubt likely the most comprehensive, fastest, most advanced set of MFCC operations in a library by a country mile. Now at least I understand why librosa exists, but there is a Julia version available which, as I understand it, is as good as C and readable by many.

RASTA-PLP is extremely interesting, and it's the first time I have heard of it. But if we get into the situation where we have the datasets, then by applying the same audio processing used for our KW capture to our model data, we can allow many different forms via parameterised MFCC settings.

https://github.com/JuliaDSP/MFCC.jl/commits?author=davidavdav

Who I may have offended already without meaning to, or who likely (and correctly) thinks I am a moron, is godlike when it comes to DSP; check his GitHub profile.

Maybe it's just a stupid question to ask whether VAD and MFCC can be part of the same process, as VAD inspects a single frame whilst MFCC processes a series of frames based on each VAD frame output.

I've been wondering for a while whether we have a duplication of two high-load processes during KWS that could, to some degree, be functionally amalgamated.
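
Purely as a sketch of what that amalgamation could look like (illustrative thresholds, not anyone's actual pipeline): compute the mel energies once per frame and derive both an energy-based SAD decision and the MFCCs from the same pass:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def frames_to_sad_and_mfcc(y, sr=16000, n_mels=40, n_mfcc=13, dyn_range_db=30.0):
    # One mel-filterbank pass per frame, shared by SAD and MFCC.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)
    # Energy-based SAD: drop frames more than dyn_range_db below the peak.
    frame_energy_db = 10.0 * np.log10(mel.sum(axis=0) + 1e-12)
    speech = frame_energy_db > frame_energy_db.max() - dyn_range_db
    # MFCCs from the same log-mel frames.
    mfcc = dct(log_mel, type=2, axis=0, norm='ortho')[:n_mfcc]
    return mfcc[:, speech], speech   # keep only the speech frames
```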

PS: it's SAD, not VAD, and don't reply "Let's be SAD then :)", as I think the author is a little more serious than I am generally.

I will try to get more info