linto-ai / linto-desktoptools-hmg

GUI Tool to create, manage and test Keyword Spotting models using TF 2.0
GNU Affero General Public License v3.0

Estimated schedule of completion #6

Closed StuartIanNaylor closed 3 years ago

StuartIanNaylor commented 3 years ago

Hi guys, apologies to bother you, but I'm just wondering what your schedule is for HMG. It's a great tool, and I'm wondering when the next version will land and what it will include.

What is due in v1.0? I notice it has had a rebuild from the ground up, and I'm wondering what net architecture you may use, or are you sticking with a basic GRU? Also, is it the same MFCC pipeline, or is the wait due to the likes of tensorflow.keras.layers.experimental.preprocessing / tensorflow.io and how that might pan out? That would likely mean the timing is out of your hands, but I'm just wondering, as I've been excited and waiting for some time now.

Lokhozt commented 3 years ago

Hello Stuart,

> Hi guys apols to bother but just wondering what your schedule is with the HMG as its a great tool but wondering when and what it will be.

v1.0 "should" be ready with the core functionality next week, including dataset management, feature profiles, model profiles, training, testing and export.

Release binaries (including Ubuntu 20) might come alongside, depending on my ability to reduce the executable size to an acceptable level.

> What is due to v1.0 which notice is have a rebuild from the ground up and wondering what net architecture you may use or are you sticking with a basic GRU.

As for the net architecture, the release will keep the GRU-based approach but will allow adding others later (I'm thinking CNN, attention, ...).

> Also same MFCC as is the wait due to the likes of tensorflow.keras.layers.experimental.preprocessing / tensorflow.io and how that might pan out?

No idea.

StuartIanNaylor commented 3 years ago

Cheers :) good news.

It's pretty easy to build anyway; the biggest problem on older machines is TensorFlow and AutoKeras, as I spent some time finding which versions would work.

GRU is fine, but a CRNN is supposedly more accurate for far fewer ops. Fewer ops than a CNN, even.

Going on what Arm say... https://github.com/ARM-software/ML-KWS-for-MCU

Tensorflow.io seems to be a lib for MFCC and file operations. But you can do it like my hacky attempt with TF 2.4: https://github.com/StuartIanNaylor/simple_audio_tensorflow/blob/main/simple_audio_mfcc_frame_length256_frame_step128.py You can drop sonopy...

Lokhozt commented 3 years ago

> Cheers :) good news.
>
> Its pretty easy to build anyway biggest problem with older machines is tensorflow and autokeras as spent some time finding which version would work.
>
> GRU is fine but CRNN supposedly more accurate for far less ops. Less ops than a CNN.
>
> Going on what Arm say... https://github.com/ARM-software/ML-KWS-for-MCU

I took a look some time ago without going further; I will investigate it more thoroughly. Model scale is not the same though: 5k parameters (for the simple GRU) versus 229k (for the CRNN) (https://arxiv.org/pdf/1703.05390.pdf). It's not especially a problem in the end.
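For a rough sense of that scale gap, the parameter count of a single GRU layer can be estimated from its three gates, each with an input-to-hidden matrix, a hidden-to-hidden matrix and a bias (bias handling varies by implementation). A back-of-the-envelope sketch with hypothetical feature and hidden sizes, not the exact HMG or paper configurations:

```python
def gru_params(input_size: int, hidden_size: int) -> int:
    """Rough parameter count of one GRU layer: 3 gates, each with
    an input-to-hidden matrix, a hidden-to-hidden matrix and a bias."""
    return 3 * (hidden_size * input_size
                + hidden_size * hidden_size
                + hidden_size)

# Hypothetical sizes: 10 MFCC features per frame, 32 hidden units.
print(gru_params(10, 32))  # a few thousand params, the "5k" ballpark
```

The hidden-to-hidden term grows quadratically, which is why a small recurrent KWS model stays in the low thousands of parameters while conv front-ends push a CRNN into the hundreds of thousands.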

> Tensorflow.io seem to be a lib for Mfcc and file operations. But you can do like my hacky attempts with 2.4 https://github.com/StuartIanNaylor/simple_audio_tensorflow/blob/main/simple_audio_mfcc_frame_length256_frame_step128.py You can drop sonopy...

How is it performance-wise? I did some tests a year or so back, and I remember it was far behind numpy/scipy-based implementations. As for sonopy, it will not be in 1.0, as it lacks MFCC steps and has implementation errors.

StuartIanNaylor commented 3 years ago

Yeah, I did the sonopy performance tests, then changed some of the parameters and reran them. Something is not right with sonopy; I suspect it is currently only highly performant because it may be skipping bins or something, as when you change from the default parameters the speed increase over other libs is huge and dubious. Not sure what or why.

I am really interested in distributed wide-array microphones, where KWS broadcast-till-silence is really important to lower network traffic. All beamformers suffer badly when passing through a predominant noise field, and rather than a single microphone packed with DSP, simple distributed microphones can be more effective, as near voice / far noise can be assured by the placement of a distributed array of simple unidirectional electrets. To keep things cost-effective that means low-end ESP32s/microcontrollers up to the likes of a Pi Zero/RK3308, so what was said about latency and ops with a CRNN sounded beneficial.

Apologies for my coding, but with TensorFlow 2.4 I just added the MFCC conversion to the spectrogram function, and it's so performant I am wondering if the code is wrong. After 20 years of MS I don't code any more, but after waiting a while, it being the new year, I decided maybe I should explore something else, so hearing HMG will be released soon is a relief for my Python programming ability. Also, after testing PyTorch's torchaudio and finding it is built with Intel-MKL-only optimisation, I wondered and hoped this was not widespread.

I think you will have to take a look, as compared to the last working version of HMG the above is much faster.

The softmax scalar score is massively important to me, as in a distributed array that score is an indication not only of a KW hit but also of the current best mic in the array.
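As a toy illustration of that idea (the node names, message format and threshold here are invented, not from HMG): each node reports its keyword softmax score, and a hub picks the highest-scoring mic as the active one for the rest of the utterance.

```python
# Hypothetical reports from distributed KWS nodes: (mic_id, kw_softmax_score).
reports = [
    ("kitchen", 0.62),
    ("hallway", 0.91),
    ("lounge", 0.35),
]

THRESHOLD = 0.5  # illustrative keyword-hit threshold


def best_mic(reports, threshold):
    """Return the highest-scoring mic id if any node registered a KW hit,
    else None (no node heard the keyword confidently enough)."""
    hits = [r for r in reports if r[1] >= threshold]
    if not hits:
        return None
    return max(hits, key=lambda r: r[1])[0]


print(best_mic(reports, THRESHOLD))  # the hallway node has the strongest hit
```

The same scalar thus does double duty: thresholding gives the KW decision, and the arg-max across nodes gives the mic selection.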

PS Am I being dumb, or is it weird that they used mel spectrograms rather than MFCCs, as an MFCC is just a compressed, hence smaller, image? I noticed with spectrograms (not mel) vs MFCCs that, as well as the compression, MFCCs also seem to give a 1-2% accuracy increase by use alone over spectrograms.

```python
import tensorflow as tf


def get_mfcc(waveform, sample_rate=16000):
  # STFT of the raw audio (frame sizes as in the linked script).
  spectrogram = tf.signal.stft(waveform, frame_length=256, frame_step=128)
  spectrogram = tf.abs(spectrogram)

  # Warp the linear scale spectrograms into the mel-scale.
  num_spectrogram_bins = spectrogram.shape[-1]
  lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
  linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
      num_mel_bins, num_spectrogram_bins, sample_rate,
      lower_edge_hertz, upper_edge_hertz)
  mel_spectrogram = tf.tensordot(spectrogram, linear_to_mel_weight_matrix, 1)
  mel_spectrogram.set_shape(spectrogram.shape[:-1].concatenate(
      linear_to_mel_weight_matrix.shape[-1:]))

  # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
  log_mel_spectrogram = tf.math.log(mel_spectrogram + 1e-6)

  # Compute MFCCs from log_mel_spectrograms and take the first 13.
  mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrogram)[..., :13]

  return mfccs
```

I never did check out the difference between MFCCs and mel spectrograms; maybe I should.
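For what it's worth, the difference is just one extra step: MFCCs are a DCT-II of the log-mel spectrogram with only the first few coefficients kept, which is where the compression comes from. A minimal numpy sketch, with random data standing in for a real log-mel spectrogram:

```python
import numpy as np


def dct2(x, n_out):
    """Orthonormal DCT-II along the last axis, keeping n_out coefficients
    (the kind of DCT step used to turn a log-mel spectrogram into MFCCs)."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]                          # (n_out, 1)
    basis = np.cos(np.pi * (np.arange(n) + 0.5) * k / n)   # (n_out, n)
    scale = np.sqrt(2.0 / n) * np.where(k == 0, np.sqrt(0.5), 1.0)
    return x @ (scale * basis).T


log_mel = np.random.randn(49, 80)       # 49 frames x 80 mel bins
mfcc = dct2(log_mel, 13)                # keep the first 13 coefficients
print(log_mel.shape, "->", mfcc.shape)  # (49, 80) -> (49, 13)
```

So per frame the model sees 13 values instead of 80, and the DCT decorrelates the mel bins, which may be part of the small accuracy bump.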

The row `32 (20,5) (8,2) 2 32 GRU 64 229k 2.85 3.79` looks good to me, as there seems to be a reduction in ops and also an accuracy increase; even though a DS-CNN is more accurate, maybe it's less interesting due to ops?

Both those two methods are interesting enough for Arm (https://arxiv.org/pdf/1711.07128.pdf), but always that terrible Google command set dataset is used, and it is full of bad samples (almost 10%) in terms of trims, padding and null audio. Using HMG and weeding out the bad samples resulted in a huge accuracy improvement with the old HMG model; it's why I became a massive fan of HMG, as without an immediate GUI I would never have realised or explored the sheer number of bad samples in that dataset.

It is a really bad dataset anyway, as it contains far too many samples for each label and not enough labels; it doesn't even comprise a simple phonetic pangram such as http://clagnut.com/blog/2380/

I have been wondering: given the lack of command-set datasets and the large number of ASR sentence datasets (thanks to their transcripts), are you guys planning any tools to grab words from ASR datasets to make KWS datasets?

StuartIanNaylor commented 3 years ago

PS I did the same again, pruning the Google command set, and am wondering if this could be an addition to HMG as a separate window?

What I am doing is loading all the directories as their own labels, then running inference on each label's wavs against that label and pruning all that have a low softmax score (I deleted everything < 0.4). Probably a bit aggressive, as I should start lower, but I'm just wondering about the accuracy effects, as it's only the complete dross you need to remove. It's a long-winded process, but at least much easier than doing it manually via failures.

By doing it, even the poor CNN I have been playing with gained about 5% accuracy without too much sample loss. My sample count is down to 7767, but I did hit 92% a couple of times (I should have checked what I started with).
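The pruning pass described above could be sketched like this; `predict_label_score` is a stand-in for running the real model, and the file names, labels and 0.4 threshold are illustrative:

```python
def prune_dataset(samples, predict_label_score, threshold=0.4):
    """samples: list of (wav_path, label) pairs, one label per directory.
    predict_label_score(path, label) returns the model's softmax score
    for that label on that file. Returns (kept, pruned) path lists;
    actually deleting the pruned files is left to the caller."""
    kept, pruned = [], []
    for path, label in samples:
        score = predict_label_score(path, label)
        (kept if score >= threshold else pruned).append(path)
    return kept, pruned


# Toy stand-in for inference: pretend the scores are precomputed.
fake_scores = {"good.wav": 0.93, "noisy.wav": 0.12}
kept, pruned = prune_dataset(
    [("good.wav", "yes"), ("noisy.wav", "yes")],
    lambda path, label: fake_scores[path],
)
print(kept, pruned)  # ['good.wav'] ['noisy.wav']
```

Starting with a lower threshold and raising it while watching held-out accuracy would make the "a bit aggressive" worry easy to check.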

StuartIanNaylor commented 3 years ago

https://github.com/StuartIanNaylor/crispy-succotash

It's a very rough ASR word extractor, purely as an example.
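A word extractor of that kind boils down to slicing utterance audio at word boundaries from an alignment. A toy sketch of the slicing step only; the alignment format here is invented, and real ASR datasets would need a forced aligner to produce the word timings:

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate


def extract_word_clips(audio, alignment, sample_rate=SAMPLE_RATE):
    """audio: 1-D sample array for one utterance.
    alignment: list of (word, start_sec, end_sec) tuples from an aligner.
    Returns {word: clip} suitable for building a KWS-style dataset
    (repeated words would overwrite each other in this toy version)."""
    clips = {}
    for word, start, end in alignment:
        clips[word] = audio[int(start * sample_rate):int(end * sample_rate)]
    return clips


# Toy utterance: 2 s of silence standing in for real speech.
audio = np.zeros(2 * SAMPLE_RATE)
alignment = [("hey", 0.2, 0.6), ("linto", 0.6, 1.1)]
clips = extract_word_clips(audio, alignment)
print({w: len(c) for w, c in clips.items()})
```

In practice the hard part is the alignment quality, plus the trimming and padding issues already seen in the Google command set, so clips would still want the same softmax-based pruning pass afterwards.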