snakers4 commented 3 years ago

Just a handy issue to be notified of the latest changes and micro-releases (we will mostly be changing the models)

snakers4 commented 3 years ago

Initial models, examples, utils for VAD only uploaded (no number detector or language classifier yet)

snakers4 commented 3 years ago

First readable public release

snakers4 commented 3 years ago

Added VAD latency and throughput metrics

snakers4 commented 3 years ago

Updated VAD quality. Before / after (precision / recall)

adamnsandle commented 3 years ago

Added < 250ms compatibility

Sontref commented 3 years ago

Added number detector

snakers4 commented 3 years ago

Language detector example, readme update + FAQ

snakers4 commented 3 years ago

Audiotok benchmarks added. Looks like all energy-based solutions are kind of similar.

snakers4 commented 3 years ago

Added a utility to tune the VAD params properly for a domain

snakers4 commented 3 years ago

Some final benchmarks are posted here - https://github.com/pyannote/pyannote-audio/issues/604#issue-798003383
Probably we are done with benchmarks for now

snakers4 commented 3 years ago

Added micro (10k params, 100x smaller) VAD models

snakers4 commented 3 years ago

Added micro (10k params, 100x smaller) VAD models for 8 kHz audio

snakers4 commented 3 years ago

Added mini (100k params) VAD models for 8 kHz and 16 kHz
Added adaptive VAD iterator

https://github.com/snakers4/silero-vad/pull/54

snakers4 commented 3 years ago

Fixed examples and notebooks
Updated README
Added adaptive examples

snakers4 commented 3 years ago

Added a language classifier for 116 languages
It classifies audios into languages and mutually intelligible language groups (e.g. Serbian + Bosnian + Croatian, Russian + Ukrainian + others, Hindi + Urdu, etc), see the full list here and here
Probably some artificial / unspoken languages will be excluded and a large model will be trained

snakers4 commented 3 years ago

Improved language classifier

95 languages (85% accuracy), 58 language groups (90% accuracy)
Mutually intelligible languages are united into language groups (i.e. Serbian + Croatian + Bosnian are very similar)
Trained on approx 20k hours of data (10k of which are for 5 most popular languages)
4.7M params

snakers4 commented 2 years ago

updated further reading section

snakers4 commented 2 years ago

New V3 Silero VAD is Already Here

Main changes

One VAD to rule them all! New model includes the functionality of the previous ones with improved quality and speed!
Flexible sampling rate, 8000 Hz and 16000 Hz are supported;
Flexible chunk size, minimum chunk size is just 30 milliseconds!
100k parameters;
GPU and batching are supported;
Radically simplified examples;

Migration

Please see the new examples.

New get_speech_timestamps is a simplified and unified version of the old deprecated get_speech_ts or get_speech_ts_adaptive methods.

speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
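
A rough sketch of post-processing the result. It assumes that get_speech_timestamps returns a list of dicts with start and end given in samples (an assumption, please check the examples in the repo). The helper names to_seconds and total_speech_seconds are hypothetical and not part of the package.

```python
def to_seconds(speech_timestamps, sampling_rate=16000):
    """Convert sample-based timestamps to seconds.

    Assumes each item looks like {'start': int, 'end': int} in samples.
    """
    return [
        {'start': ts['start'] / sampling_rate, 'end': ts['end'] / sampling_rate}
        for ts in speech_timestamps
    ]


def total_speech_seconds(speech_timestamps, sampling_rate=16000):
    """Sum the duration of all detected speech segments, in seconds."""
    return sum(ts['end'] - ts['start'] for ts in speech_timestamps) / sampling_rate


# Hand-written timestamps standing in for real model output
example = [{'start': 16000, 'end': 48000}, {'start': 64000, 'end': 72000}]
print(to_seconds(example))
print(total_speech_seconds(example))
```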

New VADIterator class serves as an example for streaming tasks instead of old deprecated VADiterator and VADiteratorAdaptive.

vad_iterator = VADIterator(model)
window_size_samples = 1536

for i in range(0, len(wav), window_size_samples):
    speech_dict = vad_iterator(wav[i:i + window_size_samples], return_seconds=True)
    if speech_dict:
        print(speech_dict, end=' ')
vad_iterator.reset_states()
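
To illustrate the streaming idea without the model, here is a toy sketch that turns per-window speech probabilities into start / end events. It is not the VADIterator implementation: the function name speech_events, the single fixed threshold and the event format are assumptions made for the example, and the real class may apply extra logic (e.g. minimum silence duration or padding).

```python
def speech_events(probs, window_size_samples=1536, sampling_rate=16000, threshold=0.5):
    """Yield {'start': t} / {'end': t} dicts (t in seconds) when the
    per-window speech probability crosses the threshold.

    Toy logic only: one threshold, no smoothing, no minimum durations.
    """
    triggered = False
    for i, prob in enumerate(probs):
        # Time at which the i-th window begins
        t = round(i * window_size_samples / sampling_rate, 3)
        if prob >= threshold and not triggered:
            triggered = True
            yield {'start': t}
        elif prob < threshold and triggered:
            triggered = False
            yield {'end': t}


# Made-up probabilities standing in for model outputs on consecutive windows
for event in speech_events([0.1, 0.9, 0.8, 0.2]):
    print(event, end=' ')
```

With window_size_samples = 1536 at 16 kHz each window covers 96 ms, so the event times are multiples of 0.096 s.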

snakers4 commented 2 years ago

Even Better V3 Silero VAD

Models with even higher quality (just see the plots with metrics!);
New model ~ large model >> all previous (even large) models;
Now the model works properly quality-wise, i.e. 100ms > 60ms > 30ms and 16 kHz > 8 kHz;

snakers4 commented 2 years ago

This summarises new progress well

snakers4 commented 2 years ago

New V3 ONNX VAD Released

We finally were able to port a model to ONNX:

Compact model (~100k params);
Both PyTorch and ONNX models are not quantized;
Same quality model as the latest best PyTorch release;
Only 16 kHz is available for now (ONNX has some issues, with cryptic errors, around if-statements and / or tracing vs scripting);
In our tests, on short audios (chunks) ONNX is 2-3x faster than PyTorch (the gap narrows with larger batches or long audios);
Audio examples and non-core models moved out of the repo to save space;

snakers4 commented 2 years ago

Support For Sampling Rates Higher Than 16 kHz

jit model now can handle 8, 16, 32 and 48 kHz directly (change implemented within the model itself);
onnx model as well, but only via external wrappers;
This support is mostly a hack, i.e. we just use each n-th sample for higher sampling rates (instead of averaging);
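
The "each n-th sample" approach can be sketched as below. This is only an illustration of the idea described above, not the code used in the model or the wrappers; the function name take_every_nth is hypothetical. Note that dropping samples without low-pass filtering first can introduce aliasing, which is why it is a hack.

```python
def take_every_nth(samples, sampling_rate, target_rate=16000):
    """Naively downsample by keeping every n-th sample.

    Only works when sampling_rate is an integer multiple of target_rate.
    No low-pass filtering or averaging is applied.
    """
    if sampling_rate % target_rate != 0:
        raise ValueError("sampling_rate must be a multiple of target_rate")
    step = sampling_rate // target_rate
    return samples[::step]


# 48 kHz -> 16 kHz keeps every 3rd sample
print(take_every_nth(list(range(12)), 48000))
```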

snakers4 commented 2 years ago

⚠️ Important Information for VAD Python Users ⚠️

If you are using the VAD in a:

multi-threaded or
a multi-process application

Do not forget to disable gradients in EACH process and / or thread. Otherwise memory may leak noticeably.
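
A minimal sketch of what this can look like with threads. torch.set_grad_enabled(False) is a regular PyTorch call, and grad mode is thread-local, which is why it has to be repeated in every thread. The names disable_gradients and run_worker, and the infer callable, are hypothetical and only for illustration. The same setup function could be called at the start of each worker process.

```python
import threading


def disable_gradients():
    """Turn autograd off for the calling thread only."""
    import torch  # imported lazily so this module loads without torch installed
    torch.set_grad_enabled(False)


def run_worker(chunks, infer, results, setup=disable_gradients):
    """Worker body: run the per-thread setup first, then process the chunks.

    `infer` stands in for a call to the VAD model (hypothetical).
    """
    setup()
    for chunk in chunks:
        results.append(infer(chunk))


def start_workers(batches, infer, setup=disable_gradients):
    """Start one thread per batch; every thread calls `setup` on its own."""
    outputs = [[] for _ in batches]
    threads = [
        threading.Thread(target=run_worker, args=(batch, infer, out, setup))
        for batch, out in zip(batches, outputs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs
```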

adamnsandle commented 1 year ago

New V4 VAD Released

Changes:

Improved quality
Improved performance
Both 8k and 16k sampling rates are now supported by the ONNX model
Batching is now supported by the ONNX model
Added audio_forward method for one-line processing of a single audio or multiple audios without postprocessing

snakers4 commented 1 year ago

It is worth posting this chart:

snakers4 commented 1 year ago

Remove picovoice mentions

snakers4 commented 1 year ago

Deprecate language classifier and number detector models, since they are not maintained anymore.

snakers4 commented 1 month ago

Finally, V5 is here, 3x faster, supporting 6000+ languages!

Performance and Model Size

3x faster inference for TorchScript, 10% faster inference for ONNX;
Now TorchScript is as fast as ONNX;
Model size is 2x larger, 2MB vs. 1MB;

Quality

The VAD supports more than 6,000 languages now;
Significantly more robust on noisy data;
Overall 5-7% quality increase on clean data;
Quality difference for 8 kHz and 16 kHz is negligible now;
Quality difference for different window sizes is negligible => window size was deprecated;
Added benchmarks on 9 unique datasets (2 private) and one holistic multi-domain dataset;

Changes and deprecations

ONNX opset 16;
window_size_samples is deprecated - now the VAD only works with a fixed-size window;
VAD now works with 8 kHz and 16 kHz sample rates, only with fixed 256 and 512 sample windows respectively;
Slightly changed internal logic, now some context (part of previous chunk) is passed along with the current chunk;
Sample rates that are a multiple of 16 kHz are still supported;
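
A rough sketch of feeding audio in the fixed windows listed above (256 samples at 8 kHz, 512 samples at 16 kHz, i.e. 32 ms in both cases). The helper name fixed_windows is hypothetical, and zero-padding the last incomplete window is an assumption of this example, not necessarily what the package does.

```python
# Fixed window sizes from the list above, keyed by sampling rate
WINDOW_SIZES = {8000: 256, 16000: 512}


def fixed_windows(samples, sampling_rate):
    """Yield consecutive fixed-size windows for a supported sampling rate.

    The last window is zero-padded to full length (an assumption made here).
    """
    if sampling_rate not in WINDOW_SIZES:
        raise ValueError("only 8000 and 16000 Hz are handled in this sketch")
    size = WINDOW_SIZES[sampling_rate]
    for start in range(0, len(samples), size):
        chunk = list(samples[start:start + size])
        if len(chunk) < size:
            chunk.extend([0.0] * (size - len(chunk)))
        yield chunk


# 1000 samples at 16 kHz -> two windows of 512 samples, the second one padded
chunks = list(fixed_windows([1.0] * 1000, 16000))
print(len(chunks), [len(c) for c in chunks])
```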

snakers4 commented 1 month ago

V5.1 - Experimental PyPI Package Release

Experimental pip-package release;
Community PRs to update the examples;

What's Changed

Adamnsandle by @adamnsandle in https://github.com/snakers4/silero-vad/pull/481
Update microphone_and_webRTC_integration.py by @eltociear in https://github.com/snakers4/silero-vad/pull/475
cpp example by @filtercodes in https://github.com/snakers4/silero-vad/pull/482
Update Golang example to support model v5 by @streamer45 in https://github.com/snakers4/silero-vad/pull/489
Create python-publish.yml by @adamnsandle in https://github.com/snakers4/silero-vad/pull/492
Adamnsandle by @adamnsandle in https://github.com/snakers4/silero-vad/pull/493

New Contributors

@eltociear made their first contribution in https://github.com/snakers4/silero-vad/pull/475
@filtercodes made their first contribution in https://github.com/snakers4/silero-vad/pull/482

Full Changelog: https://github.com/snakers4/silero-vad/compare/v5.0...v5.1

snakers4 / silero-vad

Changelog - V5 just released! #2
