snakers4 / silero-models

Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple
Other
4.96k stars 312 forks source link
asr capitalization colab english german onnx pretrained-models pytorch repunctuation spanish speech speech-recognition speech-synthesis speech-to-text stt stt-benchmark text-to-speech torch-hub tts tts-models

Mailing list : test Mailing list : test License: CC BY-NC 4.0

Donations Backers Sponsors

Build and Deploy to PyPI PyPI version

header

Silero Models

Silero Models: pre-trained enterprise-grade STT / TTS models and benchmarks.

Enterprise-grade STT made refreshingly simple (seriously, see benchmarks). We provide quality comparable to Google's STT (and sometimes even better) and we are not Google.

As a bonus:

Also we have published TTS models that satisfy the following criteria:

Also we have published a model for text repunctuation and recapitalization that:

Installation and Basics

You can basically use our models in 3 flavours:

Models are downloaded on demand both by pip and PyTorch Hub. If you need caching, do it manually or via invoking a necessary model once (it will be downloaded to a cache folder). Please see these docs for more information.

PyTorch Hub and pip package are based on the same code. All of the torch.hub.load examples can be used with the pip package via this basic change:

# before
torch.hub.load(repo_or_dir='snakers4/silero-models',
               model='silero_stt',  # or silero_tts or silero_te
               **kwargs)

# after
from silero import silero_stt, silero_tts, silero_te
silero_stt(**kwargs)

Speech-To-Text

All of the provided models are listed in the models.yml file. Any metadata and newer versions will be added there.

Screenshot_1

Currently we provide the following checkpoints:

PyTorch ONNX Quantization Quality Colab
English (en_v6) :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: link Open In Colab
English (en_v5) :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: link Open In Colab
German (de_v4) :heavy_check_mark: :heavy_check_mark: :hourglass: link Open In Colab
English (en_v3) :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: link Open In Colab
German (de_v3) :heavy_check_mark: :hourglass: :hourglass: link Open In Colab
German (de_v1) :heavy_check_mark: :heavy_check_mark: :hourglass: link Open In Colab
Spanish (es_v1) :heavy_check_mark: :heavy_check_mark: :hourglass: link Open In Colab
Ukrainian (ua_v3) :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: N/A Open In Colab

Model flavours:

jit jit jit jit jit_q jit_q onnx onnx onnx onnx
xsmall small large xlarge xsmall small xsmall small large xlarge
English en_v6 :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
English en_v5 :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
English en_v4_0 :heavy_check_mark: :heavy_check_mark:
English en_v3 :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
German de_v4 :heavy_check_mark: :heavy_check_mark:
German de_v3 :heavy_check_mark:
German de_v1 :heavy_check_mark: :heavy_check_mark:
Spanish es_v1 :heavy_check_mark: :heavy_check_mark:
Ukrainian ua_v3 :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:

Dependencies

Please see the provided Colab for details for each example below. All examples are maintained to work with the latest major packaged versions of the installed libraries.

PyTorch

Open In Colab

Open on Torch Hub

import torch
import zipfile
import torchaudio
from glob import glob

device = torch.device('cpu')  # gpu also works, but our models are fast enough for CPU
model, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                       model='silero_stt',
                                       language='en', # also available 'de', 'es'
                                       device=device)
(read_batch, split_into_batches,
 read_audio, prepare_model_input) = utils  # see function signature for details

# download a single file in any format compatible with TorchAudio
torch.hub.download_url_to_file('https://opus-codec.org/static/examples/samples/speech_orig.wav',
                               dst ='speech_orig.wav', progress=True)
test_files = glob('speech_orig.wav')
batches = split_into_batches(test_files, batch_size=10)
input = prepare_model_input(read_batch(batches[0]),
                            device=device)

output = model(input)
for example in output:
    print(decoder(example.cpu()))

ONNX

Open In Colab

Our model will run anywhere that can import the ONNX model or that supports the ONNX runtime.

import onnx
import torch
import onnxruntime
from omegaconf import OmegaConf

language = 'en' # also available 'de', 'es'

# load provided utils
_, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_stt', language=language)
(read_batch, split_into_batches,
 read_audio, prepare_model_input) = utils

# see available models
torch.hub.download_url_to_file('https://raw.githubusercontent.com/snakers4/silero-models/master/models.yml', 'models.yml')
models = OmegaConf.load('models.yml')
available_languages = list(models.stt_models.keys())
assert language in available_languages

# load the actual ONNX model
torch.hub.download_url_to_file(models.stt_models.en.latest.onnx, 'model.onnx', progress=True)
onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)
ort_session = onnxruntime.InferenceSession('model.onnx')

# download a single file in any format compatible with TorchAudio
torch.hub.download_url_to_file('https://opus-codec.org/static/examples/samples/speech_orig.wav', dst ='speech_orig.wav', progress=True)
test_files = ['speech_orig.wav']
batches = split_into_batches(test_files, batch_size=10)
input = prepare_model_input(read_batch(batches[0]))

# actual ONNX inference and decoding
onnx_input = input.detach().cpu().numpy()
ort_inputs = {'input': onnx_input}
ort_outs = ort_session.run(None, ort_inputs)
decoded = decoder(torch.Tensor(ort_outs[0])[0])
print(decoded)

TensorFlow

Open In Colab

SavedModel example

import os
import torch
import subprocess
import tensorflow as tf
import tensorflow_hub as tf_hub
from omegaconf import OmegaConf

language = 'en' # also available 'de', 'es'

# load provided utils using torch.hub for brevity
_, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_stt', language=language)
(read_batch, split_into_batches,
 read_audio, prepare_model_input) = utils

# see available models
torch.hub.download_url_to_file('https://raw.githubusercontent.com/snakers4/silero-models/master/models.yml', 'models.yml')
models = OmegaConf.load('models.yml')
available_languages = list(models.stt_models.keys())
assert language in available_languages

# load the actual tf model
torch.hub.download_url_to_file(models.stt_models.en.latest.tf, 'tf_model.tar.gz')
subprocess.run('rm -rf tf_model && mkdir tf_model && tar xzfv tf_model.tar.gz -C tf_model',  shell=True, check=True)
tf_model = tf.saved_model.load('tf_model')

# download a single file in any format compatible with TorchAudio
torch.hub.download_url_to_file('https://opus-codec.org/static/examples/samples/speech_orig.wav', dst ='speech_orig.wav', progress=True)
test_files = ['speech_orig.wav']
batches = split_into_batches(test_files, batch_size=10)
input = prepare_model_input(read_batch(batches[0]))

# tf inference
res = tf_model.signatures["serving_default"](tf.constant(input.numpy()))['output_0']
print(decoder(torch.Tensor(res.numpy())[0]))

Text-To-Speech

Models and Speakers

All of the provided models are listed in the models.yml file. Any metadata and newer versions will be added there.

V4

V4 models support SSML. Also see Colab examples for main SSML tag usage.

ID Speakers Auto-stress Language SR Colab
v4_ru aidar, baya, kseniya, xenia, eugene, random yes ru (Russian) 8000, 24000, 48000 Open In Colab
v4_cyrillic b_ava, marat_tt, kalmyk_erdni... no cyrillic (Avar, Tatar, Kalmyk, ...) 8000, 24000, 48000 Open In Colab
v4_ua mykyta, random no ua (Ukrainian) 8000, 24000, 48000 Open In Colab
v4_uz dilnavoz no uz (Uzbek) 8000, 24000, 48000 Open In Colab
v4_indic hindi_male, hindi_female, ..., random no indic (Hindi, Telugu, ...) 8000, 24000, 48000 Open In Colab

V3

V3 models support SSML. Also see Colab examples for main SSML tag usage.

ID Speakers Auto-stress Language SR Colab
v3_en en_0, en_1, ..., en_117, random no en (English) 8000, 24000, 48000 Open In Colab
v3_en_indic tamil_female, ..., assamese_male, random no en (English) 8000, 24000, 48000 Open In Colab
v3_de eva_k, ..., karlsson, random no de (German) 8000, 24000, 48000 Open In Colab
v3_es es_0, es_1, es_2, random no es (Spanish) 8000, 24000, 48000 Open In Colab
v3_fr fr_0, ..., fr_5, random no fr (French) 8000, 24000, 48000 Open In Colab
v3_indic hindi_male, hindi_female, ..., random no indic (Hindi, Telugu, ...) 8000, 24000, 48000 Open In Colab

Dependencies

Basic dependencies for Colab examples:

PyTorch

Open In Colab

Open on Torch Hub

# V4
import torch

language = 'ru'
model_id = 'v4_ru'
sample_rate = 48000
speaker = 'xenia'
device = torch.device('cpu')

model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                     model='silero_tts',
                                     language=language,
                                     speaker=model_id)
model.to(device)  # gpu or cpu

audio = model.apply_tts(text=example_text,
                        speaker=speaker,
                        sample_rate=sample_rate)

Standalone Use

# V4
import os
import torch

device = torch.device('cpu')
torch.set_num_threads(4)
local_file = 'model.pt'

if not os.path.isfile(local_file):
    torch.hub.download_url_to_file('https://models.silero.ai/models/tts/ru/v4_ru.pt',
                                   local_file)  

model = torch.package.PackageImporter(local_file).load_pickle("tts_models", "model")
model.to(device)

example_text = 'В недрах тундры выдры в г+етрах т+ырят в вёдра ядра кедров.'
sample_rate = 48000
speaker='baya'

audio_paths = model.save_wav(text=example_text,
                             speaker=speaker,
                             sample_rate=sample_rate)

SSML

Check out our TTS Wiki page.

Cyrillic languages

Supported tokenset: !,-.:?iµöабвгдежзийклмнопрстуфхцчшщъыьэюяёђѓєіјњћќўѳғҕҗҙқҡңҥҫүұҳҷһӏӑӓӕӗәӝӟӥӧөӱӳӵӹ

Speaker_ID Language Gender
b_ava Avar F
b_bashkir Bashkir M
b_bulb Bulgarian M
b_bulc Bulgarian M
b_che Chechen M
b_cv Chuvash M
cv_ekaterina Chuvash F
b_myv Erzya M
b_kalmyk Kalmyk M
b_krc Karachay-Balkar M
kz_M1 Kazakh M
kz_M2 Kazakh M
kz_F3 Kazakh F
kz_F1 Kazakh F
kz_F2 Kazakh F
b_kjh Khakas F
b_kpv Komi-Ziryan M
b_lez Lezghian M
b_mhr Mari F
b_mrj Mari High M
b_nog Nogai F
b_oss Ossetic M
b_ru Russian M
b_tat Tatar M
marat_tt Tatar M
b_tyv Tuvinian M
b_udm Udmurt M
b_uzb Uzbek M
b_sah Yakut M
kalmyk_erdni Kalmyk M
kalmyk_delghir Kalmyk F

Indic languages

Example

(!!!) All input sentences should be romanized to ISO format using aksharamukha. An example for hindi:

# V3
import torch
from aksharamukha import transliterate

# Loading model
model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                     model='silero_tts',
                                     language='indic',
                                     speaker='v4_indic')

orig_text = "प्रसिद्द कबीर अध्येता, पुरुषोत्तम अग्रवाल का यह शोध आलेख, उस रामानंद की खोज करता है"
roman_text = transliterate.process('Devanagari', 'ISO', orig_text)
print(roman_text)

audio = model.apply_tts(roman_text,
                        speaker='hindi_male')

Supported languages

Language Speakers Romanization function
hindi hindi_female, hindi_male transliterate.process('Devanagari', 'ISO', orig_text)
malayalam malayalam_female, malayalam_male transliterate.process('Malayalam', 'ISO', orig_text)
manipuri manipuri_female transliterate.process('Bengali', 'ISO', orig_text)
bengali bengali_female, bengali_male transliterate.process('Bengali', 'ISO', orig_text)
rajasthani rajasthani_female, rajasthani_female transliterate.process('Devanagari', 'ISO', orig_text)
tamil tamil_female, tamil_male transliterate.process('Tamil', 'ISO', orig_text, pre_options=['TamilTranscribe'])
telugu telugu_female, telugu_male transliterate.process('Telugu', 'ISO', orig_text)
gujarati gujarati_female, gujarati_male transliterate.process('Gujarati', 'ISO', orig_text)
kannada kannada_female, kannada_male transliterate.process('Kannada', 'ISO', orig_text)

Text-Enhancement

Languages Quantization Quality Colab
'en', 'de', 'ru', 'es' :heavy_check_mark: link Open In Colab

Dependencies

Basic dependencies for Colab examples:

Standalone Use

import torch

model, example_texts, languages, punct, apply_te = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                                                  model='silero_te')

input_text = input('Enter input text\n')
apply_te(input_text, lan='en')

Denoise

Denoise models attempt to reduce background noise along with various artefacts such as reverb, clipping, high/lowpass filters etc., while trying to preserve and/or enhance speech. They also attempt to enhance audio quality and increase sampling rate of the input up to 48kHz.

Models

All of the provided models are listed in the models.yml file.

Model JIT Real Input SR Input SR Output SR Colab
small_slow :heavy_check_mark: 8000, 16000, 24000, 44100, 48000 24000 48000 Open In Colab
large_fast :heavy_check_mark: 8000, 16000, 24000, 44100, 48000 24000 48000 Open In Colab
small_fast :heavy_check_mark: 8000, 16000, 24000, 44100, 48000 24000 48000 Open In Colab

Dependencies

Basic dependencies for Colab examples:

PyTorch

Open In Colab


import torch

name = 'small_slow'
device = torch.device('cpu')
model, samples, utils = torch.hub.load(
  repo_or_dir='snakers4/silero-models',
  model='silero_denoise',
  name=name,
  device=device)
(read_audio, save_audio, denoise) = utils

i = 0
torch.hub.download_url_to_file(
  samples[i],
  dst=f'sample{i}.wav',
  progress=True
)
audio_path = f'sample{i}.wav'
audio = read_audio(audio_path).to(device)
output = model(audio)
save_audio(f'result{i}.wav', output.squeeze(1).cpu())

i = 1
torch.hub.download_url_to_file(
  samples[i],
  dst=f'sample{i}.wav',
  progress=True
)
output, sr = denoise(model, f'sample{i}.wav', f'result{i}.wav', device='cpu')

Standalone Use

import os
import torch

device = torch.device('cpu')
torch.set_num_threads(4)
local_file = 'model.pt'

if not os.path.isfile(local_file):
    torch.hub.download_url_to_file('https://models.silero.ai/denoise_models/sns_latest.jit',
                                   local_file)  

model = torch.jit.load(local_file)
torch._C._jit_set_profiling_mode(False) 
torch.set_grad_enabled(False)
model.to(device)

a = torch.rand((1, 48000))
a = a.to(device)
out = model(a)

FAQ

Wiki

Also check out our wiki.

Performance and Quality

Please refer to these wiki sections:

Adding new Languages

Please refer here.

Contact

Get in Touch

Try our models, create an issue, join our chat, email us, and read the latest news.

Commercial Inquiries

Please refer to our wiki and the Licensing and Tiers page for relevant information, and email us.

Citations

@misc{Silero Models,
  author = {Silero Team},
  title = {Silero Models: pre-trained enterprise-grade STT / TTS models and benchmarks},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-models}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}

Further reading

English

Chinese

Russian

Donations

Please use the "sponsor" button.