salute-developers / GigaAM

Foundational Model for Speech Recognition Tasks

GigaAM: the family of open-source acoustic models for speech processing



GigaAM

GigaAM (Giga Acoustic Model) is a Conformer-based wav2vec2-style foundational model with around 240M parameters. We trained GigaAM on nearly 50 thousand hours of diverse Russian speech audio.

Resources:

GigaAM for Speech Recognition

We fine-tuned the GigaAM encoder for Speech Recognition with two different decoders:

- a CTC decoder (GigaAM-CTC)
- an RNN-T decoder (GigaAM-RNNT)

Both models were trained using the NeMo toolkit on publicly available labeled Russian data:

| dataset | size, hours | weight |
|---|---|---|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |
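One way to read the per-dataset weights is as sampling probabilities that oversample the smaller corpora relative to their size. This interpretation is an assumption — the README does not specify how the weights are applied during training — but a minimal sketch of the arithmetic looks like this:

```python
# Sketch: interpret the per-dataset training weights as sampling
# probabilities. This is an assumption; the README does not state
# how the weights are applied during training.

datasets = {
    "Golos":                {"hours": 1227, "weight": 0.6},
    "SOVA":                 {"hours": 369,  "weight": 0.2},
    "Russian Common Voice": {"hours": 207,  "weight": 0.1},
    "Russian LibriSpeech":  {"hours": 93,   "weight": 0.1},
}

total_hours = sum(d["hours"] for d in datasets.values())
total_weight = sum(d["weight"] for d in datasets.values())

for name, d in datasets.items():
    prob = d["weight"] / total_weight    # normalized sampling probability
    natural = d["hours"] / total_hours   # share if sampled by size alone
    print(f"{name}: p={prob:.2f}, oversampling factor={prob / natural:.2f}")
```

Under this reading, Russian LibriSpeech (93 hours, weight 0.1) is sampled roughly twice as often as its share of the total hours would suggest.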

Resources:

The following table summarizes the performance of different models in terms of Word Error Rate on open Russian datasets:

| model | parameters | Golos Crowd | Golos Farfield | OpenSTT Youtube | OpenSTT Phone calls | OpenSTT Audiobooks | Mozilla Common Voice | Russian LibriSpeech |
|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 | 1.5B | 17.4 | 14.5 | 21.1 | 31.2 | 17.0 | 5.3 | 9.0 |
| NVIDIA Ru-FastConformer-RNNT | 115M | 2.6 | 6.6 | 23.8 | 32.9 | 16.4 | 2.7 | 11.6 |
| GigaAM-CTC | 242M | 3.1 | 5.7 | 18.4 | 25.6 | 15.1 | 1.7 | 8.1 |
| GigaAM-RNNT | 243M | 2.3 | 4.4 | 16.7 | 22.9 | 13.9 | 0.9 | 7.4 |
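Word Error Rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A self-contained sketch of the metric (not the evaluation code used for the table above, which typically comes from a toolkit such as NeMo):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming over the edit-distance matrix, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

For example, a single dropped word out of a three-word reference gives `wer("привет как дела", "привет дела") == 1/3`, i.e. roughly the 0.33 range the worst table entries correspond to when read as percentages.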

GigaAM-Emo

GigaAM-Emo is an acoustic model for Emotion Recognition. We fine-tuned the GigaAM encoder on the Dusha dataset.

Resources:

The following table summarizes the performance of different models on the Dusha dataset:

| model | Crowd Unweighted Accuracy | Crowd Weighted Accuracy | Crowd Macro F1-score | Podcast Unweighted Accuracy | Podcast Weighted Accuracy | Podcast Macro F1-score |
|---|---|---|---|---|---|---|
| Dusha baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |
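The three metrics in the table can all be computed from predicted and true emotion labels. Naming conventions vary across papers and the README does not define them, so as an assumption we take one common convention: plain sample-level accuracy, macro-averaged (class-balanced) recall, and macro-averaged F1. A pure-Python sketch:

```python
def emotion_metrics(y_true, y_pred):
    """Return (plain accuracy, macro-averaged recall, macro F1) for
    multi-class labels. The mapping of these quantities onto the
    'unweighted'/'weighted' accuracy columns is an assumption."""
    labels = sorted(set(y_true))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    recalls, f1s = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    return acc, sum(recalls) / len(recalls), sum(f1s) / len(f1s)
```

The gap between the two accuracy columns on the Podcast split reflects class imbalance: a model can score high on the plain-accuracy metric while doing poorly on rare emotion classes, which the class-balanced metric penalizes.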

Links