Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours. We have made the corpus freely available for download, along with an acoustic model trained on it. We also built a 3-gram KenLM language model using the open Common Crawl corpus.
Domain | Train files | Train hours | Test files | Test hours |
---|---|---|---|---|
Crowd | 979 796 | 1 095 | 9 994 | 11.2 |
Farfield | 124 003 | 132.4 | 1 916 | 1.4 |
Total | 1 103 799 | 1 227.4 | 11 910 | 12.6 |
Archive | Size | Link |
---|---|---|
golos_opus.tar | 20.5 GB | https://sc.link/JpD |
Manifest files with all the training transcription texts are in the train_crowd9.tar archive listed in the table:
Archives | Size | Links |
---|---|---|
train_farfield.tar | 15.4 GB | https://sc.link/1Z3 |
train_crowd0.tar | 11 GB | https://sc.link/Lrg |
train_crowd1.tar | 14 GB | https://sc.link/MvQ |
train_crowd2.tar | 13.2 GB | https://sc.link/NwL |
train_crowd3.tar | 11.6 GB | https://sc.link/Oxg |
train_crowd4.tar | 15.8 GB | https://sc.link/Pyz |
train_crowd5.tar | 13.1 GB | https://sc.link/Qz7 |
train_crowd6.tar | 15.7 GB | https://sc.link/RAL |
train_crowd7.tar | 12.7 GB | https://sc.link/VG5 |
train_crowd8.tar | 12.2 GB | https://sc.link/WJW |
train_crowd9.tar | 8.08 GB | https://sc.link/XKk |
test.tar | 1.3 GB | https://sc.link/Kqr |
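
Once an archive is unpacked, the manifest can be inspected with a few lines of Python. The sketch below is illustrative only: it assumes the manifest is a JSON Lines file (one JSON object per line) with a `text` field and a `duration` field in seconds, as is common for NeMo-style manifests. The file name `manifest.jsonl` and both field names are assumptions, so check them against the unpacked archive.

```python
import json
from typing import Iterable, Iterator


def read_manifest(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines manifest."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def total_hours(records: Iterable[dict], duration_key: str = "duration") -> float:
    """Sum per-utterance durations (assumed to be in seconds) and convert to hours."""
    return sum(float(r[duration_key]) for r in records) / 3600.0


if __name__ == "__main__":
    # Hypothetical path; adjust to wherever the archive was extracted.
    records = list(read_manifest("manifest.jsonl"))
    print(f"{len(records)} utterances, {total_hours(records):.1f} hours")
```

Comparing the printed totals with the hours in the statistics table is a quick way to confirm that a download is complete.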
Acoustic model built on the QuartzNet15x5 architecture and trained with the NeMo toolkit
Three n-gram language models created with the KenLM Language Model Toolkit
Archives | Size | Links |
---|---|---|
QuartzNet15x5_golos.nemo | 68 MB | https://sc.link/ZMv |
KenLMs.tar | 4.8 GB | https://sc.link/YL0 |
Golos data and models are also available in DataHub ML Space, a hub of pre-trained models, datasets, and containers. You can train the model and deploy it on the high-performance SberCloud infrastructure in ML Space, a full-cycle machine learning development platform for DS-team collaboration based on the Christofari supercomputer.
Word Error Rate (%) for different test sets
Decoder \ Test set | Crowd test | Farfield test | MCV1 dev | MCV1 test |
---|---|---|---|---|
Greedy decoder | 4.389 % | 14.949 % | 9.314 % | 11.278 % |
Beam Search with Common Crawl LM | 4.709 % | 12.503 % | 6.341 % | 7.976 % |
Beam Search with Golos train set LM | 3.548 % | 12.384 % | - | - |
Beam Search with Common Crawl and Golos LM | 3.318 % | 11.488 % | 6.4 % | 8.06 % |
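
For readers who want to compute comparable numbers on their own transcripts, here is a minimal sketch of Word Error Rate: the word-level edit distance (substitutions, deletions, insertions) summed over all utterances and divided by the total number of reference words. Whether the figures in the table were aggregated in exactly this way, and with what text normalization, is an assumption; the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level WER in percent, splitting on whitespace."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(wer(["a b c d"], ["a x c"]))  # 50.0
```

Summing errors over the whole test set before dividing weights long utterances more heavily than averaging per-utterance rates would, so the two approaches can give different numbers.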
1 MCV: Common Voice, Mozilla's initiative to help teach machines how real people speak.
[arxiv.org] Golos: Russian Dataset for Speech Research
[habr.com] How to improve Russian speech recognition to 3% WER using open data (in Russian)
Please create a GitHub issue!
Authors (in alphabetical order):