shenasa-ai / speech2text

A Deep-Learning-Based Persian Speech Recognition System
MIT License

Speech to Text 🚀

This repo builds an ASR system using existing toolkits plus our own implementations.

Toolkits used:

- Mozilla DeepSpeech
- TensorSpeech TensorFlowASR (DeepSpeech2)
- Facebook fairseq (wav2vec 2.0)

TIP: If you just want to use the scripts (text_cleaning / data_collecting / creating the final dataset CSV files), simply use requirements.txt to install the required dependencies.

pip3 install -r requirements.txt

Prerequisites 📋

You need to know about RNNs, the attention mechanism, CTC, a little pandas/NumPy, TensorFlow, Keras, and NLP topics (e.g. transformers, text cleaning). Knowing about spectrograms, MFCCs, and filter banks will also help you understand the audio preprocessing.
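
To get a concrete feel for those audio features, here is a minimal sketch using python_speech_features (one of the dependencies listed later in this README); the WAV path is a placeholder:

```python
# Minimal feature-extraction sketch; "sample.wav" is a placeholder for a
# 16 kHz mono WAV file.
import scipy.io.wavfile as wav
from python_speech_features import mfcc, logfbank

rate, signal = wav.read("sample.wav")
mfcc_feat = mfcc(signal, samplerate=rate)       # (num_frames, 13) MFCCs
fbank_feat = logfbank(signal, samplerate=rate)  # (num_frames, 26) log filter-bank energies
print(mfcc_feat.shape, fbank_feat.shape)
```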

If you don't know any of these topics, check the Wiki page; there are many links there.


Model Results 🎯

| Model | Dataset | Loss |
| --- | --- | --- |
| Our own implementation (first try) | common_voice_en (400 h) | 74 (very poor results; many models tested) |
| Deep Speech 1: Mozilla (first try) | common_voice + TV programs + radio programs (300 h total) | 28 |
| Deep Speech 1: Mozilla (second try) | common_voice + TV programs + radio programs (300 h total) | 25 |
| Deep Speech 1: Mozilla + transfer learning (third try) | common_voice + TV programs + radio programs (300 h total) | 24 |
| Deep Speech 1: Mozilla + transfer learning (third try) | common_voice + TV programs + radio programs (1000+ h total) | 22 |

Datasets we used 📁

There are many public datasets for English, but for Persian there are not enough free STT datasets, so we created our own data crawler to collect data.

The Common Voice dataset is a rich, free dataset.

How to use our script to collect data

This repo contains one folder for data collection:

crawler: this folder has one script that crawls a radio archive and collects the data we need. You can edit this crawler to download from other websites too. [For more info, check the README file in the crawler folder.]

Full Dataset 📁 ⚡🔥

Here at Hamtech Company, we decided to open-source our ASR dataset. The dataset is nearly 200 GB of audio plus CSV files containing the transcriptions (some parts ship a TXT file instead of a CSV). Each record has a column named Confidence_level, which indicates how reliable the transcription is; you can use a language model (LM) or any other idea to clean the low-confidence samples. In conclusion:


Note: 9 GB of the data was lost. :(

Links:
Version 1 data is packaged as WAV file + TXT file; Version 2 data is packaged as ZIP file + CSV file.


NOTE: if you need more tips, don't hesitate to email me: masoudparpanchi@gmail.com

Part of Our Dataset V0.1 📁 ⚡🔥

Here at Hamtech Company, we decided to open-source a challenging part of our ASR dataset. This subset is nearly 30 hours of audio plus a CSV file containing the transcriptions. Each record has a column named Confidence_level, which indicates how reliable the transcription is; you can use a language model (LM) or any other idea to clean the low-confidence samples. The variety of speakers in this dataset is limited, but the audio quality is good enough. Check the Dataset folder in this repo. In conclusion:
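
For example, a minimal pandas sketch for keeping only high-confidence samples (the CSV name and the threshold of 90 are placeholders; adjust the threshold to the scale of the Confidence_level column):

```python
# Keep only samples whose transcription confidence passes a threshold.
# "transcripts.csv" and the threshold of 90 are placeholders.
import pandas as pd

df = pd.read_csv("transcripts.csv")
reliable = df[df["Confidence_level"] >= 90]  # use 0.9 if the column is on a 0-1 scale
reliable.to_csv("transcripts_clean.csv", index=False)
print(f"kept {len(reliable)} of {len(df)} samples")
```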

Mozilla Deep Speech

Last checkpoint of the trained speech-to-text model (these are not ready for commercial use cases; this is only a fine-tuned model for you to use in your own projects):
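
As a rough sketch, once the checkpoint is exported to a .pbmm graph you can run inference with the deepspeech 0.9.3 Python package (file names below are placeholders):

```python
# Minimal DeepSpeech 0.9.3 inference sketch; model/scorer/audio paths are
# placeholders. Audio must be 16 kHz, 16-bit, mono.
import wave
import numpy as np
from deepspeech import Model

model = Model("persian.pbmm")
model.enableExternalScorer("persian.scorer")  # optional KenLM scorer

with wave.open("sample.wav", "rb") as f:
    audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

print(model.stt(audio))
```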

Start using DeepSpeech: clone the repo and download the Common Voice dataset

To use this toolkit, you must first follow the instructions at this link or use the short installation guide below.

Currently we are using DeepSpeech v0.9.3.

Short Installation

After cloning and installing the dependencies, you need the Common Voice dataset. Download the proper language and then preprocess it. All the steps for preprocessing the Common Voice dataset are documented there as well.
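
For reference, the usual Common Voice import step in DeepSpeech v0.9.3 looks roughly like this (run from the DeepSpeech repo root; the corpus path is a placeholder):

```bash
# Convert the downloaded Common Voice release to DeepSpeech CSVs + WAVs.
# bin/import_cv2.py ships with the DeepSpeech repo.
python3 bin/import_cv2.py --filter_alphabet data/alphabet.txt /path/to/cv-corpus/fa
```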

Your own dataset

If you want to create your own dataset, you need these tips:
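
Whatever the collection details, one requirement is fixed: DeepSpeech expects train/dev/test CSV files with the columns wav_filename, wav_filesize, and transcript. A minimal sketch (paths and the transcript are placeholders):

```python
# Build a DeepSpeech-style CSV; file paths and the transcript are placeholders.
import os
import pandas as pd

rows = [{
    "wav_filename": "/data/clips/0001.wav",
    "wav_filesize": os.path.getsize("/data/clips/0001.wav"),
    "transcript": "سلام دنیا",  # must only use characters from alphabet.txt
}]
pd.DataFrame(rows).to_csv("train.csv", index=False)
```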

Language model

You need a language model for testing the model. The language model is trained with KenLM.

The steps to train the language model are described here:

Training text file size: 2 GB

Test file for tuning: 20 MB
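
A minimal KenLM sketch matching those sizes, assuming KenLM is built and vocab-500000.txt holds your most frequent words (file names, the 5-gram order, and the alpha/beta values are placeholders; generate_scorer_package ships with DeepSpeech's native_client tooling):

```bash
# Train a 5-gram LM on the ~2 GB Persian corpus and binarize it.
lmplz --order 5 --text corpus.txt --arpa lm.arpa
build_binary lm.arpa lm.binary

# Package it as a DeepSpeech scorer; alpha/beta here are the English
# defaults and should be re-tuned for Persian on the 20 MB test file.
./generate_scorer_package --alphabet alphabet.txt --lm lm.binary \
  --vocab vocab-500000.txt --package kenlm.scorer \
  --default_alpha 0.93 --default_beta 1.18
```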


KenLM checkpoint link (this checkpoint is just a toy language model, trained on 2.5 GB of Persian text without deep optimization): https://drive.google.com/file/d/1IGL_SXNQdYINWEP93JnbAw1NjxtmZ-Hw/view?usp=share_link

A text dataset for training KenLM can be found here (nearly 80 GB of Persian text): https://nlpdataset.ir/farsi/raw_text_corpora.html



After all these steps, your dataset is ready and you can start training. If you train on English, there are no further steps, but if you work in another language (like Persian) you need to check the transfer-learning part of this link. TIP: don't forget to change alphabet.txt.

Question: can I use other languages' checkpoints to start transfer learning? Sure, do it. But remember to drop the weights of the last N layers; see the sketch below.
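
A sketch of what those transfer-learning flags look like in DeepSpeech v0.9.3 (checkpoint and CSV paths are placeholders; --drop_source_layers drops the last N layers, and --alphabet_config_path points at your edited Persian alphabet.txt):

```bash
python3 DeepSpeech.py \
  --drop_source_layers 1 \
  --alphabet_config_path data/alphabet.txt \
  --load_checkpoint_dir /ckpt/english \
  --save_checkpoint_dir /ckpt/persian \
  --train_files train.csv --dev_files dev.csv --test_files test.csv \
  --epochs 30 --learning_rate 0.0001
```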

You may need the meaning of the flags to use all the abilities of Mozilla DeepSpeech; check their documentation.

Tip: if you face CUDA/cuDNN errors, try using conda to install the proper versions.
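
For example, something along these lines (DeepSpeech v0.9.3 trains on TensorFlow 1.15, which expects CUDA 10.0 / cuDNN 7.6; exact versions may differ on your machine):

```bash
conda create -n deepspeech python=3.7
conda activate deepspeech
conda install cudatoolkit=10.0 cudnn=7.6
pip3 install 'tensorflow-gpu==1.15.4'
```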

Where to find Persian pretrained checkpoints:






DeepSpeech2

We use TensorSpeech (link to repository). Their repo is really complete and you can follow their steps to train a model, but I will share some tips:



Wav2vec 2.0

Using the Facebook fairseq toolkit.
This wav2vec 2.0 checkpoint is trained on 30 GB of speech data (all data with 90% and higher confidence): https://drive.google.com/file/d/1DX4R3wyjDiDyQ6-0EKv_0P3WV_co13H6/view?usp=share_link
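
A minimal sketch for loading that checkpoint with fairseq and extracting latent features (the file name is a placeholder, and the checkpoint may require a matching fairseq version):

```python
# Load the wav2vec 2.0 checkpoint and extract features from dummy audio.
import torch
from fairseq import checkpoint_utils

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec2_persian.pt"]  # placeholder name for the linked checkpoint
)
model = models[0].eval()

audio = torch.randn(1, 16000)  # placeholder: 1 second of 16 kHz mono audio
with torch.no_grad():
    out = model(audio, mask=False, features_only=True)
print(out["x"].shape)  # (batch, frames, hidden_dim) latent features
```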



My Own Implementations

Installation 🔧

You need to install some libraries. I'll list the most important ones here:

pip install pydub
pip install python_speech_features
apt-get install -y ffmpeg

The code is developed around Common Voice data; make sure your data is in that format.
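
A quick sanity check that your data follows the Common Voice layout (the column names match Common Voice releases; the TSV path is a placeholder):

```python
# Verify the TSV has the Common Voice columns the code relies on.
import pandas as pd

df = pd.read_csv("cv-corpus/fa/validated.tsv", sep="\t")
assert {"path", "sentence"}.issubset(df.columns)
print(df[["path", "sentence"]].head())
```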

Hyperparameters

All experiments and all hyperparameters : https://drive.google.com/file/d/1h7DhMsS_AGAguKypI_jhjv3JNT2Naemq/view?usp=share_link



WIKI page 📖

Visit our Wiki page for more info about tutorials, useful links, hardware info, results, and other things.


Contributing 🖇️

If you want to help us build better models and new approaches, please contact us; we will be happy to hear from you.
Email : masoudparpanchi@gmail.com