Automatic spoken language identification (LID) using deep learning.
We classify the spoken language in audio files, a task that usually serves as the first step for speech transcription or other NLP pipelines.
We implemented two deep learning approaches, using the TensorFlow and Caffe frameworks with different model configurations.
./run.sh --inputPath {input_path} --outputPath {output_path} | tee sparkline.log
Training and prediction are handled by train.py
and predict.py
Each network instance is configured through a config.yaml file in
./tensorflow/networks/instances/
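A per-network config.yaml might look like the following sketch. All key names and values here are illustrative assumptions, not the repository's actual schema:

```yaml
# Hypothetical config.yaml for one network instance (all keys are assumptions)
model_name: lid_net
input_shape: [128, 858, 1]          # spectrogram height x width x channels
labels: [english, german, french, spanish]
batch_size: 64
learning_rate: 0.001
```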
// Install additional Python requirements
pip install -r requirements.txt
pip install youtube_dl
The following scripts download training data / audio samples from various sources:
/data/voxforge/download-data.sh
/data/voxforge/extract_tgz.sh {path_to_german.tgz} german
// Edit the video sources in youtube/sources.yml, then run:
python /data/youtube/download.py
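A minimal sketch of what such a download step could look like with the youtube_dl package installed above. The function names, the directory layout, and the sources structure are assumptions for illustration; only the youtube_dl calls themselves are real API:

```python
# Sketch: fetch audio-only tracks per language using youtube_dl.
# build_ydl_options / download_sources are hypothetical names.

def build_ydl_options(language, output_dir="audio"):
    """Build youtube_dl options that keep only the audio track."""
    return {
        "format": "bestaudio/best",                       # audio-only download
        "outtmpl": f"{output_dir}/{language}/%(id)s.%(ext)s",
        "quiet": True,
    }

def download_sources(sources):
    """sources: dict mapping language -> list of video URLs."""
    import youtube_dl  # lazy import; requires `pip install youtube_dl`
    for language, urls in sources.items():
        with youtube_dl.YoutubeDL(build_ydl_options(language)) as ydl:
            ydl.download(urls)
```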
We trained models on two and on four languages (English, German, French, Spanish).
The top-scoring networks were trained with 15,000 images per language, a batch size of 64, and a learning rate of 0.001 that was decayed to 0.0001 after 7,000 iterations.
// Caffe:
/models/{model_name}/training.sh
// Tensorflow:
python /tensorflow/train.py
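The learning-rate schedule described above (0.001 decayed to 0.0001 after 7,000 iterations) is a simple step decay and can be sketched independently of either framework; the function name is illustrative:

```python
def learning_rate(step, base_lr=0.001, decayed_lr=0.0001, decay_step=7000):
    """Step-decay schedule: base_lr until decay_step, then decayed_lr."""
    return base_lr if step < decay_step else decayed_lr
```

In TensorFlow or Caffe this would typically be expressed through the framework's own schedule mechanism rather than a hand-rolled function.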
0 English
1 German
2 French
3 Spanish
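Mapping these label indices back to language names at prediction time takes only a small helper; the function name and score format are assumptions:

```python
# Label indices as listed above.
LABELS = {0: "English", 1: "German", 2: "French", 3: "Spanish"}

def decode_prediction(scores):
    """Return (language, confidence) for per-class scores ordered by label index."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best], scores[best]
```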
For training we used both the public VoxForge dataset and newsreel videos downloaded from YouTube. Check out the /data directory for the download scripts.