This repository contains the code required to download audio data from openspeechcorpus.com.
So far, Open Speech Corpus is composed of three subcorpora: Tales, Isolated Words and Aphasia.
To download files from the Tales Project use
ops \
--output_folder tales/ \
--output_file tales.txt \
--corpus tales
To download files from the Isolated Words Project use
ops \
--output_folder isolated_words/ \
--output_file isolated_words.txt \
--corpus words
To download files from the Aphasia Project use
ops \
--output_folder aphasia/ \
--output_file aphasia.txt \
--corpus aphasia

--download_all

You can download the whole corpus using the --download_all flag:
ops \
--output_folder aphasia/ \
--output_file aphasia.txt \
--corpus aphasia \
--download_all

--from and --to

By default the page size is 500; to change the range that is downloaded, use the --from and --to arguments, e.g.:
ops \
--from 500 \
--to 1000 \
--output_folder aphasia/ \
--output_file aphasia.txt \
--corpus aphasia
If you use the --download_all flag together with the --from flag, the process starts at the position given by --from and uses a page size of 500.
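
The paging behaviour described above can be pictured with a small helper that prints the windows a paged download would walk through. This is only a sketch under assumptions: `page_ranges` is a hypothetical function that is not part of this repository, the total number of files is taken as known, and each window is treated as running from `from` up to `to`. Only the page size of 500 comes from the text above.

```shell
# Hypothetical helper: print "<from> <to>" pairs that cover the range
# [start, total) in windows of page_size (500 unless a third argument is given).
page_ranges() {
  from=$1
  total=$2
  page_size=${3:-500}
  while [ "$from" -lt "$total" ]; do
    to=$((from + page_size))
    # Clamp the last window so it does not run past the total.
    if [ "$to" -gt "$total" ]; then
      to=$total
    fi
    echo "$from $to"
    from=$to
  done
}

# Possible use (not executed here): one ops call per window.
# page_ranges 0 1200 | while read -r from to; do
#   ops --from "$from" --to "$to" \
#     --output_folder aphasia/ --output_file aphasia.txt --corpus aphasia
# done
```

Whether --output_file is appended to or overwritten between runs is not stated here, so check that before chaining several calls like this.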

--extra_query_params

We also support an --extra_query_params argument, which overrides the --from, --to and --download_all arguments and downloads all files in the response body. You must still define the --corpus argument:
ops \
--output_folder aphasia/ \
--output_file aphasia_letters.txt \
--corpus aphasia \
--extra_query_params "level_sentence__id__gte=846&level_sentence__id__lte=870"
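
When the filter has several conditions, it may be easier to assemble the query string from separate key=value pairs than to type it by hand. The sketch below assumes a hypothetical helper named `build_query` (not part of this repository) that joins its arguments with `&`. It does no URL-encoding, so it only suits values like the numeric ids used above.

```shell
# Hypothetical helper: join "key=value" arguments with "&".
# No URL-encoding is done; values are assumed to need no escaping.
build_query() {
  query=""
  for pair in "$@"; do
    if [ -z "$query" ]; then
      query=$pair
    else
      query="$query&$pair"
    fi
  done
  printf '%s\n' "$query"
}

# Possible use (not executed here):
# ops --output_folder aphasia/ --output_file aphasia_letters.txt --corpus aphasia \
#   --extra_query_params "$(build_query level_sentence__id__gte=846 level_sentence__id__lte=870)"
```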
The Open Speech Corpus stores its files in mp4 format, which is undesirable for most audio processing tasks. To convert the files to wav format, first install ffmpeg, then run the recursive_convert utility, which takes the source folder with the mp4 files as its first argument and the output folder as its second, e.g.:
recursive_convert aphasia aphasia_wav
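
For readers who want to see what such a conversion involves, here is a rough shell equivalent. It is not the implementation of recursive_convert, only a sketch that assumes the output tree should mirror the source tree. The function names `wav_path_for` and `convert_tree` are hypothetical, and the ffmpeg call relies on default wav encoding settings.

```shell
# Hypothetical helper: map an mp4 path under src_root to the matching
# wav path under dst_root, keeping the relative folder layout.
wav_path_for() {
  src_root=${1%/}
  dst_root=${2%/}
  file=$3
  rel=${file#"$src_root"/}
  printf '%s/%s.wav\n' "$dst_root" "${rel%.mp4}"
}

# Hypothetical helper: convert every mp4 below src_root into a wav below dst_root.
convert_tree() {
  src_root=$1
  dst_root=$2
  find "$src_root" -type f -name '*.mp4' | while IFS= read -r file; do
    out=$(wav_path_for "$src_root" "$dst_root" "$file")
    mkdir -p "$(dirname "$out")"
    # -nostdin keeps ffmpeg from consuming the file list piped into the loop.
    ffmpeg -nostdin -loglevel error -y -i "$file" "$out"
  done
}

# Possible use (not executed here):
# convert_tree aphasia aphasia_wav
```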
The Open Speech Corpus also defines some scripts to generate configurations for CMU Sphinx.
First initialize a project with the sphinxtrain command:
sphinxtrain -t simple_words setup
To generate a configuration use the configure_sphinx command, which creates the transcription, fileids, fillers and dic files:
configure_sphinx simple_words \
--transcription_file words.txt \
--etc_folder simple_words/etc \
--test_size 0.5
You also need to generate a language model with the generate_language_model command, which receives the DB_NAME and the base project folder:
generate_language_model simple_words simple_words
To delete the configuration files use the clean_previous_configuration command:
clean_previous_configuration simple_words --etc_folder simple_words/etc/
The Open Speech Corpus also defines some scripts to train models using HTK.
To generate a word grammar use:
configure_htk \
--transcription_file words.txt \
--project_folder htk_words \
--wav_folder words_wav \
htk_words