Preprocessing scripts for the Santa Barbara Corpus of Spoken American English (SBCSAE).
pip install -r requirements.txt
Download corpus data to <data dir>
.
./download.sh <data dir>
The data directory should look like
<data dir>/
├── mp3
│ ├── 01.mp3
│ ├── 02.mp3
│ ├── ...
│ └── 60.mp3
└── trn
├── SBC001.trn
├── SBC002.trn
├── ...
└── SBC060.trn
.mp3
files to .wav
files..trn
files into transcriptions for ASR training and evaluation.python3 preprocess.py --trn <data dir>/trn --mp3 <data dir>/mp3 --wav <data dir>/wav --tsv <transcript dir>
After processing, there will be three transcription files train.tsv
, dev.tsv
, and test.tsv
. Some examples in the train.tsv
file is shown as follows
SBC009_000000-001316.wav AM I DOING THAT RIGHT SO FAR
SBC042_020175-021461.wav ALL I NEED IS YOUR SIGNATURE SO I CAN PLAY THE VOLLEYBALL
SBC042_081976-083246.wav WELL I GUESS THAT NOTE
The .wav
files are stored in <data dir>/wav
while the transciptions are in <transcript dir>
. The naming criterion of .wav
files is
SBC<.mp3 index>_<start sec x 100>-<end sec x 100>.wav
E.g., the SBC009_000000-001316.wav
file is the 0.00 sec to the 13.16 sec of 09.mp3
. Note that we remove utterances longer than 15 seconds.
Du Bois, John W., Wallace L. Chafe, Charles Meyer, Sandra A. Thompson, Robert Englebretson, and Nii Martey. 2000-2005. Santa Barbara corpus of spoken American English, Parts 1-4. Philadelphia: Linguistic Data Consortium.
TBA