Investigation: Datasets for Evaluation of Speech Detection

hhuangMITRE commented 1 year ago

From LDC we found the following datasets of interest:

• CSR-I (WSJ0) Complete - English only
• United Nations Proceedings Speech - Multiple languages, massive dataset (transcriptions uncertain)
• OGI Multilanguage Corpus - Phonetic transcriptions
• CSLU: 22 Languages Corpus - Orthographic (standard spelling of target language) transcriptions
• CSLU: Multilanguage Telephone Speech Version 1.2 - Phonetic transcriptions
• Hispanic-English Database - Orthographic transcriptions

hhuangMITRE commented 1 year ago

We decided to start with the Hispanic-English Database for our initial investigations because it contains more clean individual English and Spanish utterances of consistent length, organized by speaker and recorded over microphone, as well as a subsection of combined English utterances recorded over telephone.

The next dataset is CSLU: 22 Languages, which contains individual and combined utterances (stories) told by different speakers over telephone. Because it covers many languages, it has fewer individual Spanish utterances than the Hispanic-English Database, and some utterances contain 3 or fewer words, which may not be sufficient for language ID.
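One way to screen out the very short utterances before running language ID is to filter on the word count of the orthographic transcript. This is only a minimal sketch: the function name, the `min_words` threshold of 4 (i.e. dropping utterances with 3 or fewer words), and the sample transcripts are made up for illustration and do not reflect the actual CSLU file layout or contents.

```python
def filter_short_utterances(transcripts, min_words=4):
    """Keep utterances whose transcript has at least `min_words` words.

    `transcripts` maps a (hypothetical) utterance ID to its orthographic
    transcript. Words are split on whitespace, which is an assumption
    that suits space-delimited languages such as English and Spanish.
    """
    return {
        utt_id: text
        for utt_id, text in transcripts.items()
        if len(text.split()) >= min_words
    }


# Made-up sample data, not taken from any corpus.
transcripts = {
    "utt_a": "hola",
    "utt_b": "buenos dias a todos",
    "utt_c": "yes I do",
}

kept = filter_short_utterances(transcripts)
print(sorted(kept))
```

The threshold is left as a parameter because the minimum length needed for reliable language ID would have to be determined empirically.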

hhuangMITRE commented 1 year ago

For now, the CSLU: 22 Languages and Hispanic-English datasets are sufficient for our preliminary investigation.

openmpf / openmpf-evaluation

Investigation: Datasets for Evaluation of Speech Detection #7