Open couragelfyang opened 4 years ago
I believe they are just lists of utterance ids. In my LibriSpeech install, I found a bunch of files ending in .txt that contain utterance ids and transcriptions, one utterance per line. This is how I generated the list files:
# Generates train.txt, eval.txt, validation.txt, which
# are just lists of utterance ids. This script looks
# at all the .txt files within LibriSpeech to extract
# the ids and write the files.
# An utterance id is a string like "61-70968-0009".
import os

trainroot = 'LibriSpeech/train-clean-100/' #, 'train-clean-360/', 'train-other-500/'
devroot = 'LibriSpeech/dev-clean/' #, 'LibriSpeech/dev-other/'
testroot = 'LibriSpeech/test-clean/'

def generate_list(root_dir, fn):
    # get the utterance ids
    utterance_ids = []
    for subdir, _, files in os.walk(root_dir):
        for filename in [f for f in files if f.endswith(".txt")]:
            with open(os.path.join(subdir, filename)) as f:
                # the first whitespace-separated token on each non-empty line is the id
                ids = [l.split()[0] + "\n" for l in f if l.strip()]
            utterance_ids.extend(ids)
    # write them, creating the output directory if it does not exist yet
    os.makedirs(os.path.dirname(fn) or ".", exist_ok=True)
    with open(fn, "w") as of:
        of.writelines(utterance_ids)

if __name__ == "__main__":
    generate_list(trainroot, "LibriSpeech/list/train.txt")
    generate_list(testroot, "LibriSpeech/list/eval.txt")
    generate_list(devroot, "LibriSpeech/list/validation.txt")
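To sanity-check a generated list, something like the following might help. It is only a sketch: it assumes the standard LibriSpeech layout, where an id such as "61-70968-0009" lives at <split>/61/70968/61-70968-0009.flac, and the helper names utterance_audio_path and missing_audio are made up for illustration. I have not checked how main.py itself resolves the ids.

```python
from pathlib import Path


def utterance_audio_path(corpus_root, split, utterance_id):
    """Map an id like '61-70968-0009' to <root>/<split>/61/70968/61-70968-0009.flac.

    Assumes the standard LibriSpeech speaker/chapter directory layout.
    """
    speaker, chapter, _ = utterance_id.split("-")
    return Path(corpus_root) / split / speaker / chapter / f"{utterance_id}.flac"


def missing_audio(list_file, corpus_root, split):
    """Return the ids in list_file that have no matching .flac file on disk."""
    with open(list_file) as f:
        ids = [line.strip() for line in f if line.strip()]
    return [u for u in ids
            if not utterance_audio_path(corpus_root, split, u).is_file()]
```

For example, missing_audio("LibriSpeech/list/eval.txt", "LibriSpeech", "test-clean") should return an empty list if every id in eval.txt has its audio file in place.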
The dataset is available at http://www.openslr.org/12/
I saw that list files such as "LibriSpeech/list/train.txt" are required parameters for main.py. It seems these files are not officially provided by LibriSpeech. What is their format? Could you provide them, or the script to generate them?