lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Repository with downloadable manifests for standard recipes #7

Open pzelasko opened 4 years ago

pzelasko commented 4 years ago

Dan's original comment:

I wonder if the next step could be creating an example setup for some dataset, e.g. mini_librispeech? I'm thinking for now we can provide scripts that create the manifest files; making them downloadable could be an optional next step.

I don't think we necessarily have to be too purist about this, in terms of having the scripts be just Python; there may be some datasets where shell scripts, like Kaldi uses, or multiple languages will be necessary. One possibility is to structure it a little like Kaldi, with different egs directories. We also don't have to have it in the lhotse repo if we don't feel that's right (but including it is also fine with me, I think). I am slightly concerned about versioning issues: if people make recipes based on datasets and we then change something, what happens?

I am thinking about this a little like the standard datasets available in things like PyTorch, where people import them into their own setups/repos.

Bear in mind that at some point we'll want to be writing and loading compressed features using lilcom. We could extract the features using kaldi10feat or some other method (I'm thinking of log-mel features). For now we can probably have one recording per file (?).

Before making an example script it's OK to work on the lhotse stuff for handling features, though. We should just be thinking about what the scripts will look like.

pzelasko commented 4 years ago

For some perspective, this is how some frameworks do it:

Keras:

from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

PyTorch (torchvision):

import torchvision
imagenet_data = torchvision.datasets.ImageNet('path/to/imagenet_root/')  # PyTorch Dataset

TensorFlow Datasets (tfds):

import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train', shuffle_files=True)  # TensorFlow Dataset

I'm thinking ideally we would follow a similar principle, with notable differences:

Very rough sketch:

from lhotse.datasets import librispeech

librispeech_manifests = librispeech.v1.load()  # enforce specific version

librispeech_manifests = librispeech.v1.load(root_path='/home/user/librispeech')  # modifies the paths in the manifests to reflect the local filesystem

librispeech_manifests = librispeech.v1.load(splits=['train-clean-100', 'dev-clean', 'test-clean'])  # request specific, standard splits of the dataset
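
To make the three behaviours in the sketch concrete (pinning a version, rewriting paths to a local root, selecting splits), here is a minimal, self-contained illustration. Everything in it is hypothetical: `ManifestEntry`, `_REGISTRY`, and this `load` function are stand-ins for illustration, not an existing lhotse API, and the entries are hard-coded rather than read from real manifests.

```python
from dataclasses import dataclass, replace
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical manifest row: one recording and its dataset-relative path."""
    recording_id: str
    audio_path: str


# Hypothetical in-memory registry: version -> split -> entries.
# A real implementation would read these from manifest files instead.
_REGISTRY: Dict[str, Dict[str, List[ManifestEntry]]] = {
    "v1": {
        "train-clean-100": [
            ManifestEntry("103-1240-0000", "train-clean-100/103/1240/103-1240-0000.flac"),
        ],
        "dev-clean": [
            ManifestEntry("1272-128104-0000", "dev-clean/1272/128104/1272-128104-0000.flac"),
        ],
    }
}


def load(
    version: str,
    root_path: Optional[str] = None,
    splits: Optional[Sequence[str]] = None,
) -> Dict[str, List[ManifestEntry]]:
    """Return the entries for the requested splits of one pinned version."""
    if version not in _REGISTRY:
        raise ValueError(f"unknown manifest version: {version!r}")
    available = _REGISTRY[version]

    wanted = list(available) if splits is None else list(splits)
    unknown = [s for s in wanted if s not in available]
    if unknown:
        raise ValueError(f"unknown splits for {version}: {unknown}")

    result: Dict[str, List[ManifestEntry]] = {}
    for split in wanted:
        entries = available[split]
        if root_path is not None:
            # Prefix every relative path with the user's local dataset root.
            entries = [
                replace(e, audio_path=str(PurePosixPath(root_path) / e.audio_path))
                for e in entries
            ]
        result[split] = list(entries)
    return result
```

Calling `load("v1", root_path="/home/user/librispeech", splits=["dev-clean"])` returns only the `dev-clean` entries, with paths pointing into the local copy. Keeping the version as an explicit argument is one way to address the versioning concern raised earlier in the thread: a recipe that pins `v1` keeps working even if later manifest versions change.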

In terms of where to keep the manifests, I'm not sure. They might live in this repo, but it will grow quickly... We could keep them in storage like S3 and download them behind the scenes, maybe with eventual support for community-provided "recipes" (see e.g. what Hugging Face does with pretrained models).
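
The "download behind the scenes" idea could look roughly like the sketch below: check a local cache first and only fetch from remote storage on a miss. The function name `fetch_manifest`, the cache layout, and the `name/version/manifest.json` key scheme are all assumptions made for illustration. The actual download is passed in as a callable so the sketch does not depend on any particular storage client.

```python
from pathlib import Path
from typing import Callable


def fetch_manifest(
    name: str,
    version: str,
    cache_dir: Path,
    download: Callable[[str], bytes],
) -> Path:
    """Return a local path to the manifest, downloading it only on a cache miss.

    `download` is a hypothetical hook that takes a storage key and returns the
    file contents; it could wrap S3, plain HTTP, or anything else.
    """
    target = cache_dir / name / version / "manifest.json"
    if target.exists():
        return target  # cache hit: no network access needed

    target.parent.mkdir(parents=True, exist_ok=True)
    key = f"{name}/{version}/manifest.json"  # assumed remote key layout
    data = download(key)

    # Write to a temporary file first so an interrupted download
    # never leaves a truncated manifest in the cache.
    tmp = target.with_suffix(".tmp")
    tmp.write_bytes(data)
    tmp.replace(target)
    return target
```

Because the version is part of both the cache path and the remote key, several manifest versions can coexist locally, and a community-provided recipe would only need to publish files under its own name.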

danpovey commented 4 years ago

OK, cool.

I don't want the dataset to be too monolithic. There would be separate manifest files for the audio vs. the supervision. The mapping onto a PyTorch dataset will be something that involves k2 in a nontrivial way (we'll probably convert the text to FSAs, for a start).
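
As a rough picture of what "separate manifests for the audio vs. the supervision" might mean in practice, the sketch below keeps the two as independent lists linked only by a recording ID. The classes `AudioEntry` and `SupervisionEntry` and their fields are hypothetical simplifications for illustration, not the lhotse data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class AudioEntry:
    """Hypothetical audio manifest row: where the recording lives and how long it is."""
    recording_id: str
    path: str
    duration: float  # seconds


@dataclass(frozen=True)
class SupervisionEntry:
    """Hypothetical supervision manifest row: a labelled segment of one recording."""
    recording_id: str
    start: float     # seconds from the beginning of the recording
    duration: float  # seconds
    text: str


def group_supervisions(
    audio: Sequence[AudioEntry],
    supervisions: Sequence[SupervisionEntry],
) -> Dict[str, List[SupervisionEntry]]:
    """Join the two manifests: recording_id -> its supervisions, sorted by start time."""
    known = {a.recording_id for a in audio}
    grouped: Dict[str, List[SupervisionEntry]] = defaultdict(list)
    for sup in supervisions:
        if sup.recording_id not in known:
            raise KeyError(f"supervision refers to unknown recording {sup.recording_id!r}")
        grouped[sup.recording_id].append(sup)
    # Recordings without any supervision map to an empty list.
    return {
        a.recording_id: sorted(grouped[a.recording_id], key=lambda s: s.start)
        for a in audio
    }
```

Keeping the two manifests separate means the same audio manifest can be paired with different supervisions (for example, alternative transcripts), and the join happens only when a dataset is actually built.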

For now let's focus on getting to the point where we can have a script that dumps feature files compressed with lilcom, and then another script that's capable of loading the feature files and the corresponding supervisions. Remember that we want to be able to specify the amount of left and right acoustic context. We'll need a mechanism for padding in case we need more context than is available. Could be at the log-mel level, perhaps, e.g. pad with the lowest-energy frame that we have? Or some default low energy.
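
The padding idea can be sketched as follows: extract a segment plus the requested left and right context, and wherever the context runs past the ends of the recording, fill in copies of the lowest-energy frame. This is only one reading of the suggestion. The function names are hypothetical, features are plain Python lists to keep the sketch dependency-free (real code would use arrays), and the sum of a frame's log-mel bins is used as an assumed stand-in for its energy.

```python
from typing import List, Sequence

Frame = List[float]  # one frame of log-mel features


def lowest_energy_frame(frames: Sequence[Frame]) -> Frame:
    """Pick the frame with the smallest sum of log-mel bins (assumed energy proxy)."""
    return min(frames, key=sum)


def window_with_context(
    frames: Sequence[Frame],
    start: int,
    end: int,
    left: int,
    right: int,
) -> List[Frame]:
    """Return frames[start:end] extended by `left`/`right` context frames.

    Context that falls outside the recording is padded with copies of the
    lowest-energy frame, so the output always has (end - start) + left + right frames.
    """
    if not frames:
        raise ValueError("cannot pad an empty feature matrix")
    if not 0 <= start < end <= len(frames):
        raise ValueError("segment must lie inside the recording")
    if left < 0 or right < 0:
        raise ValueError("context sizes must be non-negative")

    pad = lowest_energy_frame(frames)
    out: List[Frame] = []
    for i in range(start - left, end + right):
        if 0 <= i < len(frames):
            out.append(list(frames[i]))  # real frame
        else:
            out.append(list(pad))        # outside the recording: pad
    return out
```

The alternative mentioned above, padding with "some default low energy", would replace `lowest_energy_frame` with a constant frame; that avoids depending on the contents of each recording but requires choosing a floor value.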

danpovey commented 4 years ago

And we definitely don't want to put the manifests in this repo. We could make a site alongside openslr, perhaps, maybe called lhotse-something.com (for now we could use some other domain that I own, like annapovey.com). For now, see if you can find somewhere on GitHub where they can go, but not in this repo, maybe?