Open pzelasko opened 4 years ago
For some perspective, this is how some frameworks do it:
Keras:
from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
PyTorch (torchvision):
import torchvision
imagenet_data = torchvision.datasets.ImageNet('path/to/imagenet_root/')  # PyTorch Dataset
TensorFlow Datasets (tfds):
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train', shuffle_files=True)  # TensorFlow Dataset
I'm thinking ideally we would follow a similar principle, with notable differences:
Very rough sketch:
from lhotse.datasets import librispeech
librispeech_manifests = librispeech.v1.load() # enforce specific version
librispeech_manifests = librispeech.v1.load(root_path='/home/user/librispeech') # modifies the paths in the manifests to reflect the local filesystem
librispeech_manifests = librispeech.v1.load(splits=['train-clean-100', 'dev-clean', 'test-clean']) # request specific, standard splits of the dataset
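To make the root_path idea a bit more concrete, here is a minimal sketch of how the paths in a manifest might be re-rooted to the local filesystem. Everything in it is an assumption for illustration: the manifest layout (a list of dicts with "id" and "path" keys), the placeholder paths, and the function name relocate_manifest are hypothetical and not an existing Lhotse API.

```python
from pathlib import PurePosixPath


def relocate_manifest(entries, original_root, root_path):
    """Return copies of manifest entries with paths re-rooted under root_path.

    `entries` is a hypothetical manifest: a list of dicts with "id" and "path".
    """
    old_root = PurePosixPath(original_root)
    new_root = PurePosixPath(root_path)
    relocated = []
    for entry in entries:
        path = PurePosixPath(entry["path"])
        # relative_to raises ValueError if the path is not under old_root,
        # which surfaces manifests that do not match the expected layout.
        relative = path.relative_to(old_root)
        relocated.append({**entry, "path": str(new_root / relative)})
    return relocated


# Placeholder entry; the ID and paths are made up for the example.
manifest = [{"id": "rec-0001", "path": "/original/root/train-clean-100/rec-0001.flac"}]
local = relocate_manifest(manifest, "/original/root", "/home/user/librispeech")
print(local[0]["path"])  # /home/user/librispeech/train-clean-100/rec-0001.flac
```

Returning new entries instead of mutating the input keeps the downloaded manifest reusable for other root paths.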
In terms of where to keep the manifests, I'm not sure. They might live in this repo, but it will grow quickly... We could keep them in storage like S3 and download them behind the scenes, maybe with eventual support for community-provided "recipes" (see e.g. what Hugging Face does with pretrained models).
OK, cool.
I don't want the dataset to be too monolithic a thing. There would be separate manifest files for the audio vs. the supervision. The mapping onto a PyTorch dataset will be something that involves k2 in a nontrivial way (we'll probably convert the text to FSAs, for a start).
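As a rough illustration of keeping audio and supervision manifests separate, the two could be joined on a recording ID only at load time. This is just a sketch under assumptions: the Recording and Supervision classes, their fields, and group_supervisions are hypothetical names made up here, and the k2/FSA conversion is left out entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    """Hypothetical audio manifest entry."""
    id: str
    path: str
    duration: float  # seconds


@dataclass(frozen=True)
class Supervision:
    """Hypothetical supervision manifest entry."""
    recording_id: str
    start: float     # seconds from the start of the recording
    duration: float  # seconds
    text: str


def group_supervisions(recordings, supervisions):
    """Map each recording ID to its supervisions, sorted by start time."""
    known = {rec.id for rec in recordings}
    grouped = defaultdict(list)
    for sup in supervisions:
        if sup.recording_id not in known:
            raise KeyError(f"unknown recording {sup.recording_id!r}")
        grouped[sup.recording_id].append(sup)
    return {rid: sorted(sups, key=lambda s: s.start) for rid, sups in grouped.items()}
```

With this split, the audio manifest can be reused unchanged when only the transcripts or segmentations differ.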
For now let's focus on getting to the point where we can have a script that dumps feature files compressed with lilcom, and then another script that's capable of loading the feature files and the corresponding supervisions. Remember that we want to be able to specify the amount of left and right acoustic context. We'll need a mechanism for padding in case we need more context than is available. Could be at the log-mel level, perhaps, e.g. pad with the lowest-energy frame that we have? Or some default low energy.
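The padding idea could be sketched roughly as follows. This is only an illustration under assumptions: features are plain lists of per-frame log-mel vectors (a real implementation would presumably use NumPy arrays loaded from the lilcom-compressed files), "lowest energy" is approximated by the smallest sum over log-mel bins, and the name extract_with_context is hypothetical.

```python
def extract_with_context(frames, start, end, left_context=0, right_context=0):
    """Return frames[start:end] extended by the requested context, in frames.

    Context that falls outside the utterance is filled with copies of the
    lowest-energy frame, where energy is approximated by the sum of the
    frame's log-mel bins (an assumption made for this sketch).
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    if not 0 <= start <= end <= len(frames):
        raise ValueError("segment must lie within the utterance")
    if left_context < 0 or right_context < 0:
        raise ValueError("context sizes must be non-negative")

    pad_frame = min(frames, key=sum)
    lo = start - left_context
    hi = end + right_context
    pad_left = max(0, -lo)                # frames missing before the utterance
    pad_right = max(0, hi - len(frames))  # frames missing after the utterance

    window = [list(f) for f in frames[max(0, lo):min(len(frames), hi)]]
    left = [list(pad_frame) for _ in range(pad_left)]
    right = [list(pad_frame) for _ in range(pad_right)]
    return left + window + right


utterance = [[1.0, 2.0], [0.0, 0.5], [3.0, 3.0]]
padded = extract_with_context(utterance, 0, 3, left_context=2, right_context=1)
print(len(padded))  # 6
```

Swapping pad_frame for a fixed low-energy vector would give the "default low energy" variant instead.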
And we definitely don't want to put the manifests in this repo. We could make a site alongside openslr, perhaps, maybe called lhotse-something.com (for now could use some other domain that I own, like annapovey.com). For now see if you can find somewhere on GitHub that they can go, but not in that repo, maybe?
Dan's original comment: