bruno-hays opened this issue 1 year ago
We plan to improve this eventually (see https://github.com/huggingface/datasets/issues/5454 and https://github.com/huggingface/datasets/issues/5380).
Is it possible to lazily load samples of a mapped dataset? I'm used to dataset scripts; maybe something can be done there. If not, I could do it with a plain PyTorch dataset, but then I would need to convert it to a datasets Dataset to keep all the features of datasets. Is that possible?
Yes, by creating a mapped dataset that stores audio URLs. Indexing a dataset stored in this format only downloads and decodes the bytes of the accessed samples (without storing them on disk).
You can do the following to create this dataset:
from datasets import Audio, Dataset, Features, Value

def gen():
    # Generator that yields (audio URL, text) pairs as dicts
    ...
    yield {"audio": "audio_url", "text": "some text"}

features = Features({"audio": Audio(), "text": Value("string")})
ds = Dataset.from_generator(gen, features=features)
ds[2:5]  # downloads and decodes the samples each time they are accessed
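As a usage note (assuming the default decoding of the Audio feature, i.e. decode=True), accessing a single example returns the decoded waveform directly:

sample = ds[0]                    # downloads and decodes only this sample
sample["audio"]["array"]          # NumPy array with the decoded waveform
sample["audio"]["sampling_rate"]  # sampling rate of the decoded audio
sample["text"]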
Feature request
I would like a way to resume training from a checkpoint without waiting for a very long time when using an iterable dataset.
Motivation
I am training models for speech recognition. I have very large datasets that I can't comfortably store on disk, and the audio processing is quite computationally intensive. As a result, I want to load data from remote storage only when it is needed and perform all processing on the fly.
I am currently using the iterable dataset feature of datasets. It does everything I need, with one exception: when resuming training at step n, we have to download the data and run the processing for all steps < n just to get the iterator to the right position. In my case this takes almost as long as training over those same steps, which makes resuming from a checkpoint useless in practice.
I understand that the nature of iterators probably makes it nearly impossible to resume training quickly.
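For illustration, here is roughly what resuming looks like today with a streamed dataset. The dataset name and n below are placeholder values, and as far as I can tell the built-in skip also has to consume the stream:

from itertools import islice
import datasets

# placeholder dataset and step count, just to illustrate the problem
iterable_ds = datasets.load_dataset("user/audio-dataset", split="train", streaming=True)
n = 10_000  # number of examples consumed before the checkpoint

# both approaches still download and process the first n examples,
# they just discard the results
for example in islice(iterable_ds, n, None):
    ...  # training resumes here

resumed = iterable_ds.skip(n)  # IterableDataset.skip also iterates and discards, as far as I can tell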
I thought about a possible solution nonetheless:
I could in fact index my large dataset and make it a mapped dataset. Then I could use set_transform to perform the processing on the fly. Finally, if I'm not mistaken, the accelerate package allows skipping steps efficiently for a mapped dataset.
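To make the idea concrete, here is a minimal sketch of what I have in mind, assuming ds is the URL-backed mapped dataset from the snippet above, on_the_fly and collate are hypothetical processing/collating functions, and n is the number of batches already seen before the checkpoint:

from torch.utils.data import DataLoader
from accelerate import Accelerator

def on_the_fly(batch):
    # hypothetical processing: turn the decoded audio into model inputs
    batch["input_values"] = [audio["array"] for audio in batch["audio"]]
    return batch

ds.set_transform(on_the_fly)  # applied lazily on access, nothing is cached or written to disk

def collate(examples):
    # hypothetical collator: real code would pad the variable-length arrays
    return {
        "input_values": [ex["input_values"] for ex in examples],
        "text": [ex["text"] for ex in examples],
    }

accelerator = Accelerator()
dataloader = accelerator.prepare(DataLoader(ds, batch_size=8, collate_fn=collate))

n = 1_000  # example value: batches already consumed before the checkpoint
resumed = accelerator.skip_first_batches(dataloader, num_batches=n)
for batch in resumed:
    ...  # training step

If skipping works at the sampler level for a mapped dataset, as I expect, none of the skipped samples would be fetched or processed.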
Is it possible to lazily load samples of a mapped dataset? I'm used to dataset scripts; maybe something can be done there. If not, I could do it with a plain PyTorch dataset, but then I would need to convert it to a datasets Dataset to keep all the features of datasets. Is that possible?
Your contribution
I could provide a PR to allow lazy loading of mapped datasets, or the conversion of a mapped PyTorch dataset into a Datasets dataset, if you think it would be a useful new feature.