huggingface / datasets


Offer an alternative to Iterable Dataset that allows lazy loading and processing while skipping batches efficiently #5905


bruno-hays commented 1 year ago

Feature request

I would like a way to resume training from a checkpoint without waiting for a very long time when using an iterable dataset.

Motivation

I am training models on the speech-recognition task. My datasets are too large to store comfortably on disk, and the audio processing is quite computationally intensive, so I want to load data from remote storage only when it is needed and perform all processing on the fly.

I am currently using the iterable dataset feature of datasets, and it does everything I need with one exception. When resuming training at a step n, all the data for steps < n has to be downloaded and processed just to advance the iterator to the right position. In my case this takes almost as long as training those steps did, which makes resuming from a checkpoint useless in practice.
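
For context, a minimal sketch of what resuming boils down to today with an iterable dataset (the toy generator below stands in for the real download-and-process pipeline): reaching step n means producing and discarding every example before it.

from itertools import islice

from datasets import IterableDataset

def stream_examples():
    # Stands in for downloading and processing audio on the fly
    for i in range(1_000_000):
        yield {"id": i}

ds = IterableDataset.from_generator(stream_examples)

resume_step = 10_000  # restored from the checkpoint
# To "skip" to step n we still have to produce (i.e. download and process)
# every example before it; nothing is actually fast-forwarded.
resumed = islice(iter(ds), resume_step, None)
next(resumed)  # {"id": 10000}, but only after 10 000 wasted iterations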

I understand that the nature of iterators probably makes it nearly impossible to resume training quickly.

Nonetheless, I thought about a possible solution:

I could in fact index my large dataset and turn it into a mapped dataset. Then I could use set_transform to perform the processing on the fly. Finally, if I'm not mistaken, the accelerate package makes it possible to skip steps efficiently for a mapped dataset.
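
A rough sketch of how those pieces could fit together, assuming a mapped dataset that only stores lightweight references; the URLs and the process function below are placeholders, and skip_first_batches (available in recent versions of accelerate) skips batches by index for a map-style dataset, so the skipped samples are never loaded or processed:

import torch
from accelerate import skip_first_batches
from datasets import Dataset

# Placeholder indexed dataset: only URLs and transcripts live in memory.
ds = Dataset.from_dict({
    "audio_url": ["https://example.com/a.wav", "https://example.com/b.wav"],
    "text": ["foo", "bar"],
})

def process(batch):
    # Placeholder for downloading and feature-extracting each audio file.
    batch["num_chars"] = [len(t) for t in batch["text"]]
    return batch

ds.set_transform(process)  # runs lazily when a sample is accessed, not ahead of time

loader = torch.utils.data.DataLoader(ds, batch_size=1)
resume_step = 1  # restored from the checkpoint
# Skips the first `resume_step` batches by index, without loading or processing them.
loader = skip_first_batches(loader, num_batches=resume_step)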

Is it possible to lazily load samples of a mapped dataset? I'm used to dataset scripts; maybe something can be done there. If not, I could do it using a plain PyTorch dataset (a sketch follows below), but then I would need to convert it to a datasets Dataset to keep all the features of datasets. Is that possible?
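
If going through a plain PyTorch dataset instead, the lazy-loading part itself is straightforward; a minimal sketch, where download_and_decode is a hypothetical helper:

import torch

class LazyAudioDataset(torch.utils.data.Dataset):
    """Keeps only URLs and transcripts in memory; everything else happens in __getitem__."""

    def __init__(self, urls, texts):
        self.urls = urls
        self.texts = texts

    def __len__(self):
        return len(self.urls)

    def __getitem__(self, idx):
        # download_and_decode is a hypothetical helper that fetches and decodes one file.
        audio = download_and_decode(self.urls[idx])
        return {"audio": audio, "text": self.texts[idx]}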

Your contribution

I could provide a PR to allow lazy loading of mapped datasets, or the conversion of a mapped PyTorch dataset into a Datasets dataset, if you think it would be a useful new feature.

mariosasko commented 1 year ago

We plan to improve this eventually (see https://github.com/huggingface/datasets/issues/5454 and https://github.com/huggingface/datasets/issues/5380).

> Is it possible to lazily load samples of a mapped dataset? I'm used to dataset scripts; maybe something can be done there. If not, I could do it using a plain PyTorch dataset, but then I would need to convert it to a datasets Dataset to keep all the features of datasets. Is that possible?

Yes, by creating a mapped dataset that stores audio URLs. Indexing a dataset in this format only downloads and decodes the bytes of the accessed samples (without storing them on disk).

You can do the following to create this dataset:


import datasets
from datasets import Dataset, Features

def gen():
    # Generator that yields (audio URL, text) pairs as dicts
    ...
    yield {"audio": "audio_url", "text": "some text"}

features = Features({"audio": datasets.Audio(), "text": datasets.Value("string")})
ds = Dataset.from_generator(gen, features=features)
ds[2:5]  # downloads and decodes the samples each time they are accessed
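
To tie this back to the on-the-fly processing requirement: since such a dataset is map-style, a transform can be layered on top so that nothing is downloaded, decoded, or processed until a sample is actually accessed. A minimal sketch, assuming the ds built above and a placeholder prepare function (in practice e.g. a feature extractor applied to the decoded audio):

def prepare(batch):
    # batch["audio"] should arrive already decoded by the Audio feature
    # (dicts with "array" and "sampling_rate" keys).
    batch["input_length"] = [len(audio["array"]) for audio in batch["audio"]]
    return batch

ds.set_transform(prepare)
ds[0]  # downloads, decodes and transforms only this one sample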