huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Lookahead streaming support? #6120

Open PicoCreator opened 1 year ago

PicoCreator commented 1 year ago

Feature request

From what I understand, a streaming dataset currently pulls and processes data only as it is requested. This can introduce significant latency when data is loaded into the training process, which has to wait for each segment.

While the delays might be dataset-specific (or even specific to the mapping instructions or tokenizer), would it be possible to introduce a streaming_lookahead parameter for predictable workloads (including a shuffled dataset with a fixed seed)? Since we can predict in advance what the next few data samples will be, we could fetch them while the current set is being trained on.

With enough CPU and bandwidth to keep up with the training process, and a sufficiently large lookahead, this would reduce the time spent waiting for the dataset to be ready between batches.
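As a rough illustration of the idea (not an existing `datasets` feature), the lookahead could be sketched as a bounded buffer filled by a background thread. The `lookahead` helper and its `size` argument are hypothetical names chosen for this example, and it assumes the source yields samples in a deterministic order.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def lookahead(source: Iterable[T], size: int) -> Iterator[T]:
    """Yield items from `source` while a background thread fetches up to `size` items ahead.

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """
    buffer: queue.Queue = queue.Queue(maxsize=size)

    def producer() -> None:
        try:
            for item in source:
                # Blocks once `size` items are waiting, which bounds memory use.
                buffer.put(("item", item))
            buffer.put(("done", None))
        except Exception as exc:
            # Forward loading errors to the consumer instead of losing them.
            buffer.put(("error", exc))

    # The thread starts on the first `next()` call, because this is a generator.
    # It is a daemon thread so that an abandoned iterator cannot block interpreter exit.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        kind, value = buffer.get()
        if kind == "done":
            return
        if kind == "error":
            raise value
        yield value


if __name__ == "__main__":
    # Stand-in for a streamed dataset: any iterable works as the source.
    for sample in lookahead(range(5), size=2):
        print(sample)
```

Any iterable works as `source`, so a dataset loaded with `streaming=True` could in principle be wrapped the same way. The thread only hides latency when fetching and preprocessing release the GIL (network I/O usually does); CPU-heavy tokenization would likely need worker processes instead.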

Motivation

Faster streaming performance when training over very large (TB-sized) datasets.

Your contribution

I currently use HF datasets with the PyTorch Lightning trainer for the RWKV project, and would be able to help test this feature if it is supported.

mariosasko commented 1 year ago

In which format is your dataset? We could expose the pre_buffer flag for Parquet to use PyArrow's background thread pool to speed up loading.