Lookahead streaming support?

Feature request

From what I understand, a streaming dataset currently pulls and processes data only as it is requested. This can introduce significant latency when data is loaded into the training process, since training has to wait for each segment.

The size of the delay is probably dataset specific (or even specific to the mapping function or tokenizer), but would it be possible to introduce a streaming_lookahead parameter for predictable workloads (including a shuffled dataset with a fixed seed)? In those cases we can predict in advance what the next few samples will be, and fetch them while the current batch is being trained on.

With enough CPU and bandwidth to keep up with the training process, and a sufficiently large lookahead, this would reduce the latency incurred while waiting for the dataset to be ready between batches.
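
To make the idea concrete, here is a rough sketch of the behaviour I have in mind, written as a plain Python wrapper rather than as anything inside the library. The helper name lookahead_iter and its lookahead argument are hypothetical (they are not an existing datasets API); it just reads ahead from any iterator in a background thread and keeps up to N items buffered.

```python
import queue
import threading


def lookahead_iter(source, lookahead=4):
    """Yield items from `source`, reading up to `lookahead` items ahead.

    Hypothetical helper for illustration only, not part of any library.
    """
    # Bounded queue: the worker blocks once `lookahead` items are waiting,
    # so memory use stays proportional to the lookahead size.
    buf = queue.Queue(maxsize=lookahead)

    def worker():
        try:
            for item in source:
                buf.put(("item", item))
            buf.put(("done", None))
        except Exception as exc:
            # Forward errors to the consumer instead of losing them in the thread.
            buf.put(("error", exc))

    # Daemon thread so an abandoned iterator does not keep the process alive.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        kind, value = buf.get()
        if kind == "done":
            return
        if kind == "error":
            raise value
        yield value
```

Usage would look something like `for sample in lookahead_iter(iter(streaming_dataset), lookahead=64): ...`, where streaming_dataset is whatever iterable the training loop already consumes. A thread only helps if the fetch and preprocessing work releases the GIL (network I/O usually does, pure-Python tokenization may not), so a real implementation might need worker processes instead.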

Motivation

Faster streaming performance when training over very large (TB-sized) datasets.

Your contribution

I currently use HF datasets with the PyTorch Lightning trainer for the RWKV project, and would be able to help test this feature if it is supported.

huggingface / datasets