
[Data] Cache transformations without pre-materializing #45042

Open · daturkel opened this issue 4 months ago

daturkel commented 4 months ago

Description

Currently, there are two ways to handle transformations performed by Ray Data:

  1. Materialize ahead of time, then all downstream actions (multiple training epochs) will use the cached computation, but the downstream steps must wait for the full materialization to complete first.
  2. Stream the computation, but that computation is then re-done on each training epoch.

This feature request is for a middle path, where Ray Data transformations are still streamed during a downstream task (e.g. training an epoch) but are then cached for future tasks (e.g. the next epoch); see the sketch below.
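
For context, here is a minimal sketch of the two existing options, assuming the standard Ray Data APIs (`ray.data.range`, `map_batches`, `materialize`, `iter_batches`); the `preprocess` function, batch sizes, and column names are just placeholders:

```python
import ray

def preprocess(batch):
    # Placeholder transformation; stands in for real preprocessing.
    batch["value"] = batch["id"] * 2
    return batch

ds = ray.data.range(1_000).map_batches(preprocess)

# Option 1: materialize ahead of time. Every epoch reuses the cached result,
# but training cannot start until the full materialization has finished.
materialized = ds.materialize()
for epoch in range(3):
    for batch in materialized.iter_batches(batch_size=128):
        ...  # train on batch

# Option 2: stream the computation. Training starts immediately, but the
# preprocessing is re-executed from scratch on every epoch.
for epoch in range(3):
    for batch in ds.iter_batches(batch_size=128):
        ...  # train on batch
```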

See this conversation on Ray Slack.

Use case

We would like to cache our preprocessing results during training without waiting for preprocessing to finish before training can start.
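
To make the requested semantics concrete, here is a rough, hypothetical single-process sketch; the `CachingLoader` class is not an existing Ray Data API, and it holds batches in driver memory purely to illustrate the idea (a real implementation would presumably cache blocks in the object store or on disk):

```python
import ray

# Same toy pipeline as in the sketch above; the transformation is a placeholder.
ds = ray.data.range(1_000).map_batches(lambda b: {"value": b["id"] * 2})

class CachingLoader:
    """Hypothetical helper: the first iteration streams batches from the
    dataset and stores a copy of each one; later iterations replay the stored
    batches instead of recomputing them."""

    def __init__(self, ds, batch_size):
        self._ds = ds
        self._batch_size = batch_size
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            cache = []
            for batch in self._ds.iter_batches(batch_size=self._batch_size):
                cache.append(batch)   # keep a copy for future epochs
                yield batch           # training consumes the batch immediately
            self._cache = cache       # only mark cached after a full pass
        else:
            yield from self._cache

loader = CachingLoader(ds, batch_size=128)
for epoch in range(3):
    for batch in loader:
        ...  # epoch 0 streams and fills the cache; later epochs reuse it
```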

hanif-rt commented 2 months ago

Any idea whether this is still on the roadmap?