
[Data] Cache transformations without pre-materializing #45042

Open · daturkel opened this issue 4 months ago

daturkel commented 4 months ago

Description

Currently, there are two ways to handle transformations performed by Ray Data:

  1. Materialize ahead of time, then all downstream actions (multiple training epochs) will use the cached computation, but the downstream steps must wait for the full materialization to complete first.
  2. Stream the computation, but that computation is then re-done on each training epoch.

This feature request is for a middle path, where Ray Data transformations are still streamed during a downstream task (e.g. training an epoch) but are then cached for future tasks (e.g. the next epoch); see the sketch below.
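
For context, here is a minimal sketch of the two existing options, assuming the standard Ray Data APIs (`ray.data.range`, `map_batches`, `materialize`, `iter_batches`); the `preprocess` function, batch sizes, and column names are just placeholders:

```python
import ray

def preprocess(batch):
    # Placeholder transformation; stands in for real preprocessing.
    batch["value"] = batch["id"] * 2
    return batch

ds = ray.data.range(1_000).map_batches(preprocess)

# Option 1: materialize ahead of time. Every epoch reuses the cached result,
# but training cannot start until the full materialization has finished.
materialized = ds.materialize()
for epoch in range(3):
    for batch in materialized.iter_batches(batch_size=128):
        ...  # train on batch

# Option 2: stream the computation. Training starts immediately, but the
# preprocessing is re-executed from scratch on every epoch.
for epoch in range(3):
    for batch in ds.iter_batches(batch_size=128):
        ...  # train on batch
```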

See this conversation on Ray Slack.

Use case

We would like to cache our preprocessing results during training without waiting for preprocessing to finish before training can start.
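
To make the requested semantics concrete, here is a rough, hypothetical single-process sketch; the `CachingLoader` class is not an existing Ray Data API, and it holds batches in driver memory purely to illustrate the idea (a real implementation would presumably cache blocks in the object store or on disk):

```python
import ray

# Same toy pipeline as in the sketch above; the transformation is a placeholder.
ds = ray.data.range(1_000).map_batches(lambda b: {"value": b["id"] * 2})

class CachingLoader:
    """Hypothetical helper: the first iteration streams batches from the
    dataset and stores a copy of each one; later iterations replay the stored
    batches instead of recomputing them."""

    def __init__(self, ds, batch_size):
        self._ds = ds
        self._batch_size = batch_size
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            cache = []
            for batch in self._ds.iter_batches(batch_size=self._batch_size):
                cache.append(batch)   # keep a copy for future epochs
                yield batch           # training consumes the batch immediately
            self._cache = cache       # only mark cached after a full pass
        else:
            yield from self._cache

loader = CachingLoader(ds, batch_size=128)
for epoch in range(3):
    for batch in loader:
        ...  # epoch 0 streams and fills the cache; later epochs reuse it
```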

hanif-rt commented 2 months ago

Any idea whether this is still on the roadmap?