ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Data] Ray Data runs out of disk while writing Parquet files #48104

Open bveeramani opened 2 weeks ago

bveeramani commented 2 weeks ago

What happened + What you expected to happen

See repro.

I'm trying to read Parquet files that contain image URIs, load the referenced images, and write the loaded images back to Parquet files. To ensure that Ray Data fully utilizes the cluster, I repartitioned the input data, but that introduced a new problem.

Explanation of problem:

Versions / Dependencies

2.37

Reproduction script

```python
import numpy as np

import ray

ray.init(num_cpus=8)

def read_many_uris(batch):
    # Each yielded batch is a 128 MiB uint8 array, so a single task
    # emits roughly 125 GiB in total.
    for _ in range(1000):
        yield {"data": np.zeros((1, 128, 1024, 1024), dtype=np.uint8)}

(
    ray.data.range(8, override_num_blocks=1)
    .repartition(8)  # Repartition the data to ensure we can fully utilize the CPUs.
    .map_batches(read_many_uris, batch_size=1)
    .write_parquet("/tmp", ray_remote_args={"memory": 1024**3})
)
```
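For scale, here is a rough back-of-envelope calculation (my own numbers, not stated in the issue): each yielded batch is 128 MiB, so without backpressure the eight tasks can emit on the order of a terabyte of output, which the object store spills to disk.

```python
# Illustrative arithmetic for the repro above (assumption: all 1000 yields
# per task complete before backpressure intervenes).
batch_bytes = 1 * 128 * 1024 * 1024   # one (1, 128, 1024, 1024) uint8 batch = 128 MiB
task_bytes = 1000 * batch_bytes       # ~125 GiB produced by a single map task
total_bytes = 8 * task_bytes          # ~1 TiB across all 8 tasks

print(task_bytes / 1024**3)   # GiB per task
print(total_bytes / 1024**4)  # TiB for the whole pipeline
```

A volume like this easily exceeds the local disk used for object spilling, which matches the "runs out of disk" failure mode in the title.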

Issue Severity

None

alexeykudinkin commented 2 weeks ago

@bveeramani why is back-pressure not kicking in here?

bveeramani commented 2 weeks ago

> Calling repartition causes Ray Data to fall back to the old scheduling behavior

@alexeykudinkin Because of this

alexeykudinkin commented 6 days ago

@bveeramani can you help me understand why we're falling back to the old behavior then? Is it because override_num_blocks is being used?

bveeramani commented 6 days ago

No, override_num_blocks isn't relevant here.

All-to-all operations don't implement accurate memory accounting. So, if your DAG contains one, Ray Data doesn't use the op resource reservation algorithm: https://github.com/ray-project/ray/blob/23cc23b7c295a2959682df785408e534095b2e19/python/ray/data/_internal/execution/resource_manager.py#L77-L84
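A simplified sketch of that decision (illustrative only; the names and dict-based ops here are hypothetical, not the actual ResourceManager code linked above):

```python
# Hypothetical sketch: reservation-based backpressure is only usable if every
# operator in the DAG reports accurate memory usage. All-to-all operators
# (e.g. repartition) do not, so one of them disables it for the whole plan.
def uses_op_resource_reservation(dag_ops):
    return all(op["accurate_memory_accounting"] for op in dag_ops)

dag = [
    {"name": "ReadRange", "accurate_memory_accounting": True},
    {"name": "Repartition", "accurate_memory_accounting": False},  # all-to-all
    {"name": "MapBatches", "accurate_memory_accounting": True},
]
print(uses_op_resource_reservation(dag))  # False -> fall back to old scheduling
```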

In this case, it'll use the old scheduling behavior (_execution_allowed) and pull as much data as possible from streaming generators: https://github.com/ray-project/ray/blob/23cc23b7c295a2959682df785408e534095b2e19/python/ray/data/_internal/execution/streaming_executor_state.py#L550-L555

https://github.com/ray-project/ray/blob/23cc23b7c295a2959682df785408e534095b2e19/python/ray/data/_internal/execution/streaming_executor_state.py#L412-L421
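To make the difference concrete, here is a hypothetical sketch contrasting the two behaviors (my own simplification, assuming blocks are pulled one at a time against a per-op memory budget; not the linked `_execution_allowed` code):

```python
# Hypothetical sketch: reservation-based scheduling stops pulling generator
# outputs once the operator's memory budget is spent; the old behavior
# drains everything that is ready.
def outputs_to_pull(ready_block_sizes, budget_bytes, reservation_enabled):
    if not reservation_enabled:
        return list(ready_block_sizes)  # old path: pull as much as possible
    pulled, used = [], 0
    for block_bytes in ready_block_sizes:
        if used + block_bytes > budget_bytes:
            break  # budget exhausted: leave remaining blocks in the generator
        pulled.append(block_bytes)
        used += block_bytes
    return pulled

blocks = [128 * 1024**2] * 10  # ten 128 MiB blocks ready in the generator
print(len(outputs_to_pull(blocks, budget_bytes=1024**3, reservation_enabled=True)))   # 8
print(len(outputs_to_pull(blocks, budget_bytes=1024**3, reservation_enabled=False)))  # 10
```

With the reservation path disabled by the repartition, the pipeline behaves like the second call: it keeps pulling output from `read_many_uris` regardless of memory pressure, and the excess spills until the disk fills.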