ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Worker reuse sometimes fails #30985

Open cadedaniel opened 1 year ago

cadedaniel commented 1 year ago

What happened + What you expected to happen

I was benchmarking a Ray Data program that loaded files from disk, preprocessed them with TensorFlow, then iterated over the preprocessed data. I would frequently find that the preprocessing throughput would drop quite a bit (by 50% or 60%), apparently at random.

Upon investigation, whenever this slowdown occurs there are TensorFlow logs indicating that TensorFlow is being imported. TensorFlow takes quite a bit of time to initialize. This led us to conclude that Ray sometimes creates a new worker process instead of reusing an existing one.
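One way to confirm a cold start from inside a task is to check whether the module is already cached in the worker process before importing it. A minimal sketch (pure Python, no Ray required; `timed_import` is a hypothetical helper, not part of Ray):

```python
import sys
import time

def timed_import(module_name):
    """Import module_name and report whether this process had it cached.

    In a warm (reused) worker the module is already in sys.modules, so the
    import is near-instant; a cold worker pays the full import cost.
    """
    already_loaded = module_name in sys.modules
    start = time.perf_counter()
    __import__(module_name)
    elapsed = time.perf_counter() - start
    return already_loaded, elapsed
```

Calling this at the top of a Ray task body (e.g. `timed_import("tensorflow")`) would distinguish reused workers from freshly spawned ones without relying on TensorFlow's log output.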

I used the following snippet to ensure enough warm Ray worker processes:

import time

import ray

def warmup_tf_tasks(count):
    @ray.remote
    def warmup():
        import tensorflow as tf
        tf.print('Warming up task')

    print(f'Warming up with {count} tasks')
    start = time.time()
    ray.get([warmup.remote() for _ in range(count)])
    end = time.time()
    print(f'Done warming up in {end - start:0.2f}s')  # typically takes 10-12s

No runtime envs are being explicitly used. Notably, sometimes this would happen for most of the Ray Data tasks, other times for only one, and yet other times it would not happen at all.
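A direct way to check whether warmed workers are actually being reused is to count the distinct worker PIDs that a batch of tasks runs on. The sketch below illustrates the idea with `concurrent.futures.ProcessPoolExecutor` as a stand-in for Ray's worker pool (with Ray, the same check would return `os.getpid()` from a remote task):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def count_worker_processes(num_tasks, pool_size=2):
    """Run num_tasks trivial tasks and count distinct worker PIDs.

    If the pool reuses its workers, the count stays at or below
    pool_size; a higher count would indicate fresh processes being
    spawned instead of reused ones.
    """
    with ProcessPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(os.getpid) for _ in range(num_tasks)]
        pids = [f.result() for f in futures]
    return len(set(pids))
```

Applied to this issue, running the PID check immediately after `warmup_tf_tasks` and comparing the set of PIDs against those seen during the Ray Data job would show exactly which tasks landed on unwarmed workers.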

Versions / Dependencies

Ray 2.2 TF 2.10

Reproduction script

The key code is here:

ds = ray.data.read_tfrecords(filenames)
ds2 = ds.map_batches(decode_crop_and_flip_tf_record_batch, batch_size=32, batch_format="pandas", num_cpus=num_cpus)

Entire code / instructions for running: https://github.com/anyscale/air-benchmarks/blob/data-perf-debug-cade/resnet50-train/data_preprocess.py#L587-L588

Issue Severity

Medium: It is a significant difficulty but I can work around it.

clarng commented 1 year ago

This is a great finding; we should use it as one of our baselines for improving the worker pool hit rate.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.