cadedaniel opened this issue 1 year ago
This is a great finding; we should use it as one of our baselines for improving the worker pool hit rate.
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity within 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
What happened + What you expected to happen
I was benchmarking a Ray Data program that loaded files from disk, preprocessed them with TensorFlow, then iterated over the preprocessed data. I would frequently find that the preprocessing throughput would drop quite a bit (by 50% or 60%), apparently at random.
Upon investigation, whenever this slowdown occurs there are TensorFlow logs indicating that TensorFlow is being imported, and TensorFlow takes quite a bit of time to initialize. This led us to conclude that sometimes Ray will create a new worker process instead of reusing the existing ones.
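For context, the workload looks roughly like this (a minimal sketch; the input path, the preprocessing body, and the batch size are placeholders, not the benchmark code):

```python
import ray

ray.init()

# Placeholder path; the real benchmark reads image files from local disk.
ds = ray.data.read_binary_files("/data/imagenet")

def preprocess(batch):
    # TensorFlow is imported inside the task, so any *new* worker process pays
    # the full TF import/initialization cost before it does any useful work.
    import tensorflow as tf
    return [tf.io.decode_image(b) for b in batch]

ds = ds.map_batches(preprocess)

# Iterate over the preprocessed data; this is the stage whose throughput drops.
for batch in ds.iter_batches(batch_size=256):
    pass
```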
I used a snippet along the following lines to ensure enough warm Ray worker processes:
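(The sketch below is illustrative: the worker count and the in-task TensorFlow import are assumptions, not the exact code from the benchmark repo.)

```python
import os
import ray

ray.init()

NUM_WARM_WORKERS = 16  # placeholder: match the expected number of concurrent preprocessing tasks

@ray.remote(num_cpus=1)
def warm_up():
    # Importing TensorFlow here forces each warm worker process to pay the
    # multi-second TF initialization cost up front, before the benchmark runs.
    import tensorflow  # noqa: F401
    return os.getpid()

# Assuming the cluster has at least NUM_WARM_WORKERS CPUs, these tasks run
# concurrently, so Ray starts that many distinct worker processes.
warm_pids = set(ray.get([warm_up.remote() for _ in range(NUM_WARM_WORKERS)]))
print(f"warmed {len(warm_pids)} worker processes")
```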
No runtime envs are being explicitly used. Notably, sometimes this would happen for most of the Ray Data tasks, other times it would only happen for one, and yet other times it would not happen at all.
Versions / Dependencies
Ray 2.2, TensorFlow 2.10
Reproduction script
The key code is here:
Entire code / instructions for running: https://github.com/anyscale/air-benchmarks/blob/data-perf-debug-cade/resnet50-train/data_preprocess.py#L587-L588
Issue Severity
Medium: It is a significant difficulty but I can work around it.