Open bveeramani opened 2 weeks ago
@bveeramani why is back-pressure not kicking in here?
> Calling repartition causes Ray Data to fall back to the old scheduling behavior

@alexeykudinkin Because of this.
@bveeramani can you help me understand why we're falling back to the old behavior then? Is it because override_num_blocks is being used?
No, override_num_blocks isn't relevant here.
All-to-all operations don't implement accurate memory accounting. So, if you have one in your DAG, it doesn't use the op reservation algorithm: https://github.com/ray-project/ray/blob/23cc23b7c295a2959682df785408e534095b2e19/python/ray/data/_internal/execution/resource_manager.py#L77-L84
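To make that gating rule concrete, here is a minimal sketch of the logic the comment describes — my own toy model, not Ray's actual resource_manager code, and the names `Op` and `use_reservation_allocator` are hypothetical:

```python
# Hedged sketch, not Ray's actual code: if any operator in the DAG lacks
# accurate memory accounting, skip the reservation-based allocator entirely.

class Op:
    """A pipeline operator with a flag for accurate memory accounting."""
    def __init__(self, name, accurate_memory_accounting):
        self.name = name
        self.accurate_memory_accounting = accurate_memory_accounting

def use_reservation_allocator(dag_ops):
    """The reservation algorithm needs accurate per-op memory accounting;
    if any operator (e.g. an all-to-all like repartition) can't provide
    it, the executor falls back to the legacy scheduling behavior."""
    return all(op.accurate_memory_accounting for op in dag_ops)

pipeline = [
    Op("read_parquet", True),
    Op("repartition", False),  # all-to-all: no accurate accounting
    Op("map_batches", True),
    Op("write_parquet", True),
]
print(use_reservation_allocator(pipeline))  # False -> legacy scheduling
```

One op without accounting is enough to flip the whole pipeline to the old behavior, which matches the issue's symptom: adding a single repartition changes scheduling for every stage.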
In this case, it'll use the old scheduling behavior (_execution_allowed) and pull as much data as possible from streaming generators: https://github.com/ray-project/ray/blob/23cc23b7c295a2959682df785408e534095b2e19/python/ray/data/_internal/execution/streaming_executor_state.py#L550-L555
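A rough way to picture the difference — a toy model with made-up block sizes and budget, not Ray's implementation: the legacy path drains a task's streaming generator completely, while the reservation algorithm stops pulling once the operator's memory budget is spent.

```python
# Toy model, not Ray code: contrast unbounded pulling from a streaming
# generator (legacy behavior) with a per-operator memory budget.

def pull_outputs(num_blocks, block_size, budget=None):
    """Pull blocks from a task's streaming generator.
    budget=None models the legacy behavior: pull everything."""
    pulled = 0
    for _ in range(num_blocks):
        if budget is not None and (pulled + 1) * block_size > budget:
            break  # back-pressure: leave the rest in the generator
        pulled += 1
    return pulled

MiB = 1024 ** 2
print(pull_outputs(1000, 64 * MiB))                    # 1000: no limit
print(pull_outputs(1000, 64 * MiB, budget=640 * MiB))  # 10: budget hit
```

Without a budget, every block a read task produces lands in the object store immediately, which is what lets the reads crowd out everything else.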
What happened + What you expected to happen
See repro.
I'm trying to load Parquet files that contain image URIs, load those images, and write the loaded images to Parquet files. To ensure that Ray Data fully utilizes the cluster, I repartitioned the input data, but then I ran into a new problem.
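For intuition, here is a small simulation of the failure mode this issue describes — entirely my own construction with made-up numbers, not Ray code: without back-pressure, read outputs occupy the whole object store, so a write task never finds free memory to launch.

```python
# Toy simulation (hypothetical numbers, not Ray internals): greedy reads
# fill the store and starve writes; back-pressure leaves room for writes.

def simulate(store_capacity, block_size, steps, backpressure):
    in_store = 0        # bytes of read outputs sitting in the object store
    writes_launched = 0
    for _ in range(steps):
        # With back-pressure, reads stop once half the store is claimed,
        # reserving the rest for downstream (write) tasks.
        read_limit = store_capacity // 2 if backpressure else store_capacity
        if in_store + block_size <= read_limit:
            in_store += block_size    # greedily pull another read output
        elif in_store >= block_size and store_capacity - in_store >= block_size:
            in_store -= block_size    # a write task drains one block
            writes_launched += 1
    return writes_launched

print(simulate(store_capacity=10, block_size=1, steps=100, backpressure=False))  # 0
print(simulate(store_capacity=10, block_size=1, steps=100, backpressure=True))
```

In the greedy case the store fills and stays full, so zero write tasks ever launch; with the reserved headroom, reads and writes alternate and the pipeline drains.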
Explanation of problem:

- repartition causes Ray Data to fall back to the old scheduling behavior
- The old scheduling behavior pulls as much data as possible from the read_many_uris tasks
- Outputs from read_many_uris fill up memory, so Ray Data can't launch any write tasks to consume the loaded data

Versions / Dependencies
2.37
Reproduction script
Issue Severity
None