I would like to be able to set a "global" concurrency limit (scoped to the current `DataContext`) to cap resource usage.
Some of the Ray Data APIs support the `concurrency` parameter (e.g. `map_batches`, `flat_map`, `map_groups`), but others don't (e.g. `sort`). Based on the documentation (https://docs.ray.io/en/latest/data/performance-tips.html#configuring-resources-and-locality), I was expecting that setting `ctx.execution_options.resource_limits.cpu` would have the same effect as providing `concurrency`, but this doesn't seem to be the case. Moreover, some operations call other operations (e.g. I believe that `groupby` calls `sort`) without any way to pass such a `concurrency` limit through. As a result, Ray ends up scheduling a task for every batch at the same time, leading to resource exhaustion.
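To make the requested semantics concrete, here is a minimal conceptual sketch in plain Python. It is not Ray code and does not use any Ray API: `GlobalConcurrencyLimit`, `run_stage`, and `process_batch` are hypothetical names made up for illustration. The idea is that every stage draws from one shared cap, so even a stage that submits one task per batch can never have more than the cap running at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class GlobalConcurrencyLimit:
    """Hypothetical stand-in for a DataContext-wide cap; not a Ray API."""

    def __init__(self, max_concurrency: int) -> None:
        self._slots = threading.Semaphore(max_concurrency)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest number of tasks observed running at once

    def run(self, fn, *args):
        # Block until one of the shared slots is free, then run the task.
        with self._slots:
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._running -= 1


def process_batch(batch_id: int) -> int:
    """Toy per-batch work (assumption: stands in for any operator's task)."""
    time.sleep(0.01)
    return batch_id * 2


def run_stage(limit: GlobalConcurrencyLimit, batch_ids: list) -> list:
    # One worker per batch mimics "a task for every batch at the same time";
    # the shared limit is the only thing holding concurrency down.
    with ThreadPoolExecutor(max_workers=len(batch_ids)) as pool:
        return list(pool.map(lambda b: limit.run(process_batch, b), batch_ids))


limit = GlobalConcurrencyLimit(max_concurrency=4)
first = run_stage(limit, list(range(16)))   # e.g. a stage with no per-call limit
second = run_stage(limit, first)            # a downstream stage reusing the same cap
print("peak concurrent tasks:", limit.peak)
```

The point of the sketch is that the cap lives in one shared object rather than in each call, which is what a context-level setting would provide for operations such as `sort` that take no `concurrency` argument.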
Use case
The use case is simple: prevent resource exhaustion when running Ray Data operations that currently don't support the `concurrency` limit.