ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] It is not allowed to specify both num_cpus and num_gpus for map tasks #33908

Open v4if opened 1 year ago

v4if commented 1 year ago

What happened + What you expected to happen

It is not allowed to specify both num_cpus and num_gpus for map tasks. When only num_gpus is specified, num_cpus appears to default to 1, so actors end up pending due to insufficient CPU resources. However, GPU computing is often the performance bottleneck of the system. How can actor concurrency be increased while GPU resources are still available?

ray status

 {'CPU': 1.0, 'GPU': 0.01}: 4+ pending tasks/actors

run log

Resource usage vs limits: 16.0/16.0 CPU, 0.2/1.0 GPU, 0.0 MiB/13.49 GiB object_store_memory 0:   0%|                                    | 0/1 [14:11<?, ?it/s]
ReadRange: 16 active, 8598 queued 1:  14%|██████████▊                                                                   | 1386/10000 [14:11<01:06, 130.05it/s]
MapBatches(ModelPredict): 30 active, 0 queued, 16 actors (4 pending) [0 locality hits, 1386 misses] 2:  14%|█▍         | 1356/10000 [14:25<1:08:53,  2.09it/s]
output: 0 queued 3:  14%|████████████▋                                                                                 | 1356/10000 [14:25<1:08:56,  2.09it/s]

Versions / Dependencies

ray, version 3.0.0.dev0

cluster_resources

{'memory': 256000000000.0, 'node:172.18.0.196': 1.0, 'object_store_memory': 57921323827.0, 'GPU': 1.0, 'accelerator_type:T4': 1.0, 'node:172.16.1.16': 1.0, 'CPU': 16.0}
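
(For reference, a dict of this shape is what ray.cluster_resources() returns when attached to the running cluster:)

import ray

ray.init(address="auto")  # attach to the running cluster
print(ray.cluster_resources())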

Reproduction script

import ray
import time

class ModelPredict:
    def __call__(self, df):
        time.sleep(10)  # simulate a slow GPU-bound model
        return df

ds = ray.data.range_table(10000, parallelism=10000)
ds = ds.map_batches(
    ModelPredict,
    # num_cpus=0.5,  # uncommenting this alongside num_gpus raises the error in the title
    num_gpus=0.01,
    compute="actors",
    batch_size=1,
)
for batch in ds.iterator().iter_batches(batch_size=1):
    ...
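
For completeness, this is the variant that fails (a minimal sketch; the exact exception text is assumed to match the issue title):

ds = ds.map_batches(
    ModelPredict,
    num_cpus=0.5,  # would allow 32 concurrent actors on the 16 CPUs above
    num_gpus=0.01,
    compute="actors",
    batch_size=1,
)
# ValueError: It is not allowed to specify both num_cpus and num_gpus for map tasks.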

Issue Severity

High: It blocks me from completing my task.

clarng commented 1 year ago

This seems to be a Ray Data issue: the resources are specified through the Dataset API, which uses Ray Core internally.
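
For context, a minimal Ray Core sketch (my own illustration; CorePredict is a hypothetical actor, and a GPU node is assumed): Ray Core itself accepts both arguments on a remote actor, so the restriction appears to live in Ray Data's map_batches validation rather than in Ray Core:

import ray

# Plain Ray Core allows combining fractional CPU and GPU requests on one actor:
@ray.remote(num_cpus=0.5, num_gpus=0.01)
class CorePredict:
    def predict(self, df):
        return df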

choiikkyu commented 1 year ago

Same issue for me. Did you solve it?

msminhas93 commented 1 year ago

Any update on this @hora-anyscale @clarng?

hora-anyscale commented 1 year ago

cc: @xieus

sdcope3 commented 10 months ago

Any progress on this issue? Is the implication that if num_gpus is defined, the associated task is constrained to 1 CPU?

seastar105 commented 6 months ago

@raulchen Any progress on this issue? Or is there an alternative way to map a worker to a fractional GPU together with several CPUs?

danickzhu commented 4 months ago

@raulchen GPU utilization is bottlenecked by num_cpus (currently 1) for the mapper task. Do you have any suggestions?

Superskyyy commented 4 months ago

I believe this is intentional behavior to avoid deadlocks, but there could be workarounds. I'm planning to look into it in July.
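
One possible workaround until then (an untested sketch, not a confirmed fix): emulate the GPU cap with a custom resource so that num_cpus stays free to set. This assumes the GPU node is started with something like ray start --resources='{"gpu_slots": 100}', that map_batches forwards resources through its ray_remote_args, and that the actor picks its own device, since Ray only sets CUDA_VISIBLE_DEVICES for num_gpus requests:

ds = ds.map_batches(
    ModelPredict,
    num_cpus=0.5,                # allowed now, since num_gpus is not passed
    resources={"gpu_slots": 1},  # hypothetical custom resource, 100 slots per GPU node
    compute="actors",
    batch_size=1,
)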

pzdkn commented 2 months ago

Is it possible to use placement_groups here?

I tried:

from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

pg = ray.util.placement_group([{"CPU": 1}, {"GPU": 1}] * num_workers, strategy="PACK")
predictions = ds_val.map_batches(
    predictor_cls,
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        pg, placement_group_capture_child_tasks=True
    ),
)

It seems, however, that the resources are not available to the actor.
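
If it helps, my understanding (an assumption, not verified against this case) is that a placement group only reserves and co-locates resources; an actor still acquires only what its own num_cpus/num_gpus arguments request, so the reserved GPU bundle stays unused unless num_gpus is also passed, which runs back into the original restriction. A minimal Ray Core sketch of the distinction:

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

pg = placement_group([{"CPU": 1, "GPU": 1}], strategy="PACK")
ray.get(pg.ready())  # wait until the bundle is reserved

# The group reserves the GPU, but the actor must still request it itself:
@ray.remote(num_gpus=1)
class Predictor:
    pass

actor = Predictor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()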