ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray on Spark fractional GPU error calling setup_ray_cluster #39537

Closed Joseph-Sarsfield closed 3 months ago

Joseph-Sarsfield commented 1 year ago

What happened + What you expected to happen

  1. Bug: `spark.task.resource.gpu.amount` does not support fractional GPU values, which are required for parallel Spark jobs on a GPU. See https://github.com/ray-project/ray/blob/master/python/ray/util/spark/cluster_init.py, line 1026: `num_spark_task_gpus = int(spark.sparkContext.getConf().get("spark.task.resource.gpu.amount", "0"))`

  2. Suggested fix: ignore `spark.task.resource.gpu.amount` when `num_spark_task_gpus` is passed manually.
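The two points above can be sketched in plain Python. The helper below is hypothetical (it is not Ray's actual code); it parses the Spark conf value as a float, since Spark allows fractional amounts like `"0.5"`, and lets an explicitly passed value take precedence, as point 2 suggests:

```python
from typing import Optional

def resolve_worker_gpus(conf_value: str, override: Optional[int] = None) -> float:
    """Hypothetical helper: resolve the per-task GPU amount.

    - Parses spark.task.resource.gpu.amount as a float, because Spark
      permits fractional values such as "0.5".
    - If the caller passed a value explicitly (override), the Spark
      conf is ignored entirely.
    """
    if override is not None:
        return float(override)
    return float(conf_value)  # float() accepts "0.5"; int() would raise

print(resolve_worker_gpus("0.5"))              # 0.5
print(resolve_worker_gpus("0.5", override=1))  # 1.0
```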

Versions / Dependencies

All versions

Reproduction script

Set `spark.task.resource.gpu.amount` to a fractional value and pass a non-`None` `num_gpus_worker_node` in the call to `setup_ray_cluster`.
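Without a live Spark session, the failure can be simulated by applying the same `int()` parse that `cluster_init.py` performs to a fractional conf value (the dict below is a stand-in for the real `SparkConf`):

```python
# Stand-in for the Spark conf; a real session would return this string
# from spark.sparkContext.getConf().get(...).
conf = {"spark.task.resource.gpu.amount": "0.5"}

raw = conf.get("spark.task.resource.gpu.amount", "0")
try:
    num_spark_task_gpus = int(raw)  # mirrors the parse in cluster_init.py
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: '0.5'
```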

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Joseph-Sarsfield commented 1 year ago

Hi @WeichenXu123 this is related to Ray on Spark

Joseph-Sarsfield commented 3 months ago

@jjyao do we have an update on this? `spark.task.resource.gpu.amount` can legitimately be a decimal value and shouldn't be used to set `num_gpus_worker_node`.

`ValueError: invalid literal for int() with base 10: '0.5'` https://github.com/ray-project/ray/blob/master/python/ray/util/spark/cluster_init.py#L1026C44-L1026C73

WeichenXu123 commented 3 months ago

Hi, Ray on Spark doesn't support fractional GPUs yet; we can add support if you need it.

Joseph-Sarsfield commented 3 months ago

@WeichenXu123 yes please; we currently maintain a fork of Ray to bypass the exception.

WeichenXu123 commented 3 months ago

@Joseph-Sarsfield PR is out.