Closed: Joseph-Sarsfield closed this issue 3 months ago
Hi @WeichenXu123, this is related to Ray on Spark.
@jjyao do we have an update on this? "spark.task.resource.gpu.amount" can legitimately be a decimal value and shouldn't be used to set num_gpus_worker_node.
ValueError: invalid literal for int() with base 10: '0.5' https://github.com/ray-project/ray/blob/master/python/ray/util/spark/cluster_init.py#L1026C44-L1026C73
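The error above can be reproduced with nothing but the standard library; this small sketch shows why the current code fails on fractional conf values and why parsing as float first does not:

```python
# Demonstrates the failure mode in cluster_init.py: int() rejects
# fractional strings, while parsing as float first succeeds.
raw = "0.5"

try:
    int(raw)
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: '0.5'

print(float(raw))  # 0.5
```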
Hi, Ray on Spark doesn't support fractional GPUs yet; we can add support if you need it.
@WeichenXu123 yes please, we have currently forked Ray to bypass the exception.
@Joseph-Sarsfield PR is out.
What happened + What you expected to happen
Bug: spark.task.resource.gpu.amount does not support fractional GPU values, which are required for running parallel Spark jobs on a GPU. See https://github.com/ray-project/ray/blob/master/python/ray/util/spark/cluster_init.py line 1026:

num_spark_task_gpus = int(
    spark.sparkContext.getConf().get("spark.task.resource.gpu.amount", "0")
)
Suggested fix: ignore spark.task.resource.gpu.amount if the GPU count (num_gpus_worker_node) is passed manually.
Versions / Dependencies
All versions
Reproduction script
Set spark.task.resource.gpu.amount to a fractional value and pass a non-None num_gpus_worker_node in the call to setup_ray_cluster.
Issue Severity
Medium: It is a significant difficulty but I can work around it.