tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks
Apache License 2.0

Limitation: Spark "task_gpu_amount" cannot be less than 1 #204

Open chenya-zhang opened 1 year ago

chenya-zhang commented 1 year ago

Hi folks, here is some context on the limitation we encountered.

  1. There is a check in "mirrored_strategy_runner.py" that rejects any "task_gpu_amount" less than 1 (a paraphrased sketch follows this list): https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-distributor/spark_tensorflow_distributor/mirrored_strategy_runner.py#L161-L164
  2. Per NVIDIA's docs, "spark.task.resource.gpu.amount" can be set to a decimal amount, e.g. 0.5 so that two tasks share one GPU (example configuration after this list): https://www.nvidia.com/en-us/ai-data-science/spark-ebook/getting-started-spark-3/
  3. TensorFlow has an option to claim only a fraction of a GPU's memory, which limits per-process memory usage and makes GPU sharing practical (example after this list): https://www.tensorflow.org/api_docs/python/tf/compat/v1/GPUOptions, https://github.com/tensorflow/tensorflow/issues/25138
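
For reference, the linked check looks roughly like this. This is a paraphrase of the linked lines, not a verbatim copy; the exact variable names and error message may differ:

```python
# Paraphrase of the check at mirrored_strategy_runner.py#L161-L164
# (approximate, not verbatim).
task_gpu_amount = int(spark_context.getConf().get('spark.task.resource.gpu.amount'))
if task_gpu_amount < 1:
    raise ValueError('"spark.task.resource.gpu.amount" cannot be less than 1.')
```

Note that if the value really is cast with int(), a fractional setting such as "0.5" fails even before the comparison, since int('0.5') raises ValueError in Python.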
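
A minimal sketch of the fractional configuration from point 2, assuming a cluster where GPU scheduling and discovery are already set up; the 0.5 value is illustrative:

```python
from pyspark.sql import SparkSession

# Each executor owns one GPU, while each task claims half a GPU,
# so Spark schedules two tasks concurrently per GPU.
spark = (
    SparkSession.builder
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.5")
    .getOrCreate()
)
```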
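
On the TensorFlow side, the documented per_process_gpu_memory_fraction field of tf.compat.v1.GPUOptions caps how much of each GPU's memory one process may allocate, so two such processes can share a device; the 0.5 fraction here mirrors the 0.5 task amount above:

```python
import tensorflow as tf

# Allow this process to allocate at most ~50% of each visible GPU's
# memory, leaving the other half for a second task on the same device.
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.5)
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
sess = tf.compat.v1.Session(config=config)
```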

Given the above, does it make sense for the Spark TensorFlow distributor to allow the GPU amount per task to be less than 1?