tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks
Apache License 2.0

[ML-10682] Make tensorflow-spark-distributor work for spark with GPU scheduling and CUDA_VISIBLE_DEVICES set in spark task #160

Closed WeichenXu123 closed 4 years ago

WeichenXu123 commented 4 years ago

Make tensorflow-spark-distributor work for spark with GPU scheduling and CUDA_VISIBLE_DEVICES set in spark task

In the Spark task, check the env var CUDA_VISIBLE_DEVICES first. If it is set, use the indices from TaskContext.resources() to slice CUDA_VISIBLE_DEVICES, instead of using all devices listed in CUDA_VISIBLE_DEVICES.
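
For illustration, here is a minimal Python sketch of the slicing logic described above (not the PR's actual implementation). It assumes Spark GPU scheduling is enabled so that TaskContext.resources()["gpu"].addresses holds the GPU indices assigned to the current task; the helper name get_gpus_for_task is hypothetical.

```python
import os

from pyspark import TaskContext


def get_gpus_for_task():
    """Return the GPU device ids this Spark task should expose to TensorFlow."""
    ctx = TaskContext.get()
    # Addresses Spark assigned to this task via GPU scheduling, e.g. ["0", "2"].
    assigned = [addr.strip() for addr in ctx.resources()["gpu"].addresses]

    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible:
        # CUDA_VISIBLE_DEVICES is already set in the task, so treat the assigned
        # addresses as indices into that list and slice it, rather than using
        # everything in CUDA_VISIBLE_DEVICES.
        visible_devices = visible.split(",")
        return [visible_devices[int(i)] for i in assigned]
    # No CUDA_VISIBLE_DEVICES set: use the assigned addresses directly.
    return assigned
```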

Test

Run

docker-compose build --build-arg PYTHON_INSTALL_VERSION=3.7
./tests/integration/run.sh
jhseu commented 4 years ago

Looks good to me, but you marked WIP in the pull request description. Is this ready to merge?

WeichenXu123 commented 4 years ago

This breaks the test test_equal_gpu_allocation. It is weird. I am debugging...

WeichenXu123 commented 4 years ago

@mengxr @jhseu Ready.

WeichenXu123 commented 4 years ago

@mengxr Ready. About the PySpark worker reuse issue, I discussed it with @hyukjin offline; there are 2 approaches: (1) set spark.python.worker.reuse=false (this config cannot be changed at runtime); (2) in the Python remote function, manually detect whether the code is running in a reused worker and raise an error if so (this is risky, since it is hard to detect whether we are on a reused worker). In this PR I use approach 1.
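
For reference, a minimal sketch of approach 1 (an assumed SparkSession setup, not code from this PR): since spark.python.worker.reuse cannot be changed at runtime, it has to be set when the session is created. The app name and GPU resource amount below are illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tf-distributor-gpu")                   # hypothetical app name
    .config("spark.python.worker.reuse", "false")    # each task gets a fresh Python worker
    .config("spark.task.resource.gpu.amount", "1")   # GPU scheduling, per the PR's scenario
    .getOrCreate()
)
```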