ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[Core | Ray on Spark] Allow cluster_mode='yarn' for Ray on Spark #45051

Open fersarr opened 6 months ago

fersarr commented 6 months ago

Description

I have been trying to run Ray on our YARN cluster (where we run our Spark apps) and recently discovered that it might be easier to use Ray on Spark (https://docs.databricks.com/en/machine-learning/ray-integration.html). However, we use "spark.master": "yarn", and setup_ray_cluster throws an error because Ray on Spark only supports Spark in standalone mode, local-cluster mode, or local mode:

Traceback (most recent call last):
  File "rayplay.py", line 66, in <module>
    ray_on_spark()
  File "rayplay.py", line 57, in ray_on_spark
    setup_ray_cluster(
  File "/users/is/fsaraviara/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/util/spark/cluster_init.py", line 998, in setup_ray_cluster
    raise RuntimeError(
RuntimeError: Ray on Spark only supports spark cluster in standalone mode, local-cluster mode or spark local mode.
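For context, a minimal sketch of the kind of script that hits this check (the shape of rayplay.py, the SparkSession configuration, and the worker count are illustrative assumptions, not taken from the original report; the worker-count keyword has also varied across Ray releases, e.g. num_worker_nodes in older versions and max_worker_nodes in newer ones):

# Minimal repro sketch (assumed shape of the rayplay.py above; names and
# parameters here are illustrative, not from the original report).
from pyspark.sql import SparkSession
from ray.util.spark import setup_ray_cluster

spark = (
    SparkSession.builder
    .master("yarn")  # this master value is what trips the check
    .appName("ray-on-spark-yarn")
    .getOrCreate()
)

# Raises the RuntimeError above before any Ray nodes are launched, because
# the spark.master validation in cluster_init.py rejects "yarn".
setup_ray_cluster(max_worker_nodes=2)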

I have locally patched the if-statement at https://github.com/ray-project/ray/blame/a45bfe30bb2d190d40bfc5d2c6c97060800c1dc6/python/ray/util/spark/cluster_init.py#L960C22-L960C22 to also allow "spark.master": "yarn", and everything seems to work fine. Was there a particular reason YARN was not allowed initially?
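For reference, the guard being patched looks roughly like the sketch below; the exact variable names and condition in cluster_init.py are assumptions reconstructed from the error message, and the last clause is the local change described above:

# Sketch of the spark.master validation in ray/util/spark/cluster_init.py
# (names and exact condition are assumptions based on the error message).
spark_master = spark.sparkContext.master

is_supported = (
    spark_master.startswith("spark://")  # standalone mode
    or spark_master.startswith("local")  # local and local-cluster modes
    or spark_master == "yarn"            # the locally patched addition
)
if not is_supported:
    raise RuntimeError(
        "Ray on Spark only supports spark cluster in standalone mode, "
        "local-cluster mode or spark local mode."
    )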

jjyao commented 6 months ago

cc @WeichenXu123 can you take a look at this?