Issues with running on Ray Client

wctmanager commented 1 year ago

The current documentation on using RayDP with Ray Client only says: "RayDP works the same way when using ray client. However, spark driver would be on the local machine." It would be very helpful to have at least one example of how it should be configured and used. With Ray v2.1.0 and RayDP v1.5.0 (using Azure Kubernetes Service as the backend), something as straightforward as:

import ray
import raydp

ray.init(address="ray://raycluster-kuberay-head-svc.default.svc.cluster.local:10001")
spark = raydp.init_spark(app_name='RayDP Example', num_executors=1, executor_cores=1, executor_memory='500M')
df = spark.createDataFrame([('look',), ('spark',), ('tutorial',), ('spark',), ('look',), ('python',)], ['word'])

initializes Ray Client and RayDP/PySpark and creates the dataset without errors, but then

df.show()

creates an endless stream of

[Stage 0:> (0 + 0) / 1]
2023-06-04 22:18:13,383 WARN TaskSchedulerImpl [task-starvation-timer]: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2023-06-04 22:18:28,382 WARN TaskSchedulerImpl [task-starvation-timer]: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2023-06-04 22:18:43,382 WARN TaskSchedulerImpl [task-starvation-timer]: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Wrapping "init_spark" in a Ray actor works fine, though. Any advice or an example of running remotely with Ray Client would be highly appreciated. Thank you.
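The actor workaround mentioned above could look roughly like the following. This is a minimal sketch, not the poster's actual code: the names `SparkDriverActor`, `count_words_in_actor`, and `to_rows` are hypothetical, and it assumes RayDP and PySpark are installed on the cluster nodes. Because `raydp.init_spark` is called inside the actor, the Spark driver runs in the Ray cluster rather than on the local machine, and only plain Python tuples travel back to the client.

```python
def to_rows(words):
    """Turn a flat list of words into one-column rows for createDataFrame."""
    return [(word,) for word in words]


def count_words_in_actor(words, address):
    """Count words with Spark, keeping the Spark driver inside a Ray actor."""
    import ray  # imported lazily so this module loads without a cluster

    ray.init(address=address)

    @ray.remote
    class SparkDriverActor:  # hypothetical name, for illustration only
        def __init__(self):
            # Runs on a cluster node, so the Spark driver starts there too.
            import raydp
            self.spark = raydp.init_spark(
                app_name="RayDP Example",
                num_executors=1,
                executor_cores=1,
                executor_memory="500M",
            )

        def word_counts(self, rows):
            df = self.spark.createDataFrame(rows, ["word"])
            counted = df.groupBy("word").count().collect()
            # Return plain tuples instead of Spark Row objects.
            return sorted((row["word"], row["count"]) for row in counted)

        def stop(self):
            import raydp
            raydp.stop_spark()

    try:
        actor = SparkDriverActor.remote()
        result = ray.get(actor.word_counts.remote(to_rows(words)))
        ray.get(actor.stop.remote())
        return result
    finally:
        ray.shutdown()
```

It would be called with the same address as in the original snippet, for example `count_words_in_actor(['look', 'spark', 'tutorial'], "ray://raycluster-kuberay-head-svc.default.svc.cluster.local:10001")`.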

kira-lin commented 1 year ago

Hi, I was not able to reproduce this issue in my environment. Maybe it's due to the network. As we said in the documentation, in Ray Client mode the Spark executors run in the Ray cluster, but the Spark driver runs on the local machine where the script is run. Can that Spark driver connect to those executors? Can you inspect the java-worker-*.log files in /tmp/ray/session_latest/logs/?

https://github.com/oap-project/raydp/issues/299 This issue might be related. Are you using Mac for that local machine?
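To speed up the log inspection suggested above, a small helper can scan the java-worker-*.log files for suspicious lines. This is only a sketch: the function name `find_suspect_lines` and the keyword list are assumptions, not part of Ray or RayDP, and the script has to run on a node that holds the logs (a Ray cluster node, not necessarily the local machine).

```python
import glob
import os


def find_suspect_lines(
    log_dir="/tmp/ray/session_latest/logs",
    keywords=("ERROR", "Exception", "Connection refused"),  # assumed keywords
):
    """Return (file name, line number, text) for matching java-worker log lines."""
    hits = []
    pattern = os.path.join(log_dir, "java-worker-*.log")
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(keyword in line for keyword in keywords):
                    hits.append((os.path.basename(path), number, line.rstrip()))
    return hits
```

Printing each tuple returned by `find_suspect_lines()` gives a quick overview of connection failures before reading the full logs.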

wctmanager commented 1 year ago

Thanks. Right, it is indeed a networking issue (between the local Spark driver and the remote executor). The question is rather: what are the network requirements for the driver and executors to work together (open ports, something else)? Thank you. P.S. By the way, in this particular case both the driver and the executor are in the same k8s cluster, but in different pods and namespaces.

kira-lin commented 1 year ago

I see. The driver node should have access to all ports on the executor nodes. I think this is enough.
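Whether the driver can actually reach a given executor port can be checked with a plain TCP connection attempt from the driver's pod or machine. The sketch below uses only the Python standard library. The helper names `can_connect` and `check_ports` are hypothetical, and which host and ports to probe is an assumption: executor ports are typically assigned dynamically, so the values have to come from the executor logs or the cluster configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to whether it is reachable from this machine."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Running `check_ports(executor_host, [port_a, port_b])` from the driver side, with an executor pod address and the ports found in its logs, shows which connections a network policy or firewall is blocking.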

Issues with running on Ray Client #352