oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0

Cloudpickle errors with Ray 2.3.0 #316

Closed. peterghaddad closed this issue 1 year ago

peterghaddad commented 1 year ago

Upgrading to Ray 2.3.0 causes cloudpickle errors.

self._set_up_master(resources=self._get_master_resources(configs), kwargs=None)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/raydp/spark/ray_cluster.py", line 58, in _set_up_master
    ray.get(self._spark_master_handle.start_up.remote())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: RayDPSparkMaster
    actor_id: 133a1106ded55a2df7cccc5305000000
    pid: 7969
    name: spark-test_SPARK_MASTER
    namespace: 85dc7695-b493-44e6-acaf-71a164375d2c
    ip: 20.128.3.205
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None. The worker may have exceeded K8s pod memory limits.
 Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/raydp/spark/ray_cluster_master.py", line 56, in start_up
    self._gateway.jvm.org.apache.spark.deploy.raydp.RayAppMaster.setProperties(jvm_properties)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.deploy.raydp.RayAppMaster.setProperties.
: java.lang.NullPointerException
    at java.util.Hashtable.put(Hashtable.java:460)
    at java.util.Properties.setProperty(Properties.java:166)
    at java.lang.System.setProperty(System.java:812)
    at org.apache.spark.deploy.raydp.RayAppMaster$.$anonfun$setProperties$1(RayAppMaster.scala:336)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:400)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:728)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:728)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:728)
    at org.apache.spark.deploy.raydp.RayAppMaster$.setProperties(RayAppMaster.scala:335)
    at org.apache.spark.deploy.raydp.RayAppMaster.setProperties(RayAppMaster.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

During handling of the above exception, another exception occurred:

ray::PySparkApp.__init__() (pid=7774, ip=20.128.3.205, repr=<__main__.PySparkApp object at 0x7f09c003eaf0>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/exceptions.py", line 32, in to_bytes
    serialized_exception=pickle.dumps(self),
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
    cp.dump(obj)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
An unexpected internal error occurred while the worker was executing a task.
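
For readers unfamiliar with the final `TypeError`: it appears while Ray is serializing the original exception, not while running user code. The snippet below is a minimal stdlib-only sketch of that class of failure, assuming the exception object ends up holding a reference to something that contains a lock. `GatewayError` and `try_pickle` are hypothetical names for illustration, not part of Ray or RayDP.

```python
import pickle
import threading


class GatewayError(Exception):
    """Hypothetical exception that keeps a reference to a lock."""

    def __init__(self, message, lock):
        super().__init__(message)
        # Extra attributes live in __dict__, which pickle serializes too.
        self.lock = lock


def try_pickle(exc):
    """Return None if exc pickles cleanly, else the TypeError message."""
    try:
        pickle.dumps(exc)
        return None
    except TypeError as err:
        return str(err)


# A plain exception serializes without trouble.
plain_result = try_pickle(ValueError("plain exceptions pickle fine"))

# An exception carrying an RLock cannot be serialized.
locked_result = try_pickle(GatewayError("holds a lock", threading.RLock()))

print(plain_result)
print(locked_result)
```

This suggests the pickling error is a secondary symptom: the first exception in the traceback (the `NullPointerException` from `setProperties`) is the one to chase.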

This happens when running the example from the RayDP README. I believe it is caused by Ray Core, but I am interested in whether others are seeing the same issue after the upgrade. I also tested with Python 3.9.
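
One way to narrow this down: `java.util.Hashtable` rejects null keys and values, so the `NullPointerException` under `System.setProperty` is consistent with a property reaching the JVM with no value. The traceback does not say which property that was. The sketch below is a hypothetical pre-flight check, assuming the properties are available as a plain Python dict before they are handed to `setProperties`; the helper name and the example keys are made up for illustration.

```python
def find_null_properties(properties):
    """Return the sorted keys whose value is None.

    A None value on the Python side would become a Java null, which
    System.setProperty cannot store.
    """
    return sorted(key for key, value in properties.items() if value is None)


# Hypothetical property map; the real contents are not shown in this issue.
example_properties = {
    "spark.app.name": "spark-test",
    "hypothetical.unset.option": None,
}

missing = find_null_properties(example_properties)
print(missing)
```

If a check like this reports any keys, those would be the first candidates to compare between the Ray 2.3.0 environment and the previous working one.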

Environment Used:

@kira-lin curious if you had any thoughts on this. Thanks in advance.

peterghaddad commented 1 year ago

Looks like raydp-nightly fixes this problem.