oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0

Failed to initialize a local spark cluster with raydp #322

Closed garylavayou closed 1 year ago

garylavayou commented 1 year ago

Problem Description

I am trying to create a Spark cluster with RayDP, using the following code (from Ray's documentation):

import ray
import raydp

ray.init(address='auto')
spark = raydp.init_spark(app_name='RayDP Example',
                         num_executors=2,
                         executor_cores=2,
                         executor_memory='1GB')

but it failed with the following error (for more details, see the related logs below):

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.deploy.raydp.RayAppMaster.setProperties.
: java.lang.NoClassDefFoundError: io/ray/runtime/config/RayConfig

Configurations

System: WSL2 Ubuntu 22.04.2
JDK version: openjdk8u362-b09 (Eclipse Temurin) or openjdk 11.0.18 (Ubuntu package)
Python version: 3.10.10
Ray version: 2.2.0
raydp version: 1.5.0
pyspark version: 3.3.1
py4j version: 0.10.9.5

Related Logs

2023-03-27 18:00:38,690 WARNING worker.py:1851 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 823, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 875, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 830, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 834, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 780, in ray._raylet.execute_task.function_executor
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/raydp/spark/ray_cluster_master.py", line 56, in start_up
    self._gateway.jvm.org.apache.spark.deploy.raydp.RayAppMaster.setProperties(jvm_properties)
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.deploy.raydp.RayAppMaster.setProperties.
: java.lang.NoClassDefFoundError: io/ray/runtime/config/RayConfig
    at org.apache.spark.deploy.raydp.RayAppMaster$.setProperties(RayAppMaster.scala:339)
    at org.apache.spark.deploy.raydp.RayAppMaster.setProperties(RayAppMaster.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: io.ray.runtime.config.RayConfig
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    ... 13 more

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 1135, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 1045, in ray._raylet.execute_task_with_cancellation_handler
  File "python/ray/_raylet.pyx", line 782, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 945, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 589, in ray._raylet.store_task_errors
  File "python/ray/_raylet.pyx", line 2447, in ray._raylet.CoreWorker.store_task_outputs
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/ray/_private/serialization.py", line 450, in serialize
    return self._serialize_to_msgpack(value)
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/ray/_private/serialization.py", line 405, in _serialize_to_msgpack
    value = value.to_bytes()
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/ray/exceptions.py", line 32, in to_bytes
    serialized_exception=pickle.dumps(self),
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/gary/miniconda3/envs/dataproc/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 627, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
An unexpected internal error occurred while the worker was executing a task.
2023-03-27 18:00:38,695 WARNING worker.py:1851 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff9d77052947212287a6fbd6b702000000 Worker ID: ec72451ae33b09f9c4d65e169a83b0ddd4c1e1cc985c304e60a49620 Node ID: 9c9a10476d91172e52952389c5ef92f8e39c6afa4bf5972244da1aa9 Worker IP address: 172.19.154.42 Worker port: 43883 Worker PID: 16545 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None.
kira-lin commented 1 year ago

hi @garylavayou , the error shows that a class in Ray cannot be found. Can you please check whether the ray jar is present under where-you-installed-ray/jars? You can find where ray is installed from Python: import ray; ray.__file__
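
The check suggested above can be done without starting a Ray cluster; a minimal sketch using only the standard library (assuming ray is installed in the active environment):

```python
# Locate the installed ray package and check whether its Java jars are present.
# Uses importlib instead of importing ray, so it works even if ray fails to load.
import importlib.util
from pathlib import Path

spec = importlib.util.find_spec("ray")
if spec is None or spec.origin is None:
    print("ray is not installed in this environment")
else:
    pkg_dir = Path(spec.origin).parent
    jars_dir = pkg_dir / "jars"
    print(f"ray installed at: {pkg_dir}")
    print(f"jars directory exists: {jars_dir.is_dir()}")
    if jars_dir.is_dir():
        for jar in sorted(jars_dir.glob("*.jar")):
            print(jar.name)
```

If the jars directory is missing or empty, the JVM side of Ray (which raydp's RayAppMaster depends on) is not installed, which matches the NoClassDefFoundError above.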

garylavayou commented 1 year ago

hi @garylavayou , the error shows that a class in Ray cannot be found. Can you please check whether the ray jar is present under where-you-installed-ray/jars? You can find where ray is installed from Python: import ray; ray.__file__

@kira-lin There are no jars in the ray package. Did you mean the jars in raydp? I found some jars in the raydp package.

-rw-r--r-- 1 gary gary 276K Mar 26 22:47 raydp-1.5.0.jar
-rw-r--r-- 1 gary gary  16K Mar 26 22:47 raydp-shims-common-1.5.0.jar
-rw-r--r-- 1 gary gary  14K Mar 26 22:47 raydp-shims-spark321-1.5.0.jar
-rw-r--r-- 1 gary gary  15K Mar 26 22:47 raydp-shims-spark330-1.5.0.jar
kira-lin commented 1 year ago

@garylavayou , I mean the ray package.

There are no jars in the ray package.

Then that's the problem: your installed ray package does not include the Java part. How did you install it? The package should contain the jar if you install it via pip. You can try pip install ray[default]. If that does not work, it might be due to your WSL environment; in that case, you can try building Ray from source.

garylavayou commented 1 year ago

@garylavayou , I mean the ray package.

There are no jars in the ray package.

Then that's the problem: your installed ray package does not include the Java part. How did you install it? The package should contain the jar if you install it via pip. You can try pip install ray[default]. If that does not work, it might be due to your WSL environment; in that case, you can try building Ray from source.

@kira-lin I installed ray in a Conda virtual environment with the following packages:

   - polars
   - pandas
   - pandarallel
   - dask
   - ray-default==2.2
   - pylint
   - autopep8
   - ipykernel
   - rich
   - tqdm
   - watchdog
kira-lin commented 1 year ago

I see. Can you please try uninstalling it and installing via pip, and see if the jar is present?

nttg8100 commented 1 year ago

I have a similar error. I used pip both in a conda env and with the system Python, but neither installs the Java part. Is there any solution?

peterghaddad commented 1 year ago

Also experiencing this issue, although ray_dist.jar exists in the jars directory. Running Ray 2.3.0 and raydp 1.5.0.

garylavayou commented 1 year ago

I see. Can you please try uninstalling it and installing via pip, and see if the jar is present?

@kira-lin the problem is solved. I created the environment without ray and raydp (python=3.10), and then used pip to install ray[default]==2.2.0 and raydp.

The conda version is maintained by the community; installing via pip is officially recommended.
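
The fix described above can be sketched as a few shell commands; the environment name and version pins below follow what was reported in this thread (assumption: a fresh conda environment, pinning Ray to the range RayDP 1.5.0 supports):

```shell
# Create a clean conda env without ray/raydp, then install both via pip,
# pinning Ray to a version RayDP 1.5.0 supports (2.1.0 - 2.2.0).
conda create -n dataproc python=3.10 -y
conda activate dataproc
pip install "ray[default]==2.2.0" raydp==1.5.0

# Sanity check: the Java side of Ray should now be present.
ls "$(python -c 'import ray, os; print(os.path.dirname(ray.__file__))')/jars"
```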

garylavayou commented 1 year ago

Also experiencing this issue, although ray_dist.jar exists in the jars directory. Running Ray 2.3.0 and raydp 1.5.0.

@peterghaddad be sure to install compatible versions of the related packages.

From the release notes of raydp 1.5: Supports Ray 2.1.0 - 2.2.0; Supports Spark 3.1 - 3.3.
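
The compatibility check implied by those release notes can be sketched as a small helper; the function name and the version bounds (2.1.0 - 2.2.0, from the notes quoted above) are illustrative, not part of the raydp API:

```python
# Hypothetical helper: check a Ray version string against the range
# that the raydp 1.5 release notes say is supported (2.1.0 - 2.2.0).
def ray_version_supported(version: str,
                          low: tuple = (2, 1, 0),
                          high: tuple = (2, 2, 0)) -> bool:
    """Return True if `version` falls within [low, high] inclusive."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return low <= parts <= high

print(ray_version_supported("2.2.0"))  # True: supported by raydp 1.5
print(ray_version_supported("2.3.0"))  # False: outside the supported range
```

This illustrates why the combination reported above (Ray 2.3.0 with raydp 1.5.0) falls outside the documented range.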

kira-lin commented 1 year ago

Also experiencing this issue, although ray_dist.jar exists in the jars directory. Running Ray 2.3.0 and raydp 1.5.0.

Are you experiencing the NPE (NullPointerException)? In that case, please either use Ray 2.2.0 or raydp-nightly, thanks.