oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.

raydp.init_spark error #66

Closed: valiantljk closed this issue 3 years ago

valiantljk commented 3 years ago

I'm following the example at: examples/pytorch/pytorch_nyctaxi.ipynb

I got this error when executing raydp.init_spark.
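For reference, the failing cell is essentially the sketch below; only the last three lines appear in the traceback, so the app_name and num_executors values here are placeholders I filled in.

import ray
import raydp

# Connect to (or start) a Ray instance first; the notebook may pass an address instead.
ray.init()

app_name = "NYC Taxi with RayDP"  # placeholder value
num_executors = 2                 # placeholder value
cores_per_executor = 1
memory_per_executor = "2GB"
spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)

The full traceback: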

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-cb3a24832a0a> in <module>
      4 cores_per_executor = 1
      5 memory_per_executor = "2GB"
----> 6 spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)

~/anaconda3/lib/python3.8/site-packages/raydp/context.py in init_spark(app_name, num_executors, executor_cores, executor_memory, configs)
    120             _global_spark_context = _SparkContext(
    121                 app_name, num_executors, executor_cores, executor_memory, configs)
--> 122             return _global_spark_context.get_or_create_session()
    123         else:
    124             raise Exception("The spark environment has inited.")

~/anaconda3/lib/python3.8/site-packages/raydp/context.py in get_or_create_session(self)
     67         if self._spark_session is not None:
     68             return self._spark_session
---> 69         spark_cluster = self._get_or_create_spark_cluster()
     70         self._spark_session = spark_cluster.get_spark_session(
     71             self._app_name,

~/anaconda3/lib/python3.8/site-packages/raydp/context.py in _get_or_create_spark_cluster(self)
     61         if self._spark_cluster is not None:
     62             return self._spark_cluster
---> 63         self._spark_cluster = RayCluster()
     64         return self._spark_cluster
     65 

~/anaconda3/lib/python3.8/site-packages/raydp/spark/ray_cluster.py in __init__(self)
     37         super().__init__(None)
     38         self._app_master_bridge = None
---> 39         self._set_up_master(None, None)
     40         self._spark_session: SparkSession = None
     41 

~/anaconda3/lib/python3.8/site-packages/raydp/spark/ray_cluster.py in _set_up_master(self, resources, kwargs)
     43         # TODO: specify the app master resource
     44         self._app_master_bridge = RayClusterMaster()
---> 45         self._app_master_bridge.start_up()
     46 
     47     def _set_up_worker(self, resources: Dict[str, float], kwargs: Dict[str, str]):

~/anaconda3/lib/python3.8/site-packages/raydp/spark/ray_cluster_master.py in start_up(self, popen_kwargs)
     48         self._set_properties()
     49         self._host = ray.services.get_node_ip_address()
---> 50         self._create_app_master(extra_classpath)
     51         self._started_up = True
     52 

~/anaconda3/lib/python3.8/site-packages/raydp/spark/ray_cluster_master.py in _create_app_master(self, extra_classpath)
    156         if self._started_up:
    157             return
--> 158         self._app_master_java_bridge.startUpAppMaster(extra_classpath)
    159 
    160     def get_master_url(self):

~/anaconda3/lib/python3.8/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

~/anaconda3/lib/python3.8/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o0.startUpAppMaster.
: java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
    at io.ray.api.id.JobId.fromInt(JobId.java:47)
    at io.ray.runtime.gcs.GcsClient.nextJobId(GcsClient.java:186)
    at io.ray.runtime.RayNativeRuntime.start(RayNativeRuntime.java:105)
    at io.ray.runtime.DefaultRayRuntimeFactory.createRayRuntime(DefaultRayRuntimeFactory.java:38)
    at io.ray.api.Ray.init(Ray.java:42)
    at io.ray.api.Ray.init(Ray.java:28)
    at org.apache.spark.deploy.raydp.AppMasterJavaBridge.startUpAppMaster(AppMasterJavaBridge.scala:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
ConeyLiu commented 3 years ago

Hi @valiantljk, it seems like this is caused by JDK 11. Are you using JDK 11?
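(ByteBuffer.flip() only acquired the covariant ByteBuffer return type in JDK 9, so a NoSuchMethodError like this usually means the Ray jars were compiled against a newer JDK than the one running the gateway.) A quick way to check which JDK the Spark/Py4J gateway will pick up, assuming it resolves java from JAVA_HOME or PATH as usual:

import os
import subprocess

print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
# "java -version" prints to stderr, so capture and print stderr.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr)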

valiantljk commented 3 years ago

Hi @ConeyLiu, I also found that the cause is related to the JDK version. I switched to 11, and the issue was resolved. But I got another issue when running raydp.init_spark:

spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/dr6jl/anaconda3/lib/python3.8/site-packages/ray/jars/ray_dist.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/dr6jl/anaconda3/lib/python3.8/site-packages/pyspark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2020-11-19 21:58:25,757 ERROR JniUtils [Thread-1]: Failed to set library path.
java.lang.NoSuchFieldException: sys_paths
    at java.base/java.lang.Class.getDeclaredField(Class.java:2569)
    at io.ray.runtime.util.JniUtils.resetLibraryPath(JniUtils.java:78)
    at io.ray.runtime.util.JniUtils.loadLibrary(JniUtils.java:51)
    at io.ray.runtime.RayNativeRuntime.<clinit>(RayNativeRuntime.java:65)
    at io.ray.runtime.DefaultRayRuntimeFactory.createRayRuntime(DefaultRayRuntimeFactory.java:34)
    at io.ray.api.Ray.init(Ray.java:42)
    at io.ray.api.Ray.init(Ray.java:28)
    at org.apache.spark.deploy.raydp.AppMasterJavaBridge.startUpAppMaster(AppMasterJavaBridge.scala:41)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:832)
2020-11-19 21:58:25 ERROR JniUtils:82 - Failed to set library path.
java.lang.NoSuchFieldException: sys_paths
    at java.base/java.lang.Class.getDeclaredField(Class.java:2569)
    at io.ray.runtime.util.JniUtils.resetLibraryPath(JniUtils.java:78)
    at io.ray.runtime.util.JniUtils.loadLibrary(JniUtils.java:51)
    at io.ray.runtime.RayNativeRuntime.<clinit>(RayNativeRuntime.java:65)
    at io.ray.runtime.DefaultRayRuntimeFactory.createRayRuntime(DefaultRayRuntimeFactory.java:34)
    at io.ray.api.Ray.init(Ray.java:42)
    at io.ray.api.Ray.init(Ray.java:28)
    at org.apache.spark.deploy.raydp.AppMasterJavaBridge.startUpAppMaster(AppMasterJavaBridge.scala:41)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:832)
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1119 21:58:25.764730 95790 249954304 global_state_accessor.cc:25] Redis server address = 192.168.1.6:6379, is test flag = 0
I1119 21:58:25.766582 95790 249954304 redis_client.cc:146] RedisClient connected.
I1119 21:58:25.768702 95790 249954304 redis_gcs_client.cc:89] RedisGcsClient Connected.
I1119 21:58:25.769770 95790 249954304 service_based_gcs_client.cc:193] Reconnected to GCS server: 192.168.1.6:49821
I1119 21:58:25.770025 95790 249954304 service_based_accessor.cc:92] Reestablishing subscription for job info.
I1119 21:58:25.770033 95790 249954304 service_based_accessor.cc:422] Reestablishing subscription for actor info.
I1119 21:58:25.770040 95790 249954304 service_based_accessor.cc:797] Reestablishing subscription for node info.
I1119 21:58:25.770045 95790 249954304 service_based_accessor.cc:1073] Reestablishing subscription for task info.
I1119 21:58:25.770049 95790 249954304 service_based_accessor.cc:1248] Reestablishing subscription for object locations.
I1119 21:58:25.770052 95790 249954304 service_based_accessor.cc:1368] Reestablishing subscription for worker failures.
I1119 21:58:25.770058 95790 249954304 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/dr6jl/anaconda3/lib/python3.8/site-packages/pyspark/jars/spark-unsafe_2.12-3.0.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2020-11-19 21:58:26 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/Cellar/apache-spark/3.0.1/libexec/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" org.apache.spark.SparkException: Master must either be yarn or start with spark, mesos, k8s, or local
    at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:936)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:238)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dr6jl/anaconda3/lib/python3.8/site-packages/raydp/context.py", line 122, in init_spark
    return _global_spark_context.get_or_create_session()
  File "/Users/dr6jl/anaconda3/lib/python3.8/site-packages/raydp/context.py", line 70, in get_or_create_session
    self._spark_session = spark_cluster.get_spark_session(
  File "/Users/dr6jl/anaconda3/lib/python3.8/site-packages/raydp/spark/ray_cluster.py", line 72, in get_spark_session
    spark_builder.appName(app_name).master(self.get_cluster_url()).getOrCreate()
  File "/Users/dr6jl/anaconda3/lib/python3.8/site-packages/pyspark/sql/session.py", line 186, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/dr6jl/anaconda3/lib/python3.8/site-packages/pyspark/context.py", line 371, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/dr6jl/anaconda3/lib/python3.8/site-packages/pyspark/context.py", line 128, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/dr6jl/anaconda3/lib/python3.8/site-packages/pyspark/context.py", line 320, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/dr6jl/anaconda3/lib/python3.8/site-packages/pyspark/java_gateway.py", line 105, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
ConeyLiu commented 3 years ago

It should work fine with JDK 1.8. I am using 1.8.0_171, and the Bazel version is 3.2.0.
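If you have several JDKs installed, one way to pin the JVM that PySpark launches is to set JAVA_HOME before anything imports pyspark; the path below is only an example for a macOS JDK 8 install, so substitute your own location:

import os

# Set this before pyspark/raydp are imported, since the Java gateway
# is launched with whatever JAVA_HOME points at.
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"

import raydp  # safe to import once JAVA_HOME is set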

ConeyLiu commented 3 years ago

Our patch for Ray has already been merged into Ray upstream. I will upgrade the Ray version, and then those problems should go away.

valiantljk commented 3 years ago

> Our patch for Ray has already been merged into Ray upstream. I will upgrade the Ray version, and then those problems should go away.

Thanks, that would be great!