jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Unable to execute pyspark ipynb file using pytest-ipynb package #799

Closed aniket02k closed 4 years ago

aniket02k commented 4 years ago

Description

Hi, I have created a Spark Python ipynb file through the JupyterHub UI, in which I've added an example that writes to HDFS. I am able to execute the example through the UI. However, when I try to execute the same ipynb file with the pytest-ipynb package using the command `!pytest -v /home/aniket/mnt/test.ipynb`, I observe the error below:

Traceback:

```
Py4JJavaError                             Traceback (most recent call last)
<ipython-input> in <module>
      1 "test_hdfs_access"
      2 from pyspark.sql import SparkSession
----> 3 spark = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
      4 data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
      5 df = spark.createDataFrame(data)

/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py in getOrCreate(self)
    171             for key, value in self._options.items():
    172                 sparkConf.set(key, value)
--> 173             sc = SparkContext.getOrCreate(sparkConf)
    174             # This SparkContext may be an existing one.
    175             for key, value in self._options.items():

/opt/spark/python/lib/pyspark.zip/pyspark/context.py in getOrCreate(cls, conf)
    365         with SparkContext._lock:
    366             if SparkContext._active_spark_context is None:
--> 367                 SparkContext(conf=conf or SparkConf())
    368             return SparkContext._active_spark_context
    369

/opt/spark/python/lib/pyspark.zip/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    134         try:
    135             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
--> 136                           conf, jsc, profiler_cls)
    137         except:
    138             # If an error occurs, clean up in order to allow future SparkContext creation:

/opt/spark/python/lib/pyspark.zip/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
    196
    197         # Create the Java SparkContext through Py4J
--> 198         self._jsc = jsc or self._initialize_context(self._conf._jconf)
    199         # Reset the SparkConf to the one actually used by the SparkContext in JVM.
    200         self._conf = SparkConf(_jconf=self._jsc.sc().conf())

/opt/spark/python/lib/pyspark.zip/pyspark/context.py in _initialize_context(self, jconf)
    304         Initialize SparkContext in function to allow subclass specific initialization
    305         """
--> 306         return self._jvm.JavaSparkContext(jconf)
    307
    308     @classmethod

/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1523         answer = self._gateway_client.send_command(command)
   1524         return_value = get_return_value(
-> 1525             answer, self._gateway_client, None, self._fqn)
   1526
   1527         for temp_arg in temp_args:

/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.
The currently running SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
	at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2483)
	at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2$adapted(SparkContext.scala:2479)
	at scala.Option.foreach(Option.scala:274)
	at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2479)
	at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2568)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:85)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```

Example which I am trying to run:

![image](https://user-images.githubusercontent.com/24222712/78635508-04f04600-78c4-11ea-98e7-f26208d2d57a.png)

**Is there another way to execute the pyspark ipynb file using any API or command?**

## Environment

- Enterprise Gateway Version: 2.0.0rc1
- Platform: Kubernetes
- Jupyter version: 0.9.6
- nb2kg Version: 0.6.0
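For context on "another way to execute the ipynb file": a notebook file is plain JSON (nbformat v4), which is what tools like papermill, nbclient, and `jupyter nbconvert --execute` build on. As a minimal, hedged sketch (parsing only, no kernel execution; the file path in the usage note is hypothetical), the code cells can be read with the standard library alone:

```python
import json

def read_code_cells(path):
    """Parse an .ipynb file (plain JSON, nbformat v4) and return the
    source of each code cell as one string per cell."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    return [
        "".join(cell["source"])          # in v4, source is a list of lines
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
```

Usage would look like `read_code_cells("/home/aniket/mnt/test.ipynb")`; a real executor additionally starts a kernel and sends each cell to it, which is exactly what papermill and pytest-ipynb do.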
kevin-bates commented 4 years ago

Hi @aniket02k - thank you for opening this issue.

Although you list Enterprise Gateway and nb2kg, I'm not sure if you're conflating the presence of "Gateway" in the traceback with Enterprise Gateway. The py4j.Gateway entries are part of Spark's Py4J bridge and are not related to EG.

Which kernel image are you using for the kernel you're launching in kubernetes? I would recommend using elyra/kernel-spark-py or a derivation thereof for work in Spark since the launcher will automatically create the SparkContext for you.
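One defensive pattern in notebook code, sketched below, is to reuse a session the launcher already placed in the kernel namespace instead of unconditionally building a new one. This is a hedged illustration: the `spark` variable name and the launcher's auto-creation behavior are assumptions based on the elyra images, and the fallback path requires pyspark on the path.

```python
def get_or_reuse_spark(namespace):
    """Prefer a SparkSession the kernel launcher already created
    (assumed to be bound to the name 'spark' in `namespace`, which is
    typically globals() inside the notebook); otherwise build one."""
    existing = namespace.get("spark")
    if existing is not None:
        return existing
    # Fallback: only import pyspark when we actually need to build a session.
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
```

In a notebook cell this would be called as `spark = get_or_reuse_spark(globals())`, avoiding a second `SparkContext` when one already exists.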

Also, is this issue only happening when shell escaping `!pytest -v /home/aniket/mnt/test.ipynb`, yet does not occur when running the same code within the notebook cell?

Now that 2.0.0 (and 2.1.0) are available, I would recommend moving to one of those. In addition, if your Notebook server is >= 6.0, NB2KG functionality is built into Notebook and the separate package is no longer necessary. Simply invoke Notebook with `--gateway-url=<URL to EG instance>` (among other options, if necessary) and you're good to go.
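As a concrete sketch of that invocation (the hostname and port are placeholders, not values from this deployment):

```shell
# Notebook >= 6.0 includes the gateway client; point it at a running EG instance.
jupyter notebook --gateway-url=http://eg.example.com:8888
```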

suryag10 commented 4 years ago

[Surya] Thanks for the reply, Kevin. Answers inline.

> Hi @aniket02k - thank you for opening this issue.
>
> Although you list Enterprise Gateway and nb2kg, I'm not sure if you're conflating the presence of Gateway in the traceback information with Enterprise Gateway. The py4j.Gateway stuff is a Spark thing and nothing related to EG.
>
> Which kernel image are you using for the kernel you're launching in kubernetes? I would recommend using elyra/kernel-spark-py or a derivation thereof for work in Spark since the launcher will automatically create the SparkContext for you.

[Surya] Yes, we are using a derivative of the same.

> Also, is this issue only happening when shell escaping `!pytest -v /home/aniket/mnt/test.ipynb`, yet does not occur when running the same code within the notebook cell?

[Surya] Correct, it happens only with pytest.

> Now that 2.0.0 (and 2.1.0) is available, I would recommend moving to that. In addition, if your Notebook server is >= 6.0, NB2KG is built into Notebook and is no longer necessary. Simply invoke Notebook with `--gateway-url=<URL to EG instance>` (among other options if necessary) and you're good to go.

[Surya] We are planning to upgrade to the same.

aniket02k commented 4 years ago

> Also, is this issue only happening when shell escaping `!pytest -v /home/aniket/mnt/test.ipynb`, yet does not occur when running the same code within the notebook cell?

[Aniket] This issue is also observed when I tried to run the pyspark notebook using the papermill package:

```
!papermill /home/aniket/mnt/test.ipynb /opt/spark/work-dir/output.ipynb -p a 9 -k python3
```

After running the above command, I observed the same traceback.

kevin-bates commented 4 years ago

> [Aniket] This issue is also observed when I tried to run the pyspark notebook using the papermill package:
>
> `!papermill /home/aniket/mnt/test.ipynb /opt/spark/work-dir/output.ipynb -p a 9 -k python3`
>
> After running the above command, I observed the same traceback.

Thanks @aniket02k. Yeah, my feeling is that this is an environmental issue within the Spark setup, particularly since it can be reproduced without Enterprise Gateway entirely. As such, I'm inclined to close this issue.

I find shell-escaping out of a cell to run pytest very strange anyway. I suspect this is causing conflicts and confusion between Spark and the "parent" context (from which the shell escape is taking place).
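The suspected conflict can be made concrete: a shell escape spawns a child process that inherits the kernel's environment, so Spark/Py4J-related variables can leak into the pytest or papermill run and point it at state belonging to the parent. The sketch below is a stdlib-only illustration of that inheritance and of stripping such variables before spawning; `PYSPARK_GATEWAY_PORT` and the `PYSPARK_`/`SPARK_` prefixes are illustrative assumptions about this deployment, not confirmed by the thread.

```python
import os
import subprocess
import sys

def run_without_spark_env(cmd):
    """Run `cmd` in a child process with Spark/Py4J-related variables
    removed, so the child cannot latch onto the parent's Spark state.
    (Variable prefixes are illustrative; adjust for your deployment.)"""
    env = {
        k: v for k, v in os.environ.items()
        if not k.startswith(("PYSPARK_", "SPARK_"))
    }
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

if __name__ == "__main__":
    # A plain child process inherits our environment...
    os.environ["PYSPARK_GATEWAY_PORT"] = "12345"
    probe = [sys.executable, "-c",
             "import os; print('PYSPARK_GATEWAY_PORT' in os.environ)"]
    inherited = subprocess.run(probe, capture_output=True, text=True)
    # ...while the filtered runner does not pass the variable along.
    filtered = run_without_spark_env(probe)
    print(inherited.stdout.strip(), filtered.stdout.strip())  # True False
```

This does not make running pytest from a cell a good idea, but it shows one mechanism by which the "parent" context could bleed into the child run.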

kevin-bates commented 4 years ago

Based on the information (and lack of response), I'm going to close this issue. We can re-open if that proves necessary.