bernhard-42 opened 6 years ago
Merging #40 into master will decrease coverage by 25.82%. The diff coverage is 35.13%.
@@            Coverage Diff            @@
##           master      #40     +/-   ##
===========================================
- Coverage   96.61%   70.78%   -25.83%
===========================================
  Files           3        4        +1
  Lines          59       89       +30
  Branches        5       10        +5
===========================================
+ Hits           57       63        +6
- Misses          2       26       +24
Impacted Files | Coverage Δ | |
---|---|---|
src/jupyter_spark/magic.py | 0% <0%> (ø) | |
src/jupyter_spark/spark.py | 100% <100%> (ø) | :arrow_up: |
src/jupyter_spark/handlers.py | 100% <100%> (ø) | :arrow_up: |
src/jupyter_spark/__init__.py | 44.44% <25%> (-15.56%) | :arrow_down: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34ab4bf...38ed34a. Read the comment docs.
Thanks for the contribution! I hope to have a deeper look early next week.
I have changed the code accordingly. I personally don't really like using internal APIs, but I understand your rationale; I marked it with a TODO.
A side note: if you work on a Hadoop cluster (as I do, hence the YARN stuff last time), polling uiWebUrl means hitting the Resource Manager twice a second. If many users do this at the same time, this can create quite some traffic. A less chatty approach might be to use sc.statusTracker in a background thread in the notebook, triggered by Jupyter cell hooks, and to communicate the status to the notebook JavaScript via the Jupyter comm layer - just an idea ...
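A rough sketch of that idea, assuming an ipykernel-based kernel; the comm target name `spark_progress` and the message format are placeholders, and the frontend would need to register a matching comm target:

```python
# Sketch: poll sc.statusTracker() in a daemon thread and push the
# status of active jobs to the frontend over a Jupyter comm, instead of
# polling the Spark UI (and hence the Resource Manager) over HTTP.
import threading
import time

from ipykernel.comm import Comm


def start_progress_reporter(sc, interval=0.5):
    # The frontend must have registered a comm target with this name.
    comm = Comm(target_name="spark_progress")

    def poll():
        tracker = sc.statusTracker()
        while True:
            jobs = []
            for job_id in tracker.getActiveJobsIds():
                info = tracker.getJobInfo(job_id)
                if info is not None:
                    jobs.append({"jobId": info.jobId, "status": info.status})
            comm.send({"activeJobs": jobs})
            time.sleep(interval)

    thread = threading.Thread(target=poll, daemon=True)
    thread.start()
    return thread
```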
Thanks. I'm sorry -- I think I wasn't clear earlier. If you grab the Spark context from the singleton, then the magic is completely optional in the common case. You would only need to use the magic if you explicitly want to set the URL. Would you mind updating this so the magic is optional (and users can continue working as they have been, unless this additional complexity is needed for them)?
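A sketch of the fallback being described, assuming the singleton in question is pyspark's internal `SparkContext._active_spark_context` (presumably the internal API flagged with a TODO above); the helper name is hypothetical:

```python
# Only use an explicitly configured URL (e.g. set via the magic) if
# present; otherwise fall back to the active SparkContext singleton.
# Note: _active_spark_context is an internal pyspark API and may change.
from pyspark import SparkContext


def resolve_spark_ui_url(explicit_url=None):
    """Hypothetical helper: prefer an explicit URL over the singleton."""
    if explicit_url is not None:
        return explicit_url
    sc = SparkContext._active_spark_context
    if sc is not None:
        return sc.uiWebUrl  # Spark UI URL, available since Spark 2.1
    return None
```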
@bernhard-42 : Hope I didn't scare you off by creating confusion. Your contribution is very much appreciated.
No worries - first I didn't have time and then I forgot about it ... I hope it now meets your expectations. If not, please feel free to accept it and adapt it as you need - that might actually be the faster process. I am happy either way.
@mdboom Any news regarding this? Or any other alternative solution for working with this extension on multiple tabs (each with a different Spark context and kernel)?
Is there any update on these changes getting pulled into the main project, or updates otherwise? This functionality would be very, very useful and the lack of it is a major block to using this extension.
Proposal for Issue 22:
In the Jupyter notebook, a Jupyter comm target is opened to listen for messages from the Python kernel. A new Jupyter magic uses this comm target to forward the Spark API URL to the notebook:

`%spark_progress spark`

where `spark` is the variable holding the Spark session, so the magic can use `globals()["spark"].sparkContext.uiWebUrl` to get the actual Spark API URL. Each call from the notebook JavaScript then forwards the Spark API URL as the query parameter `spark_url` to the backend handler, which uses it to build the backend URL. This allows for multiple SparkContexts in different tabs, and even works with the `spark.ui.port=0` setting.
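For illustration, a minimal sketch of the kernel side of this proposal, assuming IPython magics and an ipykernel comm; the comm target name, message keys, and class names are placeholders rather than the extension's actual API:

```python
# Sketch of the proposed %spark_progress magic: look up the named Spark
# session in the notebook namespace and forward its UI URL to the
# frontend over a Jupyter comm.
from ipykernel.comm import Comm
from IPython.core.magic import Magics, line_magic, magics_class


@magics_class
class SparkProgressMagics(Magics):
    @line_magic
    def spark_progress(self, line):
        """Usage: %spark_progress spark"""
        name = line.strip()
        session = self.shell.user_ns.get(name)
        if session is None:
            print("No variable named %r in the notebook namespace" % name)
            return
        url = session.sparkContext.uiWebUrl
        # The frontend side would register this comm target and pass the
        # URL on as the spark_url query parameter to the backend handler.
        comm = Comm(target_name="spark_progress")
        comm.send({"spark_url": url})


def load_ipython_extension(ipython):
    ipython.register_magics(SparkProgressMagics)
```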