krishnan-r / sparkmonitor

Monitor Apache Spark from Jupyter Notebook
https://krishnan-r.github.io/sparkmonitor/
Apache License 2.0

Does this support multiple Spark notebooks? #8

Open AbdealiLoKo opened 6 years ago

AbdealiLoKo commented 6 years ago

The architecture at https://krishnan-r.github.io/sparkmonitor/how.html#the-notebook-webserver-extension---a-spark-web-ui-proxy seems to suggest that if I run multiple notebooks using Spark, it's not going to work, as only port 4040 will be proxied.

krishnan-r commented 6 years ago

No, not in the way it currently works. The original intention was to integrate this with a notebook service at CERN, where only one instance of Spark would run inside a Docker container. So at the moment only one notebook is supported. I think it can be made to support multiple notebooks with some changes. Would you be interested in collaborating on this?

AbdealiLoKo commented 6 years ago

Sure - I can collaborate. What do you think would be the best way to implement this?

SparkMonitorHandler's get() is the only place we need to know the correct port; hence, we need it available when handling the GET request to the /sparkmonitor endpoint.

It may be easier to create a new "kernel" called SparkMonitorKernel which creates the pyspark session, sets the correct configurations, and gets the SparkUI port.
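As a rough sketch of that idea, assuming pyspark is installed (the function and app names here are illustrative, not part of SparkMonitor):

from pyspark.sql import SparkSession

def start_monitored_session():
    # The kernel owns the session, so it knows the UI address up front.
    spark = SparkSession.builder.appName("SparkMonitorKernel").getOrCreate()
    # uiWebUrl is a SparkContext property, e.g. "http://host:4040"
    return spark, spark.sparkContext.uiWebUrl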

krishnan-r commented 6 years ago

One suggestion would be to generate a random free port number when the kernel extension creates the configuration object, and set the property spark.ui.port on the conf object. This way each Spark instance runs the UI on a different port and the extension knows it.

We could default to the environment variable SPARKMONITOR_UI_PORT, or to 4040, when a fixed port is declared (which is how it currently works), so that applications that require one still work (such as when running behind a proxy, maybe).
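A minimal sketch of these two suggestions, assuming the kernel extension is the one building the SparkConf (find_free_port and make_conf are hypothetical helpers, not existing SparkMonitor code):

import os
import socket

from pyspark import SparkConf

def find_free_port():
    # Binding to port 0 asks the OS for any unused TCP port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def make_conf():
    # Honour a declared fixed port, else pick a random free one.
    ui_port = int(os.environ.get("SPARKMONITOR_UI_PORT", 0)) or find_free_port()
    conf = SparkConf()
    conf.set("spark.ui.port", str(ui_port))
    return conf, ui_port

There is a small race between picking the port and Spark binding to it, but Spark retries on successive ports by default (spark.port.maxRetries), which papers over collisions.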

The kernel could alert the server extension of this in some way, possibly through the front-end, which could add a GET parameter to its requests or send a message beforehand.
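One hypothetical shape for the GET-parameter variant (the handler below is a simplified stand-in for the real SparkMonitorHandler, and the port/path parameter names are made up):

from tornado import httpclient, web

class SparkMonitorHandler(web.RequestHandler):
    async def get(self):
        # e.g. GET /sparkmonitor?port=4041&path=/api/v1/applications,
        # where the front-end learned the port from the kernel.
        port = self.get_query_argument("port", default="4040")
        path = self.get_query_argument("path", default="/")
        client = httpclient.AsyncHTTPClient()
        response = await client.fetch("http://localhost:%s%s" % (port, path))
        self.write(response.body)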

In my experience, accessing the SparkContext object from the kernel extension causes some nasty errors. For example, calling getOrCreate from an extension might end up creating a session without the user starting one. We do not know when the user creates a session or what it is named in the Python namespace. As such, we do not touch the SparkContext at all, giving the user the freedom to start/stop and configure it as required.

About creating a custom kernel:

PerilousApricot commented 5 years ago

Hello, I've begun work on allowing the notebook to directly query the SparkContext for the web UI URL:

http://apache-spark-developers-list.1001551.n3.nabble.com/SparkContext-singleton-get-w-o-create-td24653.html

This would let users use this extension without any special configuration or modification of their scripts. We can poll the singleton periodically to see when the context starts, then query it for the correct web UI URL.
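A sketch of that polling approach, assuming access to the private SparkContext._active_spark_context singleton (the linked thread is about exposing this more cleanly); watch_for_context is an illustrative helper:

import threading

from pyspark import SparkContext

def watch_for_context(callback, interval=5.0):
    def poll():
        sc = SparkContext._active_spark_context  # None until the user starts one
        if sc is not None:
            callback(sc.uiWebUrl)  # e.g. "http://host:4041"
        else:
            threading.Timer(interval, poll).start()
    poll()

Crucially, this never calls getOrCreate, so it cannot accidentally create a session on the user's behalf.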

Ben-Epstein commented 4 years ago

Hello, I've modified SparkMonitor to work with multiple Spark sessions here: https://github.com/ben-epstein/sparkmonitor

@krishnan-r If you'd like to merge it into your repo just let me know.

For anyone interested in using it, you can install it with

pip install sparkmonitor-s==0.0.13
jupyter nbextension install sparkmonitor --py --user --symlink 
jupyter nbextension enable sparkmonitor --py --user            
jupyter serverextension enable --py --user sparkmonitor
ipython profile create && echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" >>  $(ipython profile locate default)/ipython_kernel_config.py

If you've already installed the original sparkmonitor, you're going to have to fully remove it, along with its Jupyter nbextension and serverextension, before enabling this one (though I'm not actually sure of the exact steps). If you're running it in a Docker image, just rebuild with the new pip module; if you're running locally, I'm unsure. If you just want to test it, you can clone the repo and run

docker build -t sparkmonitor .
docker run -it -p 8888:8888 sparkmonitor