swan-cern / sparkmonitor

An extension for Jupyter Lab & Jupyter Notebook to monitor Apache Spark (pyspark) from notebooks
https://pypi.org/project/sparkmonitor/
Apache License 2.0

Explore an alternative approach to Spark UI Proxy #3

Open krishnan-r opened 3 years ago

krishnan-r commented 3 years ago

Explore using https://github.com/jupyterhub/jupyter-server-proxy or another generic approach to provide the Spark UI through a proxy.

The current approach is brittle: it works only on localhost and the URL is hardcoded. (It is currently removed in the refactor in #1 and will be added back.)

In our current deployment, we rely on https://github.com/swan-cern/jupyter-extensions/tree/master/SparkConnector as an external link (this requires the client to be on the same network as the Spark UI).

berglh commented 5 months ago

I just want to chime in on this one. After seeing this issue, I used the jupyter-server-proxy extension along with the jupyter-app-launcher extension to create a launcher button that opens the Spark UI in a JupyterLab workspace tab.
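For anyone wanting to reproduce this, a minimal sketch of the jupyter-server-proxy side is below. The `spark-ui` name, the fixed port 4040, and the launcher title are illustrative assumptions, not our exact config; and if I recall correctly, omitting the `command` key tells jupyter-server-proxy not to manage a process and just proxy to an already-running server.

```python
# jupyter_server_config.py -- minimal sketch, assuming a Spark UI already
# listening on localhost:4040 inside the same container as JupyterLab.
c.ServerProxy.servers = {
    "spark-ui": {
        # No "command" key: jupyter-server-proxy does not start a process,
        # it only proxies requests through to the port below.
        "port": 4040,
        "absolute_url": False,
        "launcher_entry": {
            "title": "Spark UI",  # label shown on the JupyterLab launcher tile
        },
    }
}
```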

JupyterLab 3.6.2, Spark 3.4.0

This worked pretty well and we could see most of the UI. The only thing not working was the executors page. Even when I configured the Spark UI to use the appropriate base path, the setting didn't seem to propagate through correctly, so the executor details page failed to load:

```
spark.ui.proxyBase: /proxy/4040
```
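The same setting can also be applied when the session is created. A rough PySpark sketch (the app name is hypothetical, and `/proxy/4040` assumes jupyter-server-proxy's default `/proxy/<port>` route for the first session):

```python
from pyspark.sql import SparkSession

# Sketch: point the Spark UI at the path jupyter-server-proxy serves it from.
# Assumes the default /proxy/<port> route and the first session (port 4040).
spark = (
    SparkSession.builder
    .appName("proxy-base-example")  # hypothetical app name
    .config("spark.ui.proxyBase", "/proxy/4040")
    .getOrCreate()
)
```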

If I then forward the port from the Kubernetes pod and connect directly to the Spark UI, everything works as expected, so there is some minor unexpected behaviour between jupyter-server-proxy and the Spark UI. I suspect this is unexpected behaviour in the Spark UI itself and may be fixed in newer versions; I'm in the process of testing Spark 3.5.0.

The other thing to note, as with the same issue in the sparkmonitor UI connector, is that it is possible to start multiple Spark sessions from a single JupyterLab notebook. In this case, the UI port increments monotonically from 4040 to 4041 and onwards, and the app launcher icon fails to connect to any of the additional instances. Our master PySpark instances run in the same container/pod as JupyterLab, rather than spawning the master in a new Kubernetes pod, and this resulted in a half-working solution (better than no Spark UI).

There does appear to be an attribute on the SparkContext to get the URL of the UI: spark.sparkContext.uiWebUrl. The problem in the context of Docker is that it returns the container ID as the host address, which is not routable in a development environment. My guess is there will be some cases where it's not possible to proxy the UI via jupyter-server-proxy reliably, depending on the network configuration and environment of the Spark cluster.
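That said, the port in uiWebUrl is usually accurate even when the host isn't, so one workaround is to parse out the port and rebuild the link against the `/proxy/<port>` route; this also copes with the incrementing 4041, 4042, ... ports from multiple sessions. A rough sketch, assuming the driver runs alongside JupyterLab and the server is mounted at the root URL (a JupyterHub base-URL prefix would need to be prepended):

```python
from urllib.parse import urlparse
from IPython.display import HTML, display

def spark_ui_proxy_link(spark):
    """Build a /proxy/<port>/ link from the driver's reported UI URL.

    The host in uiWebUrl may be an unroutable container hostname, but the
    port reflects the actual UI port (4040, 4041, ...), which is reachable
    on localhost when the driver shares the JupyterLab container/pod.
    """
    port = urlparse(spark.sparkContext.uiWebUrl).port
    return f"/proxy/{port}/jobs/"

# Usage: render a clickable link in the notebook output.
# display(HTML(f'<a href="{spark_ui_proxy_link(spark)}">Spark UI</a>'))
```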