jupyterhub / zero-to-jupyterhub-k8s

Helm Chart & Documentation for deploying JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io

Jupyterhub with Remote Spark Master NOT WORKING #1220

Closed ramkrishnan8994 closed 4 years ago

ramkrishnan8994 commented 5 years ago

Hi, I'm using JupyterHub v0.7 and deploying it with the Helm chart. We have a Spark cluster running on the same Kubernetes cluster as JupyterHub, and we use the 'all-spark-notebook' image from docker-stacks as the single-user image.

I know that one of the requirements for Jupyter to run using the docker-stacks images is that hostNetwork has to be set to true.

Now, if I set hostNetwork to true, I can't spawn more than one Jupyter user instance, because port 8888 is already taken by the first instance; every new instance fails with a port conflict on 8888.

If I instead set hostNetwork to false, I am able to spawn multiple user instances, BUT, since we connect the notebooks to a remote Spark master, the Spark master/cluster is not able to resolve the hostname of the user's Jupyter pod (which acts as the Spark driver). This is the error on the Spark master:

Caused by: java.io.IOException: Failed to connect to jupyter-doe-xxxxx:39003
Caused by: java.net.UnknownHostException: jupyter-doe-xxxxx

jupyter-doe-xxxxx is the name of the pod that is spawned for the user doe.

This is a link to a similar issue faced in DockerSpawner: https://github.com/jupyter/docker-stacks/issues/187#issuecomment-212448091

How can we solve this issue? We want all the applications to connect to a Remote Spark Cluster
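[Editor's note for background on the failure above: the master and executors connect back to the driver at whatever address the driver advertises via spark.driver.host, which defaults to the local hostname — here the pod name, which has no DNS entry outside the pod. In spark-defaults.conf terms (the master URL is a placeholder):]

```
# spark-defaults.conf (sketch; the master URL is a placeholder)
spark.master       spark://spark-master.example:7077
# spark.driver.host defaults to the local hostname -- here the pod name,
# which the remote cluster cannot resolve, hence the UnknownHostException.
spark.driver.host  jupyter-doe-xxxxx
```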

ramkrishnan8994 commented 5 years ago

@consideRatio @manics Any solutions to this?

manics commented 5 years ago

Sounds like this issue is related to https://github.com/jupyterhub/kubespawner/issues/299

kevin-bates commented 5 years ago

One solution would be to insert Jupyter Enterprise Gateway (EG) between your spawned Notebook servers and your kernels. EG would then launch your spark-based kernels in "cluster mode" via k8s-spark - with a pod dedicated to the driver and executors. It will also launch vanilla kernels across the k8s cluster as well - each kernel in its own pod.

See this blog post by @lresende for setting this up.

ramkrishnan8994 commented 5 years ago

> One solution would be to insert Jupyter Enterprise Gateway (EG) between your spawned Notebook servers and your kernels. EG would then launch your spark-based kernels in "cluster mode" via k8s-spark - with a pod dedicated to the driver and executors. It will also launch vanilla kernels across the k8s cluster as well - each kernel in its own pod.
>
> See this blog post by @lresende for setting this up.

Thanks @kevin-bates, but we were looking for a simpler solution that doesn't involve adding more components.

fbalicchia commented 4 years ago

Hi, in case this is useful to someone: following the suggestions from @consideRatio and @abinet in the linked thread, the problem is solved by adding the snippet to extraConfig: in the hub definition and adding echo "spark.driver.host $MY_POD_IP" >> "/usr/local/spark/conf/spark-defaults.conf"; to lifecycleHooks in the singleuser definition. It may be seen as an ugly approach, but at the time of writing it works for me.
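[Editor's note: the comment above can be sketched as chart values roughly as follows. This is a hypothetical sketch, not the commenter's exact configuration: the extraConfig snippet from the linked thread is not reproduced here, and MY_POD_IP is assumed to be exposed to the container through the Kubernetes downward API.]

```yaml
# values.yaml sketch for the zero-to-jupyterhub chart (assumptions noted above)
singleuser:
  extraEnv:
    # Assumption: expose the pod's IP to the container via the downward API
    MY_POD_IP:
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  lifecycleHooks:
    postStart:
      exec:
        command:
          - "sh"
          - "-c"
          - >-
            echo "spark.driver.host $MY_POD_IP"
            >> /usr/local/spark/conf/spark-defaults.conf
```

With this in place, Spark reads spark.driver.host from spark-defaults.conf at startup and advertises the pod IP instead of the unresolvable pod hostname.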

consideRatio commented 4 years ago

Thank you everyone for helping out. I'm now closing this issue, as I can identify no concrete action to take in this GitHub repository!

pslijkhuis commented 2 years ago

https://github.com/jupyterhub/kubespawner/pull/229

https://github.com/alagrede/jupyter-spark/blob/master/python3/spark-example.ipynb

This fixed it for me: translate the driver hostname to an IP.
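[Editor's note: the fix in those links boils down to resolving the driver's hostname to an IP address and handing that to Spark as spark.driver.host. A minimal sketch (the SparkSession line is shown commented out because it needs a live cluster):]

```python
import socket

def driver_host_ip(hostname=None):
    """Resolve a hostname (the pod's own by default) to an IP address.

    The IP, unlike the bare pod hostname, is reachable from the remote
    Spark cluster, so it can be used as spark.driver.host.
    """
    hostname = hostname or socket.gethostname()
    return socket.gethostbyname(hostname)

# Hand the IP to Spark instead of the unresolvable pod hostname, e.g.:
# SparkSession.builder.config("spark.driver.host", driver_host_ip())
print(driver_host_ip("localhost"))  # → 127.0.0.1
```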