jupyterhub / kubespawner

Kubernetes spawner for JupyterHub
https://jupyterhub-kubespawner.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Support random port assignment when c.KubeSpawner.port = 0 #299

Open qzchenwl opened 5 years ago

qzchenwl commented 5 years ago

According to the documentation, KubeSpawner will use a randomly allocated port. I deployed zero-to-jupyterhub-k8s (with hostNetwork: true set for Spark) and got an error when some users logged in: [Warning] 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 Insufficient memory, 2 node(s) didn't have free ports for the requested pod ports.

That's because KubeSpawner always uses port 8888 instead of a random port: https://github.com/jupyterhub/kubespawner/blob/master/kubespawner/spawner.py#L145

https://jupyterhub-kubespawner.readthedocs.io/en/latest/spawner.html

config c.KubeSpawner.port = Int(0) The port for single-user servers to listen on.

Defaults to 0, which uses a randomly allocated port number each time.

If set to a non-zero value, all Spawners will use the same port, which only makes sense if each server is on a different address, e.g. in containers.

New in version 0.7.
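
For reference, this is the entire configuration the docs suggest should be needed (a minimal jupyterhub_config.py sketch; c is the config object JupyterHub provides):

```python
# jupyterhub_config.py
# Per the docs, 0 should mean "randomly allocated port per server",
# but as described above, the pods still end up listening on 8888.
c.KubeSpawner.port = 0
```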

frouzbeh commented 5 years ago

@qzchenwl Hi, have you resolved this issue yet? Here they suggested using '--port=%i' % port for the docker command. I was thinking maybe we can pass it to the hub's extra config so it chooses the port randomly.

qzchenwl commented 5 years ago

@frouzbeh Not yet. The solution for DockerSpawner is not suitable for KubeSpawner: DockerSpawner runs on the local machine, so it can find a usable port before it starts the container. With KubeSpawner, you don't know which host the container will be assigned to, hence you don't know which port is usable beforehand.
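
For context, the trick a local spawner can rely on looks roughly like this (a sketch of the general technique, not DockerSpawner's actual code):

```python
import socket

def random_free_local_port() -> int:
    """Ask the OS kernel for a currently-free TCP port on *this* machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 = let the kernel pick an unused port
        return s.getsockname()[1]

print(random_free_local_port())
```

The result is only meaningful on the machine that ran the bind, which is exactly what breaks down once the scheduler may place the pod on any node.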

frouzbeh commented 5 years ago

@qzchenwl Well that's an issue and I hope somebody will take care of it.

frouzbeh commented 5 years ago

@minrk, @yuvipanda Hi, do you have any comments or a solution for this issue?

yuvipanda commented 5 years ago

Traditionally, when JupyterHub tries to find a 'random port', it finds a random port that is unused on the machine the JupyterHub process is running on. That doesn't work here, since you'd need to find a random port that isn't in use on any of the machines in the cluster. I'm not entirely sure how to do that in a clean way.

Are there ways to run spark that don't require hostNetwork? Can the pod network range be directly reachable from spark?

frouzbeh commented 5 years ago

@yuvipanda thanks, I'm not an expert on Spark, but I can ask our administrator to see if it's possible. But can't we provide a range of ports so KubeSpawner selects randomly from that range?
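
Something along these lines might approximate a port range with JupyterHub's pre_spawn_hook (a hypothetical sketch: the range is made up, and nothing checks that the chosen port is actually free on whichever node the pod lands on):

```python
# jupyterhub_config.py
import random

HOST_PORT_RANGE = range(31000, 32000)  # hypothetical range reserved for notebooks

def pick_random_port(spawner):
    # Runs on the Hub just before the pod is created; collisions on the
    # target node are still possible, since availability isn't verified.
    spawner.port = random.choice(HOST_PORT_RANGE)

c.KubeSpawner.pre_spawn_hook = pick_random_port
```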

ramkrishnan8994 commented 5 years ago

> Traditionally, when JupyterHub tries to find a 'random port', it finds a random port that is unused on the machine the JupyterHub process is running on. That doesn't work here, since you'd need to find a random port that isn't in use on any of the machines in the cluster. I'm not entirely sure how to do that in a clean way.
>
> Are there ways to run spark that don't require hostNetwork? Can the pod network range be directly reachable from spark?

@yuvipanda - Spark itself does not require running on hostNetwork, but the images in JupyterHub's docker-stacks do.

Caused by: java.io.IOException: Failed to connect to jupyter-doe-xxxxx:39003
Caused by: java.net.UnknownHostException: jupyter-doe-xxxxx

This is the error that Spark throws when it tries to run a job with hostNetwork disabled. The jupyter-doe-xxxxx pod is the pod generated for the user. Since our Spark cluster also runs on K8s, and the JupyterHub pod is not on hostNetwork, it is not able to resolve the pod.

Could the jupyter-doe-xxxxx pod be made a StatefulSet? We've generally seen these types of issues solved that way. Not sure if it would work here, but it's worth a try.

rkdarst commented 5 years ago

By the way, a similar patch recently got accepted into batchspawner: https://github.com/jupyterhub/batchspawner/pull/58. It has created some problems when interacting with other features; you can see the issues in the batchspawner tracker. A similar approach could be used here.

But... if that approach is used here, it may be time to add native support for this in JupyterHub. I think that would solve some of the subtle issues we keep seeing... but I'm not able to do it myself.

Note to @cmd-ntrf who wrote it originally.

cmd-ntrf commented 5 years ago

As @rkdarst mentioned, we encountered a similar issue with batchspawner.

The solution we opted for was to write an API handler that is installed on the Hub side. The handler waits to receive the port number from the singleuser server and modifies the spawner's port value. The spawner is identified based on the user's auth, but I recently submitted a patch to use the API token instead, to support named servers.

To send the port, I wrote a small wrapper script that selects a port, configures the singleuser server to use it, sends it over HTTP to the Hub at the API handler's address, and then starts the notebook just as singleuser would.
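
Roughly, such a wrapper could look like this (a sketch of the idea, not batchspawner's actual script; the endpoint path is an assumption, while JUPYTERHUB_API_URL and JUPYTERHUB_API_TOKEN are environment variables JupyterHub sets for single-user servers):

```python
#!/usr/bin/env python3
"""Pick a free local port, report it to the Hub, then exec the real server."""
import json
import os
import socket
import urllib.request

# Let the kernel pick a port that is free on the node we actually landed on.
with socket.socket() as s:
    s.bind(("", 0))
    port = s.getsockname()[1]

# POST the port to a Hub-side API handler (endpoint name assumed here),
# authenticated with the API token JupyterHub injects into the environment.
req = urllib.request.Request(
    os.environ["JUPYTERHUB_API_URL"] + "/batchspawner",  # assumed endpoint
    data=json.dumps({"port": port}).encode(),
    headers={"Authorization": "token " + os.environ["JUPYTERHUB_API_TOKEN"]},
)
urllib.request.urlopen(req)

# Hand off to the real single-user entrypoint on the chosen port.
os.execvp("jupyterhub-singleuser", ["jupyterhub-singleuser", f"--port={port}"])
```

(There is a small window between releasing the probe socket and the server binding it, but in practice that race is rarely a problem.)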

There is a problem though. JupyterHub does not provide a mechanism to automatically register API handlers from third parties. Currently, the API handler is registered when the batchspawner module is imported, but in some cases, like when using wrapspawner, the module is imported after JupyterHub is initialized and the batchspawner API handler is not registered properly. As a workaround, we currently instruct users to import batchspawner in jupyterhub_config.py, which is not ideal, but it works.
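
That workaround amounts to a single line in jupyterhub_config.py:

```python
# jupyterhub_config.py
# Importing batchspawner at config-load time registers its API handler with
# the Hub, even when the spawner class itself is wrapped (e.g. by wrapspawner).
import batchspawner  # noqa: F401
```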

Ideally, the API handler I wrote for batchspawner would be integrated directly into JupyterHub to configure the port number. Another option would be to implement a mechanism similar to the one in Jupyter that allows the installation and activation of server-side plugins/handlers. I am willing to help with either solution or anything related.

frouzbeh commented 5 years ago

@ramkrishnan8994 I haven't been able to connect to our Spark YARN cluster yet, and I'm getting the following exception in both cases (hostNetwork enabled and disabled):

org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:514)
        at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:307)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
        at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:827)
        at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:46086
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:46086
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        ... 1 more
Caused by: java.net.ConnectException: Connection refused
        ... 11 more

In both cases, I can see that my application is accepted by the YARN manager, and then after a few seconds it stops.

frouzbeh commented 5 years ago

@cmd-ntrf Apparently none of the developers are interested in resolving this issue. Would you please give me some guidance on your solution? Can you share it with me? Thanks

ramkrishnan8994 commented 5 years ago

@frouzbeh - Were you able to come up with any solution for this? We have dropped JupyterHub because of this.

@rkdarst

frouzbeh commented 5 years ago

@ramkrishnan8994 Well, that's crazy, because I thought I had to use host networking, but without it my Spark works fine and now we don't have the port problem.

qzchenwl commented 5 years ago

@frouzbeh How do you make spark work without host net?

ramkrishnan8994 commented 5 years ago

> @ramkrishnan8994 Well, that's crazy, because I thought I had to use host networking, but without it my Spark works fine and now we don't have the port problem.

Are you connecting to a local Spark or a remote Spark? We connect to a standalone Spark cluster, and that requires hostNetwork to be enabled.

frouzbeh commented 5 years ago

@ramkrishnan8994 My Kubernetes and Hadoop clusters run on the same physical machines. I thought that connecting to Spark from the client side required hostNetwork, but it doesn't. I just needed to set spark.driver.host in the SparkConf to the IP address of the container.
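
For anyone hitting the same thing, the fix looks roughly like this in PySpark (a sketch; it assumes the driver runs in the user's pod and the pod network is routable from the Spark/YARN nodes):

```python
import socket
from pyspark.sql import SparkSession

# The pod's own IP address, which executors must be able to reach
# so they can call back to the driver.
pod_ip = socket.gethostbyname(socket.gethostname())

spark = (
    SparkSession.builder
    .appName("notebook")
    .config("spark.driver.host", pod_ip)
    .getOrCreate()
)
```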