qzchenwl opened this issue 5 years ago
@qzchenwl Hi, have you resolved this issue yet? Here they suggested using '--port=%i' % port in the docker command. I was thinking maybe we can pass it through the hub's extra config so it chooses the port randomly.
@frouzbeh Not yet. The DockerSpawner solution is not suitable for KubeSpawner: DockerSpawner runs on the local machine, so it can find a usable port before it starts. With KubeSpawner you don't know which host the container will be assigned to, hence you don't know beforehand which port is usable there.
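For context, the "find a usable port before it starts" trick that a local spawner can rely on boils down to asking the local kernel for a free port, roughly like this sketch (the function name is illustrative). It only ever inspects the host it runs on, which is exactly why the Hub cannot do it on behalf of whichever remote Kubernetes node the pod ends up on:

```python
import socket

def random_free_port() -> int:
    """Ask the OS for an ephemeral port that is free *on this host only*."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0: let the kernel pick an unused port
        return s.getsockname()[1]
```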
@qzchenwl Well that's an issue and I hope somebody will take care of it.
@minrk, @yuvipanda Hi, don't you have any comment or solution on this issue?
Traditionally, when JupyterHub tries to find a 'random port', it finds a random port that is unused on the machine the JupyterHub process is running on. That doesn't work here, since you'd need to find a random available port that isn't used on any of the machines involved. I'm not entirely sure how to do that in a clean way.
Are there ways to run spark that don't require hostNetwork? Can the pod network range be directly reachable from spark?
@yuvipanda thanks, I'm not expert on Spark, but I can ask our administrator to see if it's possible, but can't we provide a range of ports so kubespawner selects randomly from that range?
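A rough sketch of that port-range idea (the names below are illustrative, not an existing KubeSpawner option): the Hub could pick a port at random from a configured range before creating the pod. Note that this only reduces the probability of a hostNetwork collision on whichever node the pod lands on; it cannot guarantee the port is actually free there.

```python
import random

# Illustrative range; a real deployment would make this configurable.
PORT_RANGE = range(32000, 33000)

def pick_port(ports: range = PORT_RANGE) -> int:
    """Pick a random port from the allowed range (no free-ness guarantee)."""
    return random.choice(ports)

# A hypothetical spawner subclass could then apply it per launch:
#
# class PortRangeSpawner(KubeSpawner):
#     async def start(self):
#         self.port = pick_port()
#         return await super().start()
```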
@yuvipanda - Spark does not have a requirement that it runs on hostNetwork. But the images in JupyterHub's docker-stacks do have that requirement.
Caused by: java.io.IOException: Failed to connect to jupyter-doe-xxxxx:39003
Caused by: java.net.UnknownHostException: jupyter-doe-xxxxx
This is the error that Spark throws when it tries to run a job with hostNetwork disabled. The jupyter-doe-xxxxx pod is the pod generated for the user. Since our Spark cluster also runs on K8s, and since the JupyterHub pod is not on hostNetwork, Spark is not able to resolve the pod.
Can the jupyter-doe-xxxxx pod be made a StatefulSet? We've generally seen these types of issues solved that way. Not sure if it can be solved here, but worth a try.
By the way, a similar patch recently got accepted into batchspawner: https://github.com/jupyterhub/batchspawner/pull/58. It has created some problems when interacting with other features; you can see the issues in the batchspawner tracker. A similar approach could be used here.
But... if that approach is used here, it may be time to add native support for this in JupyterHub. I think that would solve some of the subtle issues we keep seeing... but I'm not able to do it myself.
Note to @cmd-ntrf who wrote it originally.
As @rkdarst mentioned, we encountered a similar issue with batchspawner.
The solution we opted for was to write an API handler that is installed on the Hub side. The handler waits to receive the port number from the single-user server and modifies the spawner's port value. The spawner is identified based on the user auth, but I have recently submitted a patch to use the API token instead, to support named servers.
To send the port, I wrote a small wrapper script that selects a port, configures the single-user server to use it, sends it over HTTP to the Hub at the API handler's address, then starts the notebook just as singleuser would.
There is a problem though. JupyterHub does not provide a mechanism for third parties to automatically register API handlers. Currently, the API handler is registered when the batchspawner module is imported, but in some cases, such as when using wrapspawner, the module is imported after JupyterHub is initialized, so the batchspawner API handler is not registered properly. As a workaround, we currently instruct users to import batchspawner in jupyterhub_config.py, which is not ideal, but it works.
Ideally, the API handler I wrote for batchspawner would be integrated directly into JupyterHub to configure the port number. Another option would be to implement a mechanism, similar to the one in Jupyter, that allows the installation and activation of server-side plugins/handlers. I am willing to help with either solution or anything related.
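The wrapper-script idea described above can be sketched roughly as follows. This is a hedged reconstruction from the description, not batchspawner's actual code: the handler path ("/batchspawner") and the JSON payload shape are assumptions, while the JUPYTERHUB_API_URL and JUPYTERHUB_API_TOKEN environment variables are the standard ones JupyterHub sets for single-user servers.

```python
import json
import os
import socket
import urllib.request

def pick_free_port() -> int:
    """Pick a port that is free on the machine the server will actually run on."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def report_port_to_hub(port: int, hub_api_url: str, api_token: str) -> None:
    """POST the chosen port to the Hub-side handler so the spawner can record it."""
    req = urllib.request.Request(
        hub_api_url,
        data=json.dumps({"port": port}).encode(),
        headers={"Authorization": f"token {api_token}"},
        method="POST",
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    port = pick_free_port()
    report_port_to_hub(
        port,
        os.environ["JUPYTERHUB_API_URL"] + "/batchspawner",  # assumed handler path
        os.environ["JUPYTERHUB_API_TOKEN"],
    )
    # Then start the notebook on that port, e.g.:
    # os.execvp("jupyterhub-singleuser", ["jupyterhub-singleuser", f"--port={port}"])
```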
@ramkrishnan8994 I haven't been able to connect to our Spark YARN cluster yet, and I'm getting the following exception in both cases (hostNetwork enabled and disabled):
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:514)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:307)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:827)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:46086
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:46086
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.net.ConnectException: Connection refused
... 11 more
In both cases, I can see my application being accepted in the YARN manager, and then after some seconds it is stopped.
@cmd-ntrf Apparently none of the developers are interested in resolving this issue. Would you please give me some guidance on your solution? Can you share it with me? Thanks
@frouzbeh - Any solutions you were able to come up with for this? We have dropped JupyterHub because of this.
@rkdarst
@ramkrishnan8994 Well, that's crazy, because I thought I had to use host networking, but without it my Spark works fine, and now we don't have the port problem.
@frouzbeh How do you make spark work without host net?
Are you connecting to a local Spark or a remote Spark? We connect to a standalone Spark cluster, and that requires hostNetwork to be enabled.
@ramkrishnan8994 My Kubernetes and Hadoop clusters are physically on the same computer cluster. I thought that to connect to Spark from the client side I needed hostNetwork, but I don't. I just needed to set spark.driver.host in SparkConf to the IP address of the container.
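In code, that client-side fix looks roughly like the sketch below (assuming pyspark is installed; the socket-based IP lookup is one common way to get the pod's own address, not something stated in the thread):

```python
import socket

from pyspark import SparkConf

# The pod's own IP address; Spark executors must be able to reach it.
pod_ip = socket.gethostbyname(socket.gethostname())

conf = (
    SparkConf()
    .setAppName("notebook")            # illustrative name
    .set("spark.driver.host", pod_ip)  # tell executors where the driver lives
)
# sc = SparkContext(conf=conf)  # then create the context as usual
```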
According to the documentation, KubeSpawner should use a randomly allocated port. I deployed zero-to-jupyterhub-k8s (with hostNetwork: true set for Spark) and got this error when some users logged in: [Warning] 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 Insufficient memory, 2 node(s) didn't have free ports for the requested pod ports.
That's because KubeSpawner always uses port 8888 instead of a random port: https://github.com/jupyterhub/kubespawner/blob/master/kubespawner/spawner.py#L145
https://jupyterhub-kubespawner.readthedocs.io/en/latest/spawner.html