jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

Trouble connecting worker to a running jupyterhub #186

Closed nikl11 closed 3 years ago

nikl11 commented 4 years ago

Hello, I am having difficulty running notebooks remotely and connecting them to jupyterhub. Let me explain my setup and what I am trying to achieve. I have a jupyterhub server running on my laptop (in kubernetes, but I don't think that matters) and I want to spawn notebooks on a remote supercomputer through PBS. On my laptop I can run a 'qsub' command to submit PBS jobs. After a user logs in, I want jupyterhub to submit a qsub job that reserves a worker node; the worker node then starts a jupyter notebook that connects to my jupyterhub (running on the laptop). On my laptop I can then work in the notebook, and any program I create in it (let's say in C++) runs on the allocated PBS node. Seems pretty straightforward.

What I struggle with is what to actually put into the PBS script to start the notebook and connect it to jupyterhub. In batchspawner.py there is a basic PBS script that contains {cmd}, which expands to batchspawner-singleuser jupyterhub-singleuser --ip="0.0.0.0". That makes no sense to me: there is no such command.

I installed jupyter notebook (and also jupyterhub, although I am pretty sure that should not be necessary) on the PBS cluster, locally in my home directory, with 'pip install jupyterhub notebook --user'. But what commands do I put into the qsub job that jupyterhub submits? My idea is something like 'jupyter-notebook --port=8888', but I have no idea how to tell the notebook where the hub is. Let's say my laptop's public IP is 1.2.3.4 and the hub is running on port 8000: do I make an SSH tunnel between the laptop and the worker node? Or how else do I start a remote notebook and connect it to a running jupyterhub that is in the state "Job running, connecting to a notebook..."? I am not sure if I have explained my struggles well enough or given enough details, but the post is long enough already, so I will add details when somebody asks. Thanks a lot for your help!

welcome[bot] commented 4 years ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

rkdarst commented 4 years ago

Hi,

From what I understand:

First off:

To answer your exact question:

I'll leave the answer at that and wait for your feedback to understand what you want to do, and can answer more if needed.

nikl11 commented 4 years ago

Thank you for your answer @rkdarst. I have actually managed to make some progress since the last post, but so far no cigar. Just to clarify some things you mentioned (the post seems long, but please bear with me, it will all be clear in a minute):

1) The jupyterhub is actually not meant to be used just by me. We want to offer it to users of our small cluster as a free alternative to paid nodes with jupyterhub from Google and similar. The jupyterhub will run on one of our cloud virtual nodes with kubernetes, which users can connect to via Kerberos authentication and which can submit jobs via PBS's qsub to our compute nodes. I said it runs on my laptop because I wanted to simplify things; the whole setup is a little more complicated, as I will explain below.

2) I made a mistake by installing batchspawner via pip install batchspawner, which actually installs some archaic version that does not have PBSSpawner and batchspawner-singleuser. Once I figured out this was the issue and downloaded and installed the latest master branch, I moved forward.

3) Right now I am stuck at this problem: I installed jupyterhub, jupyter notebook and batchspawner on the machine running jupyterhub and on the network disk, in my home directory, which all compute nodes have access to. I use PBSSpawner basically as it is already written: in the qsub script I left the {cmd} and it runs. But now the hub is stuck on "job running, connecting..." and the notebook reports that it can't update the hub:

    [I 2020-07-31 18:32:46.760 SingleUserNotebookApp singleuser:561] Starting jupyterhub-singleuser server version 1.1.0
    [I 2020-07-31 18:32:46.822 SingleUserNotebookApp notebookapp:1924] Serving notebooks from local directory: /auto/brno6/home/fsbrno2/beda
    [I 2020-07-31 18:32:46.823 SingleUserNotebookApp notebookapp:1924] The Jupyter Notebook is running at:
    [I 2020-07-31 18:32:46.823 SingleUserNotebookApp notebookapp:1924] http://adan45.grid.cesnet.cz:42095/user/beda/
    [I 2020-07-31 18:32:46.823 SingleUserNotebookApp notebookapp:1925] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [I 2020-07-31 18:32:46.891 SingleUserNotebookApp singleuser:542] Updating Hub with activity every 300 seconds
    [E 2020-07-31 18:33:06.913 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/storage/brno2/home/beda/.local/lib/python3.6/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting

I even tried to make an SSH tunnel to connect the notebooks to the hub through its API port 8081, as ssh ... -L 8081:localhost:8081 ... user@ip_of_jupyterhub_node, where ip_of_jupyterhub_node is a public IP that the firewall allows to communicate on ports 443 (the HTTPS port users connect to in a web browser) and 8081 (the jupyterhub API). If I don't make this SSH tunnel, the notebook crashes on the second line because it can't find a hub to connect to (the error output above is with the tunnel, and the tunnel reports that there is communication happening on port 8081). With the SSH tunnel it finds the hub but fails to update it, which is frankly very weird to me: how can you make a successful handshake but not be able to communicate any further?

One more note: the hub runs in a kubernetes pod, and the pod has both ports 443 and 8081 forwarded to the host, so whenever users connect to jupyterhub with their browser, the connection is forwarded to the pod running jupyterhub. This works fine; I have already managed to provide users with a jupyterhub that runs on 4 virtual nodes with a basic spawner serviced by kubernetes, but now I would like users to be able to work on our GPU nodes and nodes with more memory, so I thought batchspawner would be the perfect option. Thanks a lot!
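For anyone debugging a similar setup, it can help to separate a plain network problem (firewall, tunnel, port forwarding) from a JupyterHub-level one by testing TCP reachability from the compute node. Below is a minimal Python sketch of such a check. The helper name `can_reach` is hypothetical, and the host and port in the usage comment are placeholders. It only verifies that a TCP connection opens within the timeout, not that the hub API answers correctly.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts the connection;
        # DNS failures, refusals and timeouts all raise subclasses of OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (placeholder values, run on the compute node):
#   can_reach("ip_of_jupyterhub_node", 8081)
```

The same check can be run in the other direction, from the hub host towards the host name and port that the single-user server reports in its log, since the hub also has to reach the notebook.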

nikl11 commented 4 years ago

Hey, I am still stuck at the same issue with "Error notifying hub of activity", does anybody have any ideas? Thanks a lot!

nikl11 commented 4 years ago

Solution found (FINALLY): The problem was in batchspawner.py. In PBSSpawner there is a line that extracts the hostname of the node where the notebook runs:

    state_exechost_re = Unicode(r'exec_host = ([\w_-]+)/').tag(config=True)

It extracts it from the job info, which on PBS is obtained with qstat -fx {jobID}. The exec_host part of my output looks like this:

    exec_host = konos5/4
    exec_host2 = konos5.fav.zcu.cz:15002/4

The current regex picks up only konos5 from the first line, but I need the whole hostname konos5.fav.zcu.cz. So I changed the regex to:

    state_exechost_re = Unicode(r'exec_host2 = ([\w\._-]+):').tag(config=True)

(edit: fixed position of ':')
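The difference between the two patterns can be reproduced outside of batchspawner. The following is a small Python sketch that applies both regexes to the qstat fragment quoted above. It assumes the spawner applies the pattern with a regex search over the job status text; `extract_host` is a hypothetical helper, not a batchspawner function.

```python
import re

# Fragment of `qstat -fx {jobID}` output as quoted in this thread.
QSTAT_OUTPUT = """\
    exec_host = konos5/4
    exec_host2 = konos5.fav.zcu.cz:15002/4
"""

ORIGINAL_RE = r'exec_host = ([\w_-]+)/'      # pattern shipped in PBSSpawner
PATCHED_RE = r'exec_host2 = ([\w\._-]+):'    # pattern proposed in this thread


def extract_host(pattern, text):
    """Return the first capture group of `pattern` found in `text`, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


short_name = extract_host(ORIGINAL_RE, QSTAT_OUTPUT)
full_name = extract_host(PATCHED_RE, QSTAT_OUTPUT)

print(short_name)  # konos5
print(full_name)   # konos5.fav.zcu.cz
```

The short name only works when the hub can resolve it, which may not hold when the hub runs outside the cluster's DNS domain. Since the trait is tagged `config=True`, it should also be possible to set the pattern in `jupyterhub_config.py` as `c.PBSSpawner.state_exechost_re` instead of editing batchspawner.py, although that route was not tested in this thread.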

It was very hard for me to spot because it does not show up in any debugging logs. I am not very good at git, so could someone do the magic and offer a fix and merge it, or at least update the docs? I spent weeks on this problem and I am pretty sure other people will run into it. @rkdarst?