Great question, @swallace21! Thank you for opening the issue :)
The reasons are broadly laid out in http://words.yuvi.in/post/the-littlest-jupyterhub/. TLDR is that setting up and running your own JupyterHub 'from scratch' takes time, effort & know-how. TLJH is a 'JupyterHub distribution' that bundles a bunch of opinionated choices for a specific use case. The blog post contains more detail :)
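To give a flavour of the difference: the TLJH docs reduce the whole setup to a single documented bootstrap command on a fresh Ubuntu server (the admin username is a placeholder):

```bash
# TLJH's documented one-line installer; run on a fresh Ubuntu server with curl
curl -L https://tljh.jupyter.org/bootstrap.py \
  | sudo python3 - --admin <admin-user-name>
```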
We should probably add a non-blog piece of documentation about this to the docs.
Thanks, Yuvi! :)
We want to create a JupyterHub for a data science class of 30-40 students this fall. TLJH looks like a great solution, but since it is in an alpha state we are hesitant to commit to it over the standard JupyterHub linked in the first post.
In future documentation, explicitly listing the opinionated parts could be super helpful when assessing whether or not to adopt TLJH.
Reading through the blog post again, the unique parts of TLJH compared to a non-opinionated JupyterHub are the choices it lists, plus everything from the 'User environment' paragraph onward.
I just discovered this project, launched in June - ironically, I spent half of April setting up a JupyterHub environment for our lab! It turns out most of the configuration choices I made are the same as those suggested here (with one exception: I use NGINX as a reverse proxy, as I retain a healthy skepticism about running public services directly on nodejs).
I'd like to add a use case, if I may - most of the docs refer to students or imply a teaching environment. But there is a similar-but-different use case, which is sharing a computing resource within a group - in our case a fairly beefy compute server (we have sort of mid-tier computing requirements where everyone occasionally needs some brute force, but it doesn't make sense for each person to have a pro workstation, so we share a 14-core system). It turns out the requirements are pretty much the same as running for students (e.g. the need for each user to have some isolation from the system and from each other to minimize problems). But including this as a use case might help brainstorm other useful improvements.
@mangecoeur this is great feedback, thanks. Shared compute for research / analytics groups is definitely one of the use-cases of JupyterHub! There are some things that are a bit behind on TLJH (e.g. until recently you couldn't run HTTPS on it) which is why we've been hesitant to recommend it for "production" deployments. However, it would be fantastic if you could share the ways in which your team uses JupyterHub!
@choldgraf happy to! I'm dumping some of my experiences in no particular order - sorry if it ends up a bit long (I'm procrastinating)!
First, note that I set up JupyterHub with JupyterLab, since this gives a much better, desktop-like experience.
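For anyone reading along, the relevant setting is a one-liner in jupyterhub_config.py - a minimal sketch, assuming the config file lives under /etc/jupyterhub/ (the path is just an example):

```bash
# Make spawned servers open JupyterLab instead of the classic notebook UI;
# /etc/jupyterhub/ is an assumed location for the config file.
cat >> /etc/jupyterhub/jupyterhub_config.py <<'EOF'
c.Spawner.default_url = '/lab'
EOF
```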
The motivation for using JupyterHub was to give each person in the group easy access to a Python environment backed by a powerful machine. Most of the users have STEM backgrounds but do not have much programming experience or familiarity with command-line tools. The machine had to be remotely accessible and shared - one option would have been remote desktops, but JupyterHub gave a much neater solution with less overhead. The idea was not to heavily restrict what users can do, but to let them do what they need without the risk of messing up the system or worrying about admin tasks (this might be a big difference from a teaching environment).
There were several motivations for installing it locally (rather than using an existing notebook service or running our own service on a cloud provider). Firstly, we wanted access to a powerful machine - with cloud providers this is either not an option or quickly gets expensive, while adding dynamic scaling would have been complete overkill. In any case, a 'bare metal' local machine is simply faster and more responsive, and is physically accessible as needed. Another reason was limitations on where our data could legally be stored and processed. Finally, it was much simpler to buy a one-off machine than to set up a subscription.
People access Jupyter over HTTPS (100% necessary because of some madness about how our network is set up). By default they all share read-only access to the same conda Python environment, which is also the one Jupyter/JupyterHub itself runs in. This was a shortcut I took because I didn't have time to figure out how to pre-configure per-user envs for each person, and I didn't want people to be able to install anything system-wide. Instead, there is a set of default packages, and the conda command is available globally, so users can choose to create their own custom conda env in their user directory and install ipykernel etc. for the notebook (which is then picked up in JupyterHub). Thanks to JupyterLab they can do this directly using the in-browser terminal (although people at this more advanced level mostly understand ssh etc.).
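For concreteness, the per-user flow from the in-browser terminal looks roughly like this (env name and path are just examples, not what we actually use):

```bash
# Create a private env under the user's home directory...
conda create --yes --prefix ~/conda-envs/myenv python=3 ipykernel
conda activate ~/conda-envs/myenv
# ...and register it as a kernel; it then shows up in the JupyterLab launcher
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
```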
We have a couple of services running on the same server (e.g. RStudio Server), so NGINX proxies everything to various paths (/jupyter, /rstudio). This was honestly one of the biggest headaches and most frustrating time-sinks to get right - each service needed a different config to achieve the same thing, there were odd cases with/without a trailing slash in the URL, and JupyterHub needs to be configured with its own prefix (i.e. no 'transparent' proxying), especially with the need to have several services play together.
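In case it saves someone the same pain, here is a stripped-down sketch of the two halves that have to agree - the port, paths, and file locations are illustrative, and the SSL directives are omitted:

```bash
# nginx side: proxy /jupyter/ to JupyterHub, forwarding websocket upgrades
cat > /etc/nginx/conf.d/jupyterhub.conf <<'EOF'
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}
server {
    listen 443 ssl;
    # ssl_certificate / ssl_certificate_key directives go here

    location /jupyter/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        # without the next two lines, kernel websockets silently fail
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
    }
}
EOF

# JupyterHub side: it must know its own prefix (no 'transparent' proxying)
echo "c.JupyterHub.base_url = '/jupyter'" >> /etc/jupyterhub/jupyterhub_config.py
```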
Another time-consuming part of the server config was all the security bits - SSL certs, turning on the firewall. I enabled fail2ban because we immediately got attempted SSH logins from hacking bots. There are probably other things I'm forgetting now - basically a long tail of config and tweaks that made me wish for some kind of 'appliance' that would deal with all this. I looked at Docker/containers, but that doesn't address host server configuration (it basically assumes you will run on the cloud and someone at Amazon will do all of that for you). I think this project is going in the right direction!
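The commands themselves are short - it's knowing that you need them that takes the time. Roughly, on Ubuntu (the profile names assume the stock OpenSSH and nginx packages):

```bash
# Firewall: only let ssh and web traffic in
sudo ufw allow OpenSSH
sudo ufw allow 'Nginx Full'   # ports 80 and 443
sudo ufw enable

# fail2ban's default jail already rate-limits sshd brute-force attempts
sudo apt install fail2ban
```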
Just to quickly add: HTTPS is definitely something that needs assistance. I'm still having issues sometimes with websockets (you can easily accept self-signed certs for HTTPS, but in some browsers websockets die). Too many things in this space assume that a server == a publicly accessible page with a domain. So if you want to run an IP-only service with a quick self-signed cert, you will have problems (I'm thinking of creating a root cert just for the group).
@mangecoeur please do keep sharing your experiences with HTTPS and any consistent pain points. Remember this is a community-run project, so your input (and/or PRs) is super valuable! :-)
@choldgraf well, the pain points about HTTPS are: all of them! I found some guides, but I'm basically doing crypto-arcana blind, trusting that the commands I'm copy-pasting are correct. Part of the underlying issue is that our IP is out on the open internet (actually I have no idea why our institution's network is like this), so things need to be reasonably secure. But I don't want to set up a domain name for it (one more thing to maintain, plus I'm afraid it will increase the online visibility of what is a private service). However, HTTPS basically assumes you have a domain. In the best cases you can tell the browser to accept untrusted certs; it seems that for Edge and Chrome you have to jump through more hoops to get them to remember the exception.
Today an update to Firefox Nightly broke websockets (it refuses to connect to an untrusted cert over wss), and I had similar issues connecting with Safari on an old iPad. (OK, so both of these are edge cases - one can expect a Nightly to break, and the iPad is on something like iOS 9.)
Today I started exploring making a DIY root CA. I'm thinking this could be the most robust method, combined with instructions (or even a script) to help people install it.
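For anyone wanting to follow along, the openssl version of the DIY-CA idea looks roughly like this (the IP is a placeholder; ca.crt then has to be installed once on each person's machine):

```bash
# 1) One-time: create the group's root CA
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
  -keyout ca.key -out ca.crt -subj "/CN=Our Lab CA"

# 2) Issue a server cert whose SAN is the bare IP (browsers require the SAN)
openssl req -newkey rsa:2048 -nodes -keyout server.key -out server.csr \
  -subj "/CN=203.0.113.10"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 825 -sha256 -out server.crt \
  -extfile <(printf "subjectAltName=IP:203.0.113.10")
```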
Another thing around auth/security: we have a small landing page with links to the running services (mainly Jupyter, RStudio, and Shiny). I wanted to password-protect this page using PAM basic auth, which worked OK with nginx, except that I could not manage to disable auth for the proxied services, so people end up having to log in twice (first basic auth, then JupyterHub auth). Not such a big deal, but something to add to the 'long tail' of config issues to get right for a truly seamless experience.
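Reading the nginx docs again, it seems `auth_basic off;` inside the proxied locations should avoid the double login - an untested sketch, with illustrative paths:

```bash
cat > /etc/nginx/snippets/landing.conf <<'EOF'
# password-protect only the landing page itself
location = / {
    auth_basic "Lab services";
    auth_basic_user_file /etc/nginx/.htpasswd;
}

# the proxied apps do their own login, so switch basic auth back off
location /jupyter/ {
    auth_basic off;
    proxy_pass http://127.0.0.1:8000;
}
EOF
```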
Ah, while I'm at it: performance monitoring! It's important to keep an eye on what people are doing, as well as to help each person decide how many resources they can/should use (e.g. don't try to use all the CPU when 4 other people are working). Each user might also need to track how their programs are doing (or whether they are even doing anything).
I installed Cockpit for server monitoring, but frankly (at least on Ubuntu) it's a bit bare - e.g. it doesn't give you per-CPU stats, and the memory usage doesn't distinguish between 'live' and 'cache' memory. It's mostly handy for checking systemd logs. Otherwise I use htop from the JupyterHub console.
It would be really nice to have some perf-monitoring tools on the JupyterHub admin page. I can see the JupyterHub people seeing this as out of scope, though. Perhaps a way to write an extension to show this information could be good? (I thought, e.g., of something capable of displaying stats from https://nicolargo.github.io/glances/.)
For TLJH I don't know if it makes sense to add anything beyond htop etc., but it's a thought (again, maybe glances in server mode or something).
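For what it's worth, glances' server/web mode is only two commands (default port shown; I haven't tried wiring it into the hub UI):

```bash
pip install 'glances[web]'    # install with the web-UI extra
glances -w                    # serve a dashboard at http://<server>:61208/
```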
@choldgraf just saw this today, might be useful for HTTPS: https://github.com/FiloSottile/mkcert
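From its README, the whole flow is two commands, and it handles bare-IP SANs out of the box (the IP below is a placeholder):

```bash
mkcert -install        # create a local root CA and trust it system-wide
mkcert 203.0.113.10    # issue a cert valid for that IP
```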
I've not read through this whole issue, but it seems like it was answered, and we now have a prominent section describing when to use TLJH, which I think covers the question at least to some extent.
I was looking for a post or other information on why deploying TLJH for a 30-student class would be better than just deploying JupyterHub on a VM: https://github.com/jupyterhub/jupyterhub/blob/master/docs/source/quickstart.md
Thanks!