jupyterhub / jupyterhub

Multi-user server for Jupyter notebooks
https://jupyterhub.readthedocs.io
Other
7.75k stars 2.01k forks source link

Named server backend implementation questions #1956

Closed krinsman closed 6 years ago

krinsman commented 6 years ago

I have some dumb questions about what changes and where need to be made to JupyterHub internals in order to enable the named servers feature. There are obviously a lot of them, so if it is only possible to answer one or two of them (due to time constraints and similar issues), I would still be immensely grateful if you did so. My goal is to avoid doing something massively stupid and consequently later making a garbage PR which even more completely wastes your time.

Here are my guesses about what at a minimum might need to be changed; please feel free to correct anything and everything that is wrong:

Here are some actual, concrete questions:

  1. Currently, it seems that all of the notebook server associated with a given user have to be/are expected to be on the same host. Is this correct? If so, this would also need to be changed to enable named servers since it isn't a reasonable assumption on practice (for example, a user at an HPC might want to have one notebook server on a remote host which is a node dedicated to interactive jobs, and another notebook server on a different remote host which is a batch job node).
  2. Is the idea that each of the named servers would be identified by a given method by the parameter value server_name? (The current default values for that is always ''.) And is the conclusion correct, that there is currently no connection implemented between the value of allow_named_servers and the possible input values for server_name?
  3. Depending on the answers to the second question, why is server_name an input value of the spawn method? Does this prevent the remote host for the new (named) notebook server for the user from being chosen dynamically at run-time? Imagine the case where there are two remote nodes on which users at an HPC can spawn notebook servers, and the administrators want to implement some sort of load-balancing scheme. Would some code like the following be necessary to add to the spawn method in the User class:
if self.allow_named_servers:
    # an empty string as the value of `server_name` passed to `spawn` would indicate that the notebook server/remote host for the new notebook server should be determined dynamically at run-time
    if server_name == '':
        server_name = random.choice(list_of_possible_servers_or_remote_hosts)
        ...
    else:
        ...

Obviously that is very stupid pseudo-code, but maybe it helps to communicate more clearly what I am trying to ask.

  1. The analogous question might apply for progress_url as well, but to be honest I don't really understand what that method is intended to do, so I am not actually sure.
  2. The spawners attribute of the User class wouldn't really have anything to do with any named server implementation, right? It's not listing the possible remote hosts, or possible names that could be given to the notebook servers for a given user, it's just listing the possible spawning methods (e.g. DockerSpawner or KubernetesSpawner)?
  3. Along the lines of question 5, there is no reason to expect there to be any mapping (mathematical function, each input has unique output) between possible spawners, possible names for users' notebook servers, and possible (remote) hosts on which the users' notebook servers could be hosted?
  4. This could probably also be merged with questions 5 and 6, but just to confirm, there is no reason to expect that anything needs to be done with or to the _SpawnerDict class in order to implement named servers?
  5. Does the test_named_servers.py contain a (mock) implementation of the named servers feature? I don't really understand in principle how the feature could be at all tested if it hasn't even been implemented yet. It's also a great deal longer than what I would have expected for tests of an unimplemented feature. The major problem is most likely just that I don't actually understand, even at a high level, what is supposed to be accomplished by the functions in that code. The part of the JupyterHub code I am by far the most familiar with is that for the spawners, and even then I would consider my understanding of that rocky at best.

Again, if it at all possible to help with even one of these questions/concerns, it would be immensely appreciated. Please enjoy the rest of your week!

krinsman commented 6 years ago

Edit:

Although this conclusion directly contradicts what it says in the Spawner class here:

https://github.com/jupyterhub/jupyterhub/blob/dc75a9a4b717a436853d580f9530b504fe6642d3/jupyterhub/spawner.py#L49

There it says that there is a one-to-one correspondence between users and spawners, 1 user, 1 spawner. But the above interpretation of the ORM suggests that each user can have multiple instances of spawners (i.e. more than, and not, just one), with each instance corresponding to a different Jupyter Notebook server.

minrk commented 6 years ago

Thanks for digging in to this!

First, I would clarify that named servers do already work and are fully implemented in JupyterHub 0.8. They just need to be created via the REST API, rather than human-clicky UI controls. This is what is tested in test_named_servers. What's missing and required for #1935 is UI for this, and likely updates to a lot of default behavior regarding assumptions that the 'default' server should be selected/launched in various circumstances (e.g. automatically triggering spawn of the default server at login).

Some additions to the orm

It's likely there should be no changes required to the ORM, since named servers are already implemented there. Once we have UI, we may seek additional features such as persisting user_option as discussed in #1935, but that can be done later.

Some additions to the JupyterHub application which cause different behavior to occur when self.allow_named_servers is set to True

These will mostly involve 'default' behaviors, such as triggering spawn on login, default URL, etc. Right now, allow_named_servers only gates the named-server endpoints in the REST API. It is not a dummy attribute, though.

Some additions to the User class which will again cause different behavior when self.allow_named_servers is set to True

It's unlikely that the User class will need significant changes. Most of the changes will probably belong in how the user object is communicated with at the Handler level, in particular cases where there are references to user.spawner and user.pending, etc. which assume that the default Spawner's state is relevant. We will need to revisit many of these.

Currently, it seems that all of the notebook server associated with a given user have to be/are expected to be on the same host. Is this correct?

It is not correct. Servers may be anywhere network-accessible from the Hub/proxy. Spawner.start returns an (ip, port) tuple describing where the running server can be accessed. It does not need to be local.

Is the idea that each of the named servers would be identified by a given method by the parameter value server_name?

Yes, this is how it works right now. The default is '', but this is settable if servers are launched via the REST API

And is the conclusion correct, that there is currently no connection implemented between the value of allow_named_servers and the possible input values for server_name?

It is not. allow_named_servers=False prevents any attempts to launch servers with a name other than the default 'nameless' empty string. Only when allow_named_servers=True can users request servers with non-empty names.

The analogous question might apply for progress_url as well, but to be honest I don't really understand what that method is intended to do, so I am not actually sure.

progress_url should already be doing the right thing for named servers.

The spawners attribute of the User class wouldn't really have anything to do with any named server implementation, right?

User.spawners is where all of the Spawner instances for the named servers reside. It is a dictionary of Spawner instances keyed by the server name. Only when named servers are enabled does this dictionary have more than one entry. It's not listing possible names, it's listing currently allocated names. When you ask for a server named "server2", then user.servers["server2"] will be the Spawner instance corresponding to that name.

there is no reason to expect there to be any mapping (mathematical function, each input has unique output) between possible spawners, possible names for users' notebook servers, and possible (remote) hosts on which the users' notebook servers could be hosted?

There is a 1:1 mapping of name to existing Spawner. Every Spawner has a name, and names are unique per user. The names are not semantic, they are just user-chosen keys in a dictionary, so there is nothing related to hostnames here unless the Spawner provides it and the user chooses to encode that in the name.

This could probably also be merged with questions 5 and 6, but just to confirm, there is no reason to expect that anything needs to be done with or to the _SpawnerDict class in order to implement named servers?

This is already where named servers are stored. It is unlikely that anything needs to change here.

Does the test_named_servers.py contain a (mock) implementation of the named servers feature?

No, it contains a full implementation of named servers via the REST API.

Although this conclusion directly contradicts what it says in the Spawner class here:

Yes, that docstring hasn't been updated since named servers were introduced. It should be updated to account for named servers, since it will be at least N (one or more per user, or exactly one per running notebook server) rather than one per user.

krinsman commented 6 years ago

@minrk Thank you for taking the time to respond to this -- I really appreciate it.

My impression now, based on more review of the code since posting the question, as well as your very thorough and informative answer, is that most of the work with regards to implementing named servers for UI will involve the code in jupyterhub/handlers (and not, of course, jupyterhub/apihandlers, since as you point out named servers are already implemented via the API).

Question: In fewer words, given that I am interested in implementing a UI for the named servers, would you recommend that I focus above all on the code for the handlers (jupyterhub/handlers)?

As far as I can tell, based on trying to better understand test_named_servers.py, the "magic" for implementing named servers via the API is primarily via the UserServerAPIHandler class in jupyterhub/apihandlers/users.py -- is that correct?

If so, what would you say is the most analogous class in jupyterhub/handlers that one should focus on first in trying to implement named servers for UI? Would it be SpawnHandler in jupyterhub/handlers/pages.py? And if so, do you think it would make more sense to create a new handler altogether, or only to modify this and other relevant existing handlers?

If it is actually the spawning features of BaseHandler which one should look at first, is it terms of changing defaults? That one seems less likely to me, since UserServerAPIHandler already implements named servers via the API and is a "grandchild" of BaseHandler. But needless to say I still don't fully understand all of the internals of what happens when a user spawns a new notebook server.

minrk commented 6 years ago

UserSpawnHandler is responsible for serving the /users/yourname page when a server is not running at that URL. This is what triggers spawning right now (i.e. attempting to visit a server triggers its launch). Our choices right now are:

  1. provide a spawn page with inputs there (like we do for options_form), or
  2. move the spawn requesting to API calls from javascript on the home page

I suspect the latter is what we will want, which means that most work will really belong in the home.html template and home.js javascript to build the UI that calls the REST API to launch named servers and then redirect to them.

davidbath commented 6 years ago

This may be considered a hijack of the thread, so feel free to kick me out :)

However, if there was a move towards (2) and making the webUI just perform REST API calls, could this help us to facilitate the opposing use cases we chatted about before - that sometimes visiting a /user/yourname page should automatically spawn a server (e.g. sharing a notebook link), and sometimes you don't want it to (e.g. a datascientist managing a self-service infrastructure, and wanting to stop consuming resources when their experiment is finished).

My thought process was that then all logic for this can move behind the API nicely, and options could be introduced to specify default handling (e.g. 'don't implicitly launch server'). This would then inform the way that the home page is rendered.

krinsman commented 6 years ago

@davidbath It's OK, I am also curious to know the answer to your question.

minrk commented 6 years ago

We could definitely consider disabling spawn upon visiting /user/:name/.... It may well make even more sense with named servers implemented, but we can do it anyway so that behavior is consistent. Then the case becomes: what to do with that URL? It would be quite reasonable, for instance, to show a button to trigger spawn, rather than triggering spawn immediately.

minrk commented 6 years ago

Most actions from pages are already API calls - stop is an API request, spawn from the admin page is already the same API request we would use for start on the home page. So making start be an API call would be quite simple. What would be a little less simple is handling visitors to /user/name/ in the various states that can occur (server starting but not ready, not running at all, etc.). Technically, it shouldn't be a challenge; it's mostly a matter of deciding what the user should experience in the various states everything can be in.

davidbath commented 6 years ago

yes - I was thinking that getting people to the same "start my server" button rather than seamlessly auto-starting would be very reasonable. Indeed, this could always be a config option. Would there be a downside of allowing someone to set an single_user_autospawn_profile = [profile name] which, if specified, would specify what happens if you just visited /user/:name/? If it was unset, then it always takes you to the spawner profile page.

rcthomas commented 6 years ago

I have a slight spin on this question to add. We've been looking at wrapspawner/profilespawner (or customization of it) as a way to manage spawning servers on different systems and/or in a batch job. We thought we could enable named servers but assign the server name (just use the profile name for instance) instead of letting the user pick it. Mainly we just wanted a way to handle the case where users start >1 servers but need a way to keep them straight.

In that case the thing that seemed right to us at first was what I mentioned initially in #1935 (from the options form based on the selected profile). But I am trying to figure out/ask if @minrk's suggested approach through "home" (what's outlined at "There's another option..." here) is compatible with that.

For instance we could implement that but then in our custom case disable user definition of the name and set it when the spawn happens. We would still need all the other UI pieces in our case so we would want to keep working on that.

minrk commented 6 years ago

We thought we could enable named servers but assign the server name

Not at the moment, because the Spawner has been assigned a name before the object is created. It cannot pick its own name. That's not to say it can't be changed in the future, but that's how it works right now. So as it is right now, the spawner name is always picked by the user before any logic from the Spawner class is invoked.

krinsman commented 6 years ago

@minrk This thread was basically created in order to ask about how to make this PR: https://github.com/jupyterhub/jupyterhub/pull/2089

Thus I am not sure if I should close this issue, and have all discussion relegated to that PR, or if both should be left open for now.

krinsman commented 6 years ago

Made obsolete by https://github.com/jupyterhub/jupyterhub/pull/2089