jupyter-server / jupyter_server

The backend—i.e. core services, APIs, and REST endpoints—to Jupyter web applications.
https://jupyter-server.readthedocs.io
BSD 3-Clause "New" or "Revised" License

identity API #638

Closed minrk closed 2 years ago

minrk commented 2 years ago

Problem

As we proceed with authorization (#165, https://github.com/jupyterlab/jupyterlab/issues/11434, https://github.com/jupyterlab/jupyterlab/issues/11355), it's becoming apparent that frontends like JupyterLab are going to want to know, at least to some degree, what permissions they have. They also want things like the name, etc. for populating the identity widget (see https://github.com/jupyterlab/jupyterlab/pull/11443).

Proposed Solution

Example:

user = self.get_current_user()
# backward-compat, cast str to username
if isinstance(user, str):
    user_model = {"username": user} # jupyterhub uses
else:
    user_model = user # assumes dict. Technically could have been any truthy object, before
user_model["permissions"] = self.authorizer.get_permissions(user)

User model (coordinate fields with https://github.com/jupyterlab/jupyterlab/pull/11443) should have at least a username string and permissions dict, representing state of #165.

{
    "username": str, # only required name field. jupyterhub uses only 'name' because it has no 'real' name fields
    # other name fields should all be optional (should be defined, but may be null)
    "given_name": str, # optional
    "permissions": {
        "resource": ["read", "write", "execute"],
    },
}
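The backward-compatibility cast and the model above could be sketched roughly as follows. This is only an illustration under assumptions: `UserModel`, `normalize_user`, and the `get_permissions` callable passed in are hypothetical names for this example, not existing Jupyter Server API.

```python
from typing import Any, Callable, Dict, List, Optional, TypedDict

Permissions = Dict[str, List[str]]


class UserModel(TypedDict):
    """Hypothetical shape of the identity model discussed above."""

    username: str
    given_name: Optional[str]  # defined, but may be None
    permissions: Permissions


def normalize_user(
    user: Any, get_permissions: Callable[[Any], Permissions]
) -> UserModel:
    """Build a user model from whatever get_current_user() returned."""
    if isinstance(user, str):
        # backward-compat: a bare string is treated as the username
        fields: Dict[str, Any] = {"username": user}
    elif isinstance(user, dict):
        fields = dict(user)  # copy so the caller's dict is not mutated
    else:
        raise TypeError(f"unsupported user object: {type(user).__name__}")
    return {
        "username": fields["username"],
        "given_name": fields.get("given_name"),
        "permissions": get_permissions(user),
    }
```

Rejecting non-str, non-dict objects is a choice made for this sketch; the existing behavior allowed any truthy object.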

The permissions part is tricky, because this assumes clear, complete declarative permissions. However, none of the proposed examples in #165 actually work that way. An alternative for the permissions part would be to have an explicit check permissions endpoint that takes a list of permissions to check, and returns which are permitted. This avoids the need to always return all possible permissions, and only returns answers to the questions the client needs to know, and would remove the need for permissions to be part of the identity model.
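A minimal sketch of what the logic behind such a check-permissions endpoint might look like, assuming an `is_authorized(action, resource)` callable that returns true/false. The function name `check_permissions` and the request/response shape are assumptions for illustration only.

```python
from typing import Callable, Dict, List


def check_permissions(
    is_authorized: Callable[[str, str], bool],
    requested: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return, per resource, only the requested actions that are permitted."""
    granted: Dict[str, List[str]] = {}
    for resource, actions in requested.items():
        # One is_authorized call per question the client actually asked.
        granted[resource] = [a for a in actions if is_authorized(a, resource)]
    return granted
```

The server never has to enumerate every possible permission here; it only answers the questions in `requested`.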

In JupyterHub, at least, permissions are a declarative property of the identity model, so we could shift the implementation of the Authorizer in that direction, too.

Additional context

davidbrochart commented 2 years ago

Also pointing to the auth plugin in Jupyverse, where some work was started on this. We use FastAPI-Users, which has a GET /me endpoint for the current user identity.

echarles commented 2 years ago

Talking about users (with an s), I have opened a PR to support sessions: https://github.com/jupyter-server/jupyter_server/pull/391

That PR should be renamed "Multiuser support for jupyter server".

In an RTC context with multiple users on a single jupyter-server, asking for "me" requires that PR (or similar).

fcollonval commented 2 years ago

An alternative for the permissions part would be to have an explicit check permissions endpoint that takes a list of permissions to check, and returns which are permitted. This avoids the need to always return all possible permissions, and only returns answers to the questions the client needs to know, and would remove the need for permissions to be part of the identity model.

It also has the nice side effect that a user's permissions may evolve during a session. For example, user A may share their server with user B, granting read-only access. But during the collaborative session, user A decides to grant execute access to B. This should work transparently for user B.

This scenario is inspired by the workflow of VS Code Live Share, which allows read-only access to terminals by default. You can change that permission to write during a session.

This is also consistent with the classic security practice that authorization and authentication state should not be stored permanently but should be refreshed periodically. The server should therefore handle any authorization and authentication caching, and the client should not persist permissions.
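One way to picture a server-side cache that is refreshed periodically is a short-lived decision cache. This is a sketch under assumptions: `CachedAuthorizer`, its three-argument `is_authorized(username, action, resource)` callable, and the TTL value are all hypothetical and not part of Jupyter Server.

```python
import time
from typing import Callable, Dict, Tuple


class CachedAuthorizer:
    """Cache authorization decisions for a short time, then re-check."""

    def __init__(
        self,
        is_authorized: Callable[[str, str, str], bool],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._is_authorized = is_authorized
        self._ttl = ttl_seconds
        self._clock = clock
        # (username, action, resource) -> (time of check, decision)
        self._cache: Dict[Tuple[str, str, str], Tuple[float, bool]] = {}

    def check(self, username: str, action: str, resource: str) -> bool:
        key = (username, action, resource)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Expired or never seen: ask the authorization provider again.
        allowed = self._is_authorized(username, action, resource)
        self._cache[key] = (now, allowed)
        return allowed
```

With this shape, a grant made by user A mid-session becomes visible to user B once the cached entry expires, without the client storing anything.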

minrk commented 2 years ago

Yes, and as much as possible, the frontend shouldn't need to ask what permissions it has ahead of attempting to take actions that may fail. But I suspect there will be a few cases, at least, where it will want to check to disable certain UI/behaviors.

@echarles I'm not quite sure what the sessions would be required for here. In JupyterHub, and custom endpoints in general, defining get_current_user to return the right user model should be enough, right? Sessions may still make sense, but I don't think they'd be required for /me.

echarles commented 2 years ago

@echarles I'm not quite sure what the sessions would be required for here. In JupyterHub, and custom endpoints in general, defining get_current_user to return the right user model should be enough, right? Sessions may still make sense, but I don't think they'd be required for /me.

Let's say you have User1 and User2 connected on the same Jupyter Server / Tornado instance. When User1 hits an endpoint, I expect the server (or the extension running on top) to be able to discriminate what user is making the request. My understanding is that today the system can not say that.

But maybe I am missing something: either it is not necessary to know who is making the request, or jupyter-server/tornado can already tell today, or in your mental model JupyterHub is responsible for that?

I have renamed https://github.com/jupyter-server/jupyter_server/pull/391 to "Multi user server wit session management"

davidbrochart commented 2 years ago

Let's say you have User1 and User2 connected on the same Jupyter Server / Tornado instance. When User1 hits an endpoint, I expect the server (or the extension running on top) to be able to discriminate what user is making the request. My understanding is that today the system can not say that.

I think it can: the authentication system sets a cookie in the user's browser after they log in.

minrk commented 2 years ago

My understanding is that today the system can not say that....in your mental model jupyterhub is responsible to say that ?

Yes, it's absolutely JupyterHub's responsibility (or whatever authentication plugin you use that overrides LoginHandler.get_user). When you run under JupyterHub, this already works. The default LoginHandler implementation sets the same "anonymous" username for all authenticated requests, though, because it hasn't yet had a reason to distinguish between connections. We would also need to change our token authentication to generate different tokens for different users to distinguish between users by default.

The system as a whole doesn't discriminate between browser sessions, though, so if you need to distinguish between 'me' in Firefox and 'me' in Safari, that's certainly something Sessions would be needed for, e.g. multiple RTC connections that may have equivalent credentials. Multiple sessions and multiple users are definitely related, but not quite the same thing.

echarles commented 2 years ago

Maybe the following case will help the discussion. As it is today (without going into the session, cookie... technical stuff), I'd like to make sure that users who query the single /me endpoint of that single Server get different responses (so User1 gets id1 and User2 gets id2). My understanding, based on the code and experiments, is that this is not possible today, but I'd be happy to see it in action.

   Server
     |
 +---------+
 |         |
User1    User2

minrk commented 2 years ago

For my understanding of RTC requirements, I definitely think your session-tracking feature makes sense, independent of whether Users are distinguishable.

I take back what I said about user_id always being "anonymous", though. That's only true when auth is disabled. It is set to a random uuid for each separate login cookie that's set. So it would already be the case that every browser has a different UUID for a username with the default implementation. We could certainly switch this to be a more realistically populated random User dict, if that would be useful.

However, in JupyterHub, or any other LoginHandler that implements actual authentication, multiple browsers logged in as the same user would be equivalent, and you would absolutely need a separate 'session' concept to distinguish them.

Still, I think it's important to make clear that two sessions may have the same user, and not conflate the two. We can certainly decide that the /me endpoint should return session info (after #391) in addition to user info, or use a separate endpoint for that, but I think we definitely should not require that two sessions have different Users.

echarles commented 2 years ago

For my understanding of RTC requirements, I definitely think your session-tracking feature makes sense, independent of whether Users are distinguishable.

Thx. I was discussing this Identity API with RTC in mind; I should have made that clear. Are you separating RTC from this, or is it OK to keep including it?

I take back what I said about user_id always being "anonymous", though. That's only true when auth is disabled. It is set to a random uuid for each separate login cookie that's set. So it would already be the case that every browser has a different UUID for a username with the default implementation. We could certainly switch this to be a more realistically populated random User dict, if that would be useful. However, in JupyterHub, or any other LoginHandler that implements actual authentication, multiple browsers logged in as the same user would be equivalent, and you would absolutely need a separate 'session' concept to distinguish them.

I am following the great work JupyterHub is doing, especially the latest features around authorization, but for now I am trying to discuss this without a hard requirement on JupyterHub. Jupyter Server Identity should work nicely with JupyterHub, RTC, and the rest of the world.

Still, I think it's important to make clear that two sessions may have the same user, and not conflate the two. We can certainly decide that the /me endpoint should return session info (after #391) in addition to user info, or use a separate endpoint for that, but I think we definitely should not require that two sessions have different Users.

Web sessions are tricky.

They are used to distinguish and put server information for User1 (running on laptop1 with browser1) vs User2 (running on laptop2 with browser2).

My experience in other frameworks/languages is that User1 (running on laptop1 with browser1) and User1 (running on laptop1 with browser2) - think of connecting with the same credentials to the same server on Chrome and on Firefox - will be assigned 2 different server sessions. This is not much different from what I see when I connect (even with 2 different tabs in Chrome) to the same Google Meet session: you will be seen as 2 different users with the same avatar. If only this duplication/multiplication could be real.... :)

But I would consider this last point (same user with different sessions) an edge case which should be taken / solved for now.

minrk commented 2 years ago

Yes, I don't want to assume JupyterHub for any of this, either. From Jupyter Server's perspective, all JupyterHub does is implement Jupyter Server's declared extension API of LoginHandler.get_user to return a dict representing the current user. For the purposes here, I think it is important to establish that the JupyterHub implementation will not be unique if two browsers are logged in as the same user, which means that it doesn't affect the requirements for #391. We could push that session-uniqueness responsibility to the implementation of get_user (JupyterHub can manage this, and indeed already has a unique session id to work with), or take it on here so that get_user is only responsible for managing the User, but not managing the session. I personally think that's the right way to go. In either case, we have to implement both for the default implementation.

I just want it to be clear that session management is a level above users, and some things are associated with the session, while others are associated with the user. It seems important to not conflate the two.

I'm AOK with deciding that /me returns info about the "user session", including both user and session info (once it exists), with different guarantees about uniqueness, etc.
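To make the user/session distinction concrete, here is a small sketch of a /me payload that carries both. The `User` and `Session` classes, the `me_response` helper, and the response keys are hypothetical; the thread does not settle on a model.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class User:
    """Comes from the auth provider; NOT unique per browser."""

    username: str
    given_name: Optional[str] = None


@dataclass(frozen=True)
class Session:
    """One per login/browser; unique even when the user is the same."""

    user: User
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def me_response(session: Session) -> dict:
    """Return user info and session info side by side, never merged."""
    return {
        "user": asdict(session.user),
        "session": {"id": session.session_id},
    }
```

Two sessions built from the same `User` give identical `"user"` sections and different `"session"` ids, which is the uniqueness guarantee being discussed.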

echarles commented 2 years ago

I am OK with the previous.

I just want to emphasize that in an RTC context, to deliver /me with the minimum info (the userid), you need a multi-user server, and with jupyter-server/tornado, the session mechanism of https://github.com/jupyter-server/jupyter_server/pull/391 delivers that multi-user feature.

I just want it to be clear that session management is a level above users, and some things are associated with the session, while others are associated with the user. It seems important to not conflate the two.

Agreed. The complete levels look more like this (RTC needs multiuser, which needs sessions that deliver user info for each user):

rtc
multiuser
sessions
user

minrk commented 2 years ago

I'm not up to speed on the session storage requirements for RTC. What would get stored in the session-store that is added in #391?

#391 seems to add a lot of per-session logic (arbitrary per-session key-value store). Do you foresee that as a substantial need for RTC, or is the need mostly the presence of unique session ids (since user ids will not be unique)? Storing the user info in the session store doesn't make a lot of sense to me, since that should come from the auth provider.

If it's only the unique session id, then it seems like it could be accomplished with a substantially smaller change:

  1. on login, store a unique session_id cookie, removing the not-always-true assumption that get_user is unique
  2. provide the session id where appropriate (e.g. in /me response or PageConfig data)
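Step 1 above could be sketched with only the standard library, as below. This is not how Jupyter Server or Tornado handles cookies; the cookie name `session_id` and the helper `ensure_session_id` are assumptions made for this example.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Optional, Tuple

SESSION_COOKIE = "session_id"  # hypothetical cookie name


def ensure_session_id(cookie_header: str) -> Tuple[str, Optional[str]]:
    """Return (session_id, set_cookie_value).

    set_cookie_value is None when the request already carried a session
    cookie; otherwise it is the value for a Set-Cookie response header.
    """
    jar = SimpleCookie()
    jar.load(cookie_header)
    if SESSION_COOKIE in jar:
        return jar[SESSION_COOKIE].value, None

    # No session yet: mint an id that is unique per login, regardless of
    # whether get_user returns the same user for several browsers.
    session_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[SESSION_COOKIE] = session_id
    out[SESSION_COOKIE]["httponly"] = True
    out[SESSION_COOKIE]["path"] = "/"
    return session_id, out[SESSION_COOKIE].OutputString()
```

The returned id is what step 2 would then expose, for example in the /me response or in PageConfig data.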

echarles commented 2 years ago

There may be smaller changes like the one you described, but I am looking (maybe I am the only one here...) for a stronger solution where any server extension can benefit from a read/write KV session store to hold the user information it needs.

We are not building an e-commerce website here, but the typical example is an extension maintaining a shopping basket.

For Jupyter, this could be the list of open notebooks with, for each notebook, the connected users and their permissions.

Just put the session infrastructure in place (which BTW can still be considered a small change) and use cases will come.

minrk commented 2 years ago

I think we might be getting a little off track here, but I don't agree that we should build significant new features without specific, concrete uses in mind. I'm not saying those don't exist, I just don't know what they are.

I'm not sure how relevant this discussion is to the identity api, other than the fact that it will make sense to include a session id if/when one exists.

minrk commented 2 years ago

I didn't click through to #122 which has the more detailed discussion, sorry! I see the discussion in more detail, there. In any case, I'll leave the session management discussion to those already participating there.

echarles commented 2 years ago

Sorry, I have given the impression that I wanted a feature first and that the use cases would come later. My thought is that the primary use case for Identity API is RTC and that RTC needs sessions. But it looks like you have a clear view on how to implement that Identity API, so that's fine. Sorry for the distraction.

minrk commented 2 years ago

Sorry, my fault for not catching up with the relevant discussions! I think the main thing to establish here is if this endpoint is somehow redundant or conflicting with your session storage plan. As long as this endpoint still makes sense, and the main interaction is what exactly goes in the model, then I think everything's alright.

hbcarlos commented 2 years ago

Thank you @minrk, for opening up the conversation.

It's becoming apparent that frontends like JupyterLab are going to want to know, at least to some degree, what permissions they have. They also want things like the name, etc. for populating the identity widget.

Yes. Now that we introduced RTC, we have multiple users accessing the same server. This raises the necessity of identity. We need to show every user the identity of everyone with access to the server.

In addition, not every user needs the same level of permission (some will only be able to read documents, others can also write, but only one or a small group will have access to settings, terminals, and kernels). Each user will have a slightly different UI depending on their permission level. For this purpose, we need to know the scopes of the current user in advance. The permission level does not have to be available when launching the server (the config_data object), but we at least need an endpoint where we can request the identity and permissions of the user.

An alternative for the permissions part would be to have an explicit check permissions endpoint that takes a list of permissions to check, and returns which are permitted. This avoids the need to always return all possible permissions, and only returns answers to the questions the client needs to know, and would remove the need for permissions to be part of the identity model.

Even though this approach would be enough for JupyterLab to check the user's scopes, in my opinion, the identity endpoint should return the identity of the user along with their permissions.

For my understanding, what #165 proposes is a hook is_authorized that returns true/false depending on the user's permissions, and you are suggesting the same to check whether the user has the permission that the frontend is requesting. What I do not understand is, to be able to return true/false to the "question" is_authorized, the authorization provider should know the user's scopes, so my question is, why don't we create another hook where the authorization provider returns the list of scopes?

We do not need to declare the list of scopes in advance. We can default to the most restrictive permission if JupyterLab expects a specific scope that is not on this list.
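The "default to the most restrictive permission" idea can be illustrated with a small lookup. The frontend would be written in TypeScript; Python is used here only to match the other snippets in the thread, and the helper name `can` is hypothetical.

```python
from typing import Dict, List


def can(permissions: Dict[str, List[str]], resource: str, action: str) -> bool:
    """Deny by default: a resource missing from the list grants nothing."""
    return action in permissions.get(resource, [])
```

With `{"contents": ["read"]}` as the user's permissions, `can(..., "contents", "read")` is true, while any question about an unlisted resource such as `"terminals"` is answered with false.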

hbcarlos commented 2 years ago

Another topic I wanted to discuss (let me know if this should be in another issue) is the possibility of having an endpoint in Jupyter-Server to request who has access to the server and with which permissions.

During the RTC meeting, I asked about the possibility of having a connected users endpoint in Jupyter-Server. I was wrong. The endpoint that we need is one to know which users have permission to access the server, and also an endpoint to allow users to grant permissions to other users. I believe this is outside the scope of Jupyter-Server and is rather a question that the authorization provider should answer. Is it possible to create another hook, like is_authorized, that allows the authorization provider to implement the logic for these endpoints?

minrk commented 2 years ago

We need to show every user the identity of everyone with access to the server.

Why do we need that? I don't quite follow the need to see information about everyone who might have access.

For my understanding, what #165 proposes is a hook is_authorized that returns true/false depending on the user's permissions, and you are suggesting the same to check whether the user has the permission that the frontend is requesting. What I do not understand is, to be able to return true/false to the "question" is_authorized, the authorization provider should know the user's scopes, so my question is, why don't we create another hook where the authorization provider returns the list of scopes?

Requiring that the user be able to return complete permissions is a substantial change in specification (that doesn't mean it's wrong), because it means all possible scopes for all possible extensions must be knowable, and complete permissions for a given user must be available in a static form.

The implementation as it is now is very simple because is_authorized only needs to be able to return true/false for a given action/resource combination. It doesn't need to know all possible combinations of action/resource, only those that are used, when they are used.

For instance, the default "AlwaysAllowAuthorizer" that preserves the current behavior of the server looks like:

def is_authorized(action, resource):
    return True

whereas to return the corresponding list of permissions as part of the user model would require:

We do not need to declare the list of scopes in advance.

It does not need to be in advance, but it does need to be available to compute the list of permissions when the user model is requested. That effectively makes it a required part of the Jupyter Server Extension API to declare any and all permissions it will use.
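To illustrate why a full permissions list forces extensions to declare their scopes, here is a hedged sketch of a registry. `ScopeRegistry`, `declare`, and `permissions_for` are invented for this example; nothing like this is specified in #165.

```python
from typing import Callable, Dict, List, Set


class ScopeRegistry:
    """Hypothetical place where extensions declare what they will check."""

    def __init__(self) -> None:
        self._declared: Dict[str, Set[str]] = {}

    def declare(self, resource: str, actions: List[str]) -> None:
        self._declared.setdefault(resource, set()).update(actions)

    def permissions_for(
        self, is_authorized: Callable[[str, str], bool]
    ) -> Dict[str, List[str]]:
        """Compute the full permissions dict from declared scopes only."""
        return {
            resource: sorted(a for a in actions if is_authorized(a, resource))
            for resource, actions in sorted(self._declared.items())
        }
```

Any resource an extension checks but never declares is silently missing from the result, which is the gap the declaration requirement would have to close.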

The explicit "check permissions" endpoint, on the other hand, is vastly simpler because it matches calls to is_authorized exactly.