Option to extract additional token in generic OAuth2 authenticator

michaeljfazio commented 2 years ago

Proposed change

I would like to extract an OIDC ID token in auth_state_hook. This seems like it should be straight forward when using the generic authenticator however it is not OIDC friendly. Specifically, it does not provide an easy way to extract the id_token.

Alternative options

Option 1: Allow users to specify additional tokens to be extracted into auth state (e.g. c.LocalGenericOAuthenticator.extra_tokens['id_token'] Option 2. Automatically detect if the token response contains an "id_token" and add it to the auth_state implicitly.

Who would use this feature?

In my case this is useful so that I can take the id_token and use it to register federated users in an identity pool before converting it to temporary AWS credentials server side using something like this auth hook:

async def pre_spawn_hook(spawner: Spawner):
    try:
        auth_state = await spawner.user.get_auth_state()

        id_token = auth_state['id_token']
        decoded = jwt.decode(id_token, options={"verify_signature": False}, audience=c.LocalGenericOIDCAuthenticator.client_id)
        cognito_provider = decoded['iss'][8:]
        identity_pool_id = os.environ.get('IDENTITY_POOL_ID')

        cognito = boto3.client('cognito-identity')
        logins = { cognito_provider: id_token }

        identity_id = cognito.get_id(IdentityPoolId=identity_pool_id, Logins=logins)['IdentityId']
        credentials = cognito.get_credentials_for_identity(IdentityId=identity_id, Logins=logins)['Credentials']

        spawner.environment['AWS_ACCESS_KEY_ID'] = credentials['AccessKeyId']
        spawner.environment['AWS_SECRET_ACCESS_KEY'] = credentials['SecretKey']
        spawner.environment['AWS_SESSION_TOKEN'] = credentials['SessionToken']
    except:
        spawner.log.error(traceback.format_exc())

welcome[bot] commented 2 years ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other other community members to contribute more effectively. welcome You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

manics commented 2 years ago

Do you think we could add the entire token_response as a field of auth_state? https://github.com/jupyterhub/oauthenticator/blob/3bcb7a7293ba99bae16eddb37e3177082a54234c/oauthenticator/generic.py#L138-L151

michaeljfazio commented 2 years ago

That seems sensible also. Certainly simplifies things.

mahic commented 1 year ago

@manics @michaeljfazio The OAuthenticator hasn't had any releases since 15.1. There has been a major refactoring in the subsystem since the last release, and to me it seems as some of the refactorings now include the id_token as well?

@GeorgianaElena @consideRatio can we expect some releases of this soon? It would help us out a lot :-)

GeorgianaElena commented 1 year ago

@mahic, the new release is being tracked in https://github.com/jupyterhub/oauthenticator/issues/585 🚀

benjimin commented 8 months ago

I second this feature request.

For context, OIDC is the widely-used authentication standard built on top of the OAuth authorisation standard, and OIDC offers two separate avenues for retrieving information regarding the user: an ID token and a UserInfo endpoint. The process is:

The JupyterHub hub redirects the user agent (web browser) to the OpenID provider's authorisation (login) endpoint.
The identity provider authenticates the user and redirects their agent back to the hub with an opaque code.
The hub directly contacts the identity provider's token endpoint, and exchanges that code for a bundle of tokens, which will include both the access token and the ID token. (These are each formatted as JWT tokens, that is, as base64url-encoded and cryptographically-signed JSON-formatted collections of claims.)
The hub directly contacts the identity provider's UserInfo endpoint, and exchanges that access token for a simple JSON-formatted collection of claims.

I think JupyterHub should give admins the choice of whether to use claims from the UserInfo endpoint (step 4) versus from the ID token (step 3). In the latter choice, the hub could entirely skip the final network request, but would instead be responsible for decoding (and validating) the ID token to extract its claims.

Some identity providers will supply different sets of claims via each of these two avenues, for example, AWS Cognito currently provides cognito:groups (i.e. the list of groups that the user is a member of) as a claim in both of the tokens but not in the JSON from the UserInfo endpoint. (Note that JupyterHub admins commonly desire to customise profile lists according group membership; an example is even suggested in the Z2JH docs. GenericOAuthenticator also expressly facilitates using group membership to decide login and admin privileges. Also AWS is a popular infrastructure provider.)

Currently GenericOAuthenticator discards the ID token and loads only the UserInfo response into the auth_state (in the "oauth_user" aka "user_info" field, from where it is used e.g. to extract the claim_groups_key field). I think an option to instead parse the ID token should be obligatory for general OIDC support (e.g. #254) and would better support deployments in AWS (see https://github.com/jupyterhub/oauthenticator/pull/708#issuecomment-1840936101). While copying the raw tokens into auth_state could be a useful start, I think this library should definitely handle extracting claims from the ID token itself (to minimise complexity of downstream customisation code and encourage security best practices).

minrk commented 8 months ago

Thanks for the detailed writeup! I think including the id token makes sense, especially for OIDC, and further reinforces that having an actual OIDCAuthenticator which can make the appropriate assumptions that generic OAuth can't is a thing we should have (#254).

nocnokneo commented 6 months ago

I agree with @benjimin's proposal. There could be a few options exposed as a user_info_source config:

USERINFO_ENDPOINT_ONLY (default for backward compatibility)
ID_TOKEN_ONLY
PREFER_USERINFO_ENDPOINT
PREFER_ID_TOKEN

Where the last two options merge the results from the userinfo endpoint and the ID token payload.

benjimin commented 6 months ago

I've created a PR (#725) proposing that if oauthenticator.userdata_url = None (note this is not the default and so conveys an unambiguous choice by the admin to not use any userinfo endpoint) then the ID Token is used, as a direct substitute for a response from a userinfo endpoint (the user_info dict), thus going on to populate claims into the user model stored in auth_state.

Note, It looks like it had already been possible (as of v16 as alluded by @GeorgianaElena) for spawner hooks etc to access the raw id token via auth_state, but additional custom code is required to extract the token claims that way. Also, the identity provider has until now been required to support an OIDC-style userinfo endpoint, because the fields used internally by JupyterHub (such as username and group membership) weren't able to be populated from an id token.

Not wanting to overcook things in absence of use-cases, this PR doesn't introduce any novel merge functionality (apologies @nocnokneo), and tries to keep the configuration complexity to a minimum. It certainly doesn't introduce full OIDC support (discovery mechanisms etc) like #254.

jupyterhub / oauthenticator