User hook for the build endpoint #1117

Open jtpio opened 4 years ago

Proposed change

This issue is related to the idea mentioned in this Discourse topic: https://discourse.jupyter.org/t/binderhub-with-private-gitlab-and-user-scopes/3502

Looking at the code, it seems there is currently no hook or option that could be set to tweak the behavior of the /build endpoint, or more generally of the builder.

The idea is to be able to implement fine-grained access control to BinderHub based on the JupyterHub authenticator used to authenticate users.

The use case is summarized as follows:

Users authenticate to their BinderHub using their private GitLab instance as the authenticator
This means that each BinderHub user now corresponds to a GitLab user
When they enter the repository in the input field, they can only build repositories they have access to
If they don't have access, the error "Could not resolve ref for my-project/repo. Double check your URL." would ideally be shown
This would happen before triggering a new build
[Optional] The UI only shows GitLab in the dropdown menu (after configuring repo_providers). This looks like it should be solved by https://github.com/jupyterhub/binderhub/pull/1038 :tada:

Alternative options

An alternative option might be to add an extra build handler to the main app, and change the frontend to use that endpoint instead.

However, this adds a lot of complexity for the BinderHub admin, as it would require maintaining custom Docker images and helm charts with these changes.

Who would use this feature?

Those who want a custom BinderHub setup where access control follows the permissions users already have in the service behind the JupyterHub authenticator (GitLab, GitHub).

(Optional): Suggest a solution

Provided that an access token was generated according to: https://binderhub.readthedocs.io/en/latest/zero-to-binderhub/setup-binderhub.html#accessing-private-repositories

The token is for a binderhub user that has read-only access to all repositories.

And the token set as:

config:
  GitLabRepoProvider:
    private_token: "<access token>"

At the moment it's possible to control the launch behavior by providing the following snippet to the helm chart config:

https://github.com/jupyterhub/binderhub/blob/b6446b12b30f741d9e82b7aec1498ede4776cd79/helm-chart/binderhub/values.yaml#L66-L119

However users can still trigger a build to a repository they do not have access to.

It looks like this could be implemented by providing a custom RepoProvider (set in the helm config values, possibly derived from an existing one).

But it would require some user specific information to be passed to the RepoProvider to be able to decide whether or not it is possible to resolve the ref for that user, probably somewhere around this line:

https://github.com/jupyterhub/binderhub/blob/72bcb59cf956f53a07f0d4b45f12cc6c1257c6cf/binderhub/builder.py#L251

A custom hook similar to the pre_spawn_hook or user_redirect_hook in JupyterHub could also help.

Or how about having a pre_build_hook, similar to the existing pre_launch_hook?

https://github.com/jupyterhub/binderhub/blob/72bcb59cf956f53a07f0d4b45f12cc6c1257c6cf/binderhub/launcher.py#L67-L78

The pre_build_hook could then perform some API requests to GitHub / GitLab to check if a user has access to a specific repo.
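To make the access check above more concrete, here is a minimal sketch of what such a request could look like for GitLab. It assumes the hook has the user's own GitLab token at hand, and it relies on the GitLab v4 projects endpoint answering 200 for a project the token can read and 404 otherwise. The names `gitlab_project_url`, `user_can_read` and the injected `fetch` callable are hypothetical and not part of BinderHub; `fetch` stands in for whatever HTTP client the deployment uses.

```python
from typing import Awaitable, Callable, Dict
from urllib.parse import quote

# Hypothetical HTTP helper: takes (url, headers) and returns the status code.
Fetch = Callable[[str, Dict[str, str]], Awaitable[int]]


def gitlab_project_url(gitlab_url: str, project_path: str) -> str:
    """Build the GitLab v4 API URL for a project from its namespace/path."""
    # GitLab accepts the URL-encoded "namespace/project" path as project id.
    encoded = quote(project_path, safe="")
    return f"{gitlab_url.rstrip('/')}/api/v4/projects/{encoded}"


async def user_can_read(
    fetch: Fetch, gitlab_url: str, project_path: str, user_token: str
) -> bool:
    """Return True if the token's owner can see the project."""
    url = gitlab_project_url(gitlab_url, project_path)
    # Assumption: user_token is the user's own OAuth or access token.
    status = await fetch(url, {"Authorization": f"Bearer {user_token}"})
    return status == 200
```

A hook could call `user_can_read` and refuse the build when it returns False, which would happen before any build is triggered.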

I am for pre_build_hook:

You could check anything (similar to the check that the spec is valid) before the build process starts, so the pre_build_hook should probably be called just before these lines

https://github.com/jupyterhub/binderhub/blob/72bcb59cf956f53a07f0d4b45f12cc6c1257c6cf/binderhub/builder.py#L234-L240

in hook you could reach user data easily (probably) with user_model = self.hub_auth.get_user(self)

Yes that would be the idea :+1:

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/binderhub-with-private-gitlab-and-user-scopes/3502/5

One thing we have to be careful about/make clear to the admin is the difference between the auth token obtained for the user and the one that currently exists which is for the whole BinderHub.

The other thing is passing around/making accessible the user's token at all the right places.

This would be a nice new feature!

Maybe the handler could be passed to the pre_build_hook directly?

Something like the following:

pre_build_hook = self.settings['pre_build_hook']
if pre_build_hook:
    await maybe_future(pre_build_hook(self))

Then it's up to the user to decide what to do with the build handler.

Similar to the way the handler is made available to the spawner in JupyterHub: https://github.com/jupyterhub/jupyterhub/blob/76c9111d80660e93578f80dbe441cfb702c1b207/jupyterhub/user.py#L542-L544
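As an illustration of the "up to the user" part, the following is a standalone sketch of how a hook receiving the handler might veto a build. It does not use BinderHub or Tornado: `FakeBuildHandler`, its `username` and `repo` attributes, `BuildForbidden`, `maybe_await` and `only_own_namespace` are all hypothetical stand-ins, and `maybe_await` only mimics what `maybe_future` does in the snippet above.

```python
import inspect


class BuildForbidden(Exception):
    """Hypothetical error a hook raises to stop a build."""


async def maybe_await(obj):
    """Await obj if it is awaitable, so hooks may be sync or async."""
    return await obj if inspect.isawaitable(obj) else obj


class FakeBuildHandler:
    """Stand-in for the build handler; attribute names are assumptions."""

    def __init__(self, settings, username, repo):
        self.settings = settings
        self.username = username
        self.repo = repo  # assumed to look like "namespace/project"


async def run_pre_build_hook(handler):
    """Call the configured hook, if any, with the handler itself."""
    hook = handler.settings.get("pre_build_hook")
    if hook:
        await maybe_await(hook(handler))


def only_own_namespace(handler):
    """Example policy: users may only build repos in their own namespace."""
    namespace = handler.repo.split("/", 1)[0]
    if namespace != handler.username:
        raise BuildForbidden(f"{handler.username} may not build {handler.repo}")
```

Passing the whole handler keeps the hook signature stable: a policy that later needs the request headers or the hub auth object would not require a new parameter.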

Maybe the handler could be passed to the pre_build_hook directly?

Yes, that's also what I thought. I think the same is done in pre_launch_hook, where the launcher itself is the first parameter.

Btw, after reading @betatim's comment, it is not clear to me: for your case this won't require any additional token for each user, right?

This wouldn't require an additional token. In the hook we could for example retrieve the user name with the snippet you posted above:

in hook you could reach user data easily (probably) with user_model = self.hub_auth.get_user(self)

Although this would not give the user auth_state I think? But the provided git_credentials token could still be used to make HTTP requests and check the user access using the username.

Although this would not give the user auth_state I think?

I am not sure, but yes, I think the user_model dict doesn't contain auth_state. But by using the username you can make a request to the JupyterHub API (users/<username>) and get the user data, which should contain the auth_state.

There's an open issue to make auth_state available: https://github.com/jupyterhub/jupyterhub/issues/1704 @bitnik Are you saying it's already possible?

it must be available for admin users: https://github.com/jupyterhub/jupyterhub/blob/76c9111d80660e93578f80dbe441cfb702c1b207/jupyterhub/apihandlers/users.py#L126-L138

and because the binder service has admin access to the hub API, this should work for @jtpio's case.

Thanks @manics and @bitnik for the context and pointers!

If the binder user is an admin, then there could indeed be a request to the hub API to retrieve the user's auth_state in the pre_build_hook.

Just tested and we can indeed retrieve the user auth_state :+1:

For example in the pre_launch_hook with:

async def pre_launch_hook(launcher, image, username, server_name, repo_url):
    user = await launcher.get_user_data(username)
    auth_state = user.get('auth_state', None)

With a pre_build_hook, we could probably achieve a similar thing with:

import json

async def pre_build_hook(handler):
    # look up the authenticated user through the handler's hub auth
    user_model = handler.hub_auth.get_user(handler)
    username = user_model['name']

    # api_request is not defined here: ideally this would reuse the
    # api_request or get_user_data methods from the launcher
    resp = await api_request(f'users/{username}', method='GET')
    user = json.loads(resp.body.decode('utf-8'))
    auth_state = user.get('auth_state', None)
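Since `api_request` is left undefined above, here is a rough standard-library sketch of the request it would need to make. It assumes the JupyterHub REST API is reachable at a base URL such as `http://hub:8081/hub/api` and that the service token is sent as `Authorization: token <token>`. `build_user_request` and `extract_auth_state` are hypothetical helper names, and as discussed above `auth_state` would only be present when the token has admin rights.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import Request


def build_user_request(hub_api_url: str, api_token: str, username: str) -> Request:
    """Prepare GET <hub_api_url>/users/<username> with the service token."""
    url = f"{hub_api_url.rstrip('/')}/users/{quote(username, safe='')}"
    headers = {"Authorization": f"token {api_token}"}
    return Request(url, headers=headers, method="GET")


def extract_auth_state(body: bytes) -> Optional[dict]:
    """Pull auth_state out of the user model returned by the hub API."""
    user = json.loads(body.decode("utf-8"))
    # The key is absent (or null) when the hub does not expose auth_state.
    return user.get("auth_state")
```

The prepared request can be sent with `urllib.request.urlopen` or translated to the async client already used by the launcher; the sketch only shows the URL, header and response handling.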

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/binderhub-with-private-gitlab-and-user-scopes/3502/6

Consider a use case where we want to run an authenticated BinderHub instance whose rights for cloning private repositories match those of an underlying GitLab instance (with GitLab also providing authentication). If I understand correctly, a pre_build_hook would still require a single token able to clone all private repositories within the GitLab instance?

Instead, in an authenticated BinderHub, it might be desirable to assume the identity of the authenticated user for cloning private repositories, if only for the user experience (this would remove the need to add a technical "binderhub" user to the GitLab instance and to make it a member of each project to be built).

Would there be a solution that removes the need for a single user/token that has (at least read) access to the whole set of private repositories within a GitLab instance, while being minimally disruptive to the existing BinderHub model?
