jupyterhub / binderhub

Run your code in the cloud, with technology so advanced, it feels like magic!
https://binderhub.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Discussion: Should we support vendor-specific cloud API libraries in BinderHub? #1623

Open manics opened 1 year ago

manics commented 1 year ago

There are a few use cases that benefit from using a cloud-specific library to make API calls, e.g. using AWS boto3 to create an ECR container repository and to obtain a temporary read/write token: https://github.com/jupyterhub/binderhub/pull/1055

Other public cloud registries may benefit from something similar, e.g. Oracle Cloud Infrastructure Registry, which requires the oci library. (There's an autocreate option when pushing new images, but creating the repository in advance allows more control over things like auto-deletion.)

There are probably others, either related to registries or to other things like hooking into cloud notifications.

It's easy to add extras_require entries in setup.py, or to put a new Registry implementation (for example) in a separate Python package, since it's configurable with Traitlets. But what should we include in the container image? Just the ones used by mybinder.org, and encourage everyone else to rebuild their BinderHub container? Or should we include all of them? Or do we take a completely different route and make those vendor-specific API calls via a separate container (going down the microservices route)?
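To illustrate the extras_require option, here is a minimal sketch of how vendor-specific code could guard an optional import so that the base image works without the library. The helper name `require_optional` and the extras name passed to it are hypothetical; BinderHub does not necessarily define either.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency, or fail with an install hint.

    `extra` is a hypothetical extras name, as in `pip install binderhub[ecr]`.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with: pip install 'binderhub[{extra}]'"
        ) from e
```

A registry implementation could call something like `require_optional("boto3", "ecr")` inside its constructor, so the import only happens when that registry class is actually configured.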

betatim commented 1 year ago

The philosophy of BinderHub so far has been to be "vendor agnostic". I think most often this leads to/is interpreted as "lowest common denominator", use the stuff that works equally well (or badly) everywhere.

I'm not familiar with "ECR container repository". I quickly googled it and it suggested "container registry" to me. Setting up a container registry sounds like a one time/setup task, not an ongoing thing that BinderHub does while it is running. Could you explain a bit what you had in mind? For one time setup stuff I think we should describe it in the guide(s). The vendor specific guides are a good example of how they are valuable but also often out of date (which I think is the relevant thing for deciding about "vendor specific" code as well).

Over the last year or so I've become more and more convinced of (and attracted to) the idea that having a plugin system is a great idea. In this case BinderHub would allow plugins to change/augment/extend parts of its behaviour. The advantage of having a plugin system is that anyone (including core maintainers) can extend BinderHub without needing to consider all the things/permissions/consensus of including it in the core. I think it also allows for a lot of creativity, and some kind of "combinatorial explosion" of things your software can do (think iPhone w/o app store (no plugin system) vs iPhone with app store (plugin system)). Maybe something like you have in mind would be a good use case of a plugin system?

Of course creating the "host side" of a plugin system is work and the quality of plugins rises and falls with how well it is done. JupyterHub already has a kinda plugin system for spawners and authenticators, so there is precedent for this working well.
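As a rough sketch of what the "host side" could look like: Python packages can advertise plugins through entry points, which is the mechanism JupyterHub uses to find installed spawners and authenticators. The group name `binderhub.registries` below is an assumption made up for illustration, not something BinderHub is known to define.

```python
from importlib.metadata import entry_points


def discover_plugins(group):
    """Return {name: entry_point} for every installed plugin in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):  # Python 3.10+
        selected = eps.select(group=group)
    else:  # Python 3.8/3.9: a dict mapping group name to entry points
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


# "binderhub.registries" is a hypothetical group name, used for illustration.
plugins = discover_plugins("binderhub.registries")
# Entry points are lazy; calling .load() on one would import the plugin class.
print(sorted(plugins))
```

With this approach, installing a plugin package is enough to make it discoverable by name, without any change to the core.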

I think a plugin system would imply that you need to make your own binderhub image?!

minrk commented 1 year ago

I think you're right that a lot of the great interface-defining work @manics and others have done is getting BinderHub to a level of maturity where it defines the interfaces, and implementations of non-default providers start moving to their own packages. But once you start breaking things up like that, it also starts to make sense to do more versioned releases to better communicate changes and compatibility at the API level.

> I think a plugin system would imply that you need to make your own binderhub image?!

Yes and no - we see this in z2jh: z2jh's default image ships with a common set of plugins (then are they really plugins?), but you can always add more / select versions in a custom image. We still have to decide what's in this default set and what's not, which is a pretty difficult line to draw as everyone asks for their Authenticator to be added so they don't need a custom image.

I know a lot of supply chain folks bristle at the idea of install-at-runtime as a pattern, but I honestly think for plugin purposes that pip install at runtime is a hugely practical way to make small changes to an image without needing to build, host, and maintain a mostly duplicate image.
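A minimal sketch of that install-at-runtime pattern, assuming a startup hook that runs before the application is loaded. Both function names are hypothetical, and pinning exact versions in the package list is advisable.

```python
import subprocess
import sys


def pip_install_command(packages):
    """Build a pip command that targets the running interpreter."""
    return [sys.executable, "-m", "pip", "install", "--no-cache-dir", *packages]


def install_at_startup(packages):
    """Install extra packages before the application starts; no-op if empty."""
    if packages:
        subprocess.check_call(pip_install_command(packages))
```

Using `sys.executable -m pip` rather than a bare `pip` ensures the packages land in the same environment that runs the application.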

manics commented 1 year ago

> I'm not familiar with "ECR container repository". I quickly googled it and it suggested "container registry" to me. Setting up a container registry sounds like a one time/setup task, not an ongoing thing that BinderHub does while it is running.

@betatim ECR (and some other container registries) don't support pushing to registry.example.org/account/new-repository-that-doesnt-exist. Instead, you first need to create the repository using a vendor-specific API call, and then you can push to registry.example.org/account/new-repository-that-doesnt-exist/<any-image-tags>. ECR has an additional complication: the registry login token is temporary and should be renewed at regular intervals, which requires another AWS API call.
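For ECR, those two vendor-specific calls could look roughly like the sketch below. It assumes `ecr` behaves like `boto3.client("ecr")`; the helper names are hypothetical and this is not the code from the linked pull request.

```python
import base64


def ensure_repository(ecr, name):
    """Create the repository; return False if it already existed.

    `ecr` is expected to behave like `boto3.client("ecr")`.
    """
    try:
        ecr.create_repository(repositoryName=name)
        return True
    except ecr.exceptions.RepositoryAlreadyExistsException:
        return False


def registry_login(ecr):
    """Return (username, password, expires_at) from a temporary ECR token."""
    data = ecr.get_authorization_token()["authorizationData"][0]
    # The token is base64("<username>:<password>").
    decoded = base64.b64decode(data["authorizationToken"]).decode("utf-8")
    username, password = decoded.split(":", 1)
    return username, password, data["expiresAt"]
```

Because the token expires, a long-running process would need to call something like `registry_login` again before `expires_at` is reached rather than caching the credentials indefinitely.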

I've had a go at implementing the microservice model with Oracle Cloud's registry: https://github.com/manics/oracle-container-repositories-svc

Example binderhub config stanza:

import json
from tornado import httpclient
from traitlets import Unicode
from binderhub.registry import DockerRegistry

class ExternalRegistryHelper(DockerRegistry):

    service_url = Unicode(
        "http://oracle-container-repositories-svc:8080",
        allow_none=False,
        help="The URL of the registry helper micro-service.",
        config=True,
    )

    auth_token = Unicode(
        "secret-token",
        help="The auth token to use when accessing the registry helper micro-service.",
        config=True,
    )

    async def get_image_manifest(self, image, tag):
        """
        If the container repository exists use the standard Docker Registry API
        to check for the image tag.
        Otherwise create the container repository.

        The full registry image URL has the form:
        CONTAINER_REGISTRY/OCIR_NAMESPACE/OCIR_IMAGE_NAME:TAG
        but the BinderHub image is OCIR_NAMESPACE/OCIR_IMAGE_NAME
        so we need to remove the OCIR_NAMESPACE component
        """
        client = httpclient.AsyncHTTPClient()
        image = image.split("/", 1)[1]
        repo_url = f"{self.service_url}/repo/{image}"
        headers = {"Authorization": f"Bearer {self.auth_token}"}

        self.log.debug(f"Checking whether repository exists: {repo_url}")
        try:
            repo = await client.fetch(repo_url, headers=headers)
            repo_exists = True
        except httpclient.HTTPError as e:
            if e.code == 404:
                repo_exists = False
            else:
                raise

        if repo_exists:
            repo_json = json.loads(repo.body.decode("utf-8"))
            self.log.debug(f"Repository exists: {repo_json}")
            return await super().get_image_manifest(image, tag)
        else:
            self.log.debug(f"Creating repository: {repo_url}")
            await client.fetch(repo_url, headers=headers, method="POST", body="")
            return None

c.BinderHub.registry_class = ExternalRegistryHelper

This only requires standard HTTP GET/POST calls and headers; the complex Oracle Cloud auth and API calls are hidden in the microservice.
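To make the contract that the config stanza relies on concrete, here is an in-memory sketch of the service side. It is an illustration only and not the implementation in the linked repository: the function name, the status code for a bad token, and the response body shape are all assumptions.

```python
import json


def handle_request(method, path, authorization, repos, token="secret-token"):
    """Return (status, body) for the /repo/<name> contract used above.

    `repos` is a dict standing in for the vendor API; a real service would
    call the cloud SDK here instead.
    """
    if authorization != f"Bearer {token}":
        return 403, ""
    prefix = "/repo/"
    if not path.startswith(prefix) or len(path) == len(prefix):
        return 404, ""
    name = path[len(prefix):]
    if method == "GET":
        if name not in repos:
            return 404, ""  # the client treats 404 as "repository missing"
        return 200, json.dumps(repos[name])
    if method == "POST":
        repos.setdefault(name, {"name": name})  # creating twice is harmless
        return 200, json.dumps(repos[name])
    return 405, ""
```

The BinderHub side only ever sees a 200 or a 404 and a bearer token, so swapping the vendor behind the service would not require changing the registry class.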

betatim commented 1 year ago

That looks nice. All the vendor specific stuff is in one place, and the way to extend BinderHub is also not too ugly. A downside is that creating what you created requires quite a bit of knowledge of how BinderHub works, so it is probably beyond the average user's skills. Can/should we bundle the microservice in BinderHub's repo to make it (and others like it) more discoverable? Have a repo tag that is used to create a list in the docs?

manics commented 1 year ago

I'm going to see if I can get ECR actually working. If it does then I think incorporating some of the work into BinderHub will be helpful to admins: