jupyterhub / binderhub

Run your code in the cloud, with technology so advanced, it feels like magic!
https://binderhub.readthedocs.io
BSD 3-Clause "New" or "Revised" License
2.56k stars 390 forks source link

ACR support damaged by Gitlab registry change #1620

Open wallyhall opened 1 year ago

wallyhall commented 1 year ago

Bug description

I believe ACR support has been damaged by the Gitlab registry change in commit https://github.com/jupyterhub/binderhub/commit/f442f326320eaac68e08b663cac17055b615731a

Specifically this line:

## binderhub/registry.py:194

                url_concat(self.token_url, {"scope": "repository:{}:pull".format(image)}),
                url_concat(self.token_url, {
                    "scope": "repository:{}:pull".format(image),
                    "service": "container_registry"   # <-- this.  For ACRs, the documentation simply says inject what the specific CR nedes via token_url config ... for Gitlab, it's been hard-coded (and now breaks the instructions for ACRs).  That's my theory.
                }),

Expected behaviour

When following a Binderhub link, if a pre-built/cached image already exists in the ACR, I expect Binderhub to pull it, not rebuild from scratch.

Actual behaviour

Binderhub runs repo2docker and builds a new image, pushing it to the ACR.

The logs contain the following errors:

[I 230111 09:52:26 launcher:257] Starting server for user ****** with image ******.azurecr.io/******/******
[I 230111 09:52:29 builder:744] Launched ****** in 3s
[I 230111 09:53:02 log:135] 200 GET /v2/git/******/2cd69cb92c9fa36877786c4672158f343e2d1a79?urlpath=******.ipynb (anonymous@******) 2.64ms
[E 230111 09:53:02 builder:394] Tornado HTTP Timeout error: Failed to get image manifest for ******.azurecr.io/******/******:2cd69cb92c9fa36877786c4672158f343e2d1a79
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/binderhub/builder.py", line 388, in get
        image_manifest = await self.registry.get_image_manifest(
      File "/usr/local/lib/python3.9/site-packages/binderhub/registry.py", line 263, in get_image_manifest
        token = await self._get_token(
      File "/usr/local/lib/python3.9/site-packages/binderhub/registry.py", line 221, in _get_token
        auth_resp = await client.fetch(auth_req)
    tornado.httpclient.HTTPClientError: HTTP 400: Bad Request

I believe this is due to the token_url configured (shown below) being partially overwritten - specifically the ?service= by the linked git commit above...

config:
  DockerRegistry:
    token_url: https://********.azurecr.io/oauth2/token?service=********.azurecr.io

(I suspect it is becoming https://********.azurecr.io/oauth2/token?service=container_registry.)

How to reproduce

Follow the instructions in Binderhub's installation docs for an ACR, and attempt to launch the same repository multiple times.

Observe the logs (and behaviour).

Your personal set up

Have installed via Helm:

config:
  BinderHub:
    hub_url: https://hub.binder.*********
    hub_url_local: http://10.100.100.122
    image_prefix: *********.azurecr.io/*********/binder-prod-
    use_registry: true
    build_image: *********.azurecr.io/repo2docker-ssh:latest
  DockerRegistry:
    token_url: https://*********.azurecr.io/oauth2/token?service=*********.azurecr.io
  RateLimiter:
    period_seconds: 5
  KubernetesBuildExecutor:
    # this is vanilla repo2docker with the openssh client installed.
    build_image: *********.azurecr.io/repo2docker-ssh:latest

extraVolumes:
  - name: pvc-ssh-config
    persistentVolumeClaim:
      claimName: pvc-ssh-config
extraVolumeMounts:
  - name: pvc-ssh-config
    mountPath: /root/.ssh

imageBuilderType: pink

jupyterhub:
  hub:
    services:
      binder:
        apiToken: *********
  imagePullSecret:
    create: true
    password: *********
    registry: *********.azurecr.io
    username: *********
  proxy:
    secretToken: *********

registry:
  password: *********
  url: https://*********.azurecr.io
  username: *********

We are using Helm chart version 1.0.0-0.dev.git.2937.h0f65f33.

welcome[bot] commented 1 year ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other other community members to contribute more effectively. welcome You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

manics commented 1 year ago

Maybe we should make _get_image_basename_and_tag part of the Registry class so it can be specialised for particular registry implementations instead of trying to have it work with everything.

manics commented 1 year ago

Can you suggest a fullname,basename,tag tuple we could add to this test? https://github.com/jupyterhub/binderhub/blob/31149279def7e0183c6b4be76a7b1e5b1d016361/binderhub/tests/test_builder.py#L6-L37 You can anonymise it, but ideally make it as close as possible to a real case.

wallyhall commented 1 year ago

Can you suggest a fullname,basename,tag tuple we could add to this test?

I think this would be correct as an ACR example:

         ( 
             "myrepo.azurecr.io/myproject/prefix-myimage:latest", 
             "myproject/prefix-myimage", 
             "latest", 
         ), 

I'm not sure (unless the proposed _get_image_basename_and_tag will include dealing with the retrieval of authentication tokens?) this will cover what I suspect the specific issue is though.

In the logs, I can see a 400/Bad Request being returned by Azure when in the _get_token method of registry.py:

[E 230111 09:53:02 builder:394] Tornado HTTP Timeout error: Failed to get image manifest for ******.azurecr.io/******/******:2cd69cb92c9fa36877786c4672158f343e2d1a79
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/binderhub/builder.py", line 388, in get
        image_manifest = await self.registry.get_image_manifest(
      File "/usr/local/lib/python3.9/site-packages/binderhub/registry.py", line 263, in get_image_manifest
        token = await self._get_token(
      File "/usr/local/lib/python3.9/site-packages/binderhub/registry.py", line 221, in _get_token
        auth_resp = await client.fetch(auth_req)
    tornado.httpclient.HTTPClientError: HTTP 400: Bad Request

I've been able to reproduce that in Postman using the following GET request:

https://binderhubacrpublic.azurecr.io/oauth2/token?service=binderhubacrpublic.azurecr.io&service=container_registry&scope=repository:IMAGE:pull

The URL has two service= query parameters, which I'm speculating is the result of the commit I linked in the description above. My speculation of why this is happening is based on the (confirmed) behaviour of tornado.httputil.url_concat as used inside registry.py / _get_token, and Binderhub documentation for using ACRs saying you must specify ?service=<repo>:

https://binderhub.readthedocs.io/en/latest/zero-to-binderhub/setup-binderhub.html#id3 (section 3.3.3):

```
token_url: "https://<ACR_NAME>.azurecr.io/oauth2/token?service=<ACR_NAME>.azurecr.io"
```

A "fix", I think, would be to drop the offending line from registry.py as shown below and instruct users of Gitlab to specify token_url=https://.../?service=container_registry (as like we're instructed to do for ACRs in the documentation linked above):

## binderhub/registry.py:194

                url_concat(self.token_url, {"scope": "repository:{}:pull".format(image)}),
                url_concat(self.token_url, {
                    "scope": "repository:{}:pull".format(image)
   #### problematic line removed ####
                }),

I hope that makes some semblance of sense. My brain is fried today!