nextstrain / cli

The Nextstrain command-line interface (CLI)—a program called nextstrain—which aims to provide a consistent way to run and visualize pathogen builds and access Nextstrain components like Augur and Auspice across computing environments such as Docker, Conda, and AWS Batch.
https://docs.nextstrain.org/projects/cli/
MIT License
27 stars 20 forks source link

Docker runner assumes the Docker Hub (docker.io) registry #279

Open tsibley opened 1 year ago

tsibley commented 1 year ago

Current Behavior

During update, the Docker runner assumes all images come from the Docker Hub (docker.io) registry.

Despite that, Docker images from other registries still mostly work unless a tag-less image name is given or a build-* tag is given. Either case triggers a different code path:

https://github.com/nextstrain/cli/blob/babc215092465758bf3df88ccf3eaae54add47c3/nextstrain/cli/runner/docker.py#L434-L440

and nextstrain.cli.runner.docker._update ends up trying to enumerate the build-* tags of the non-docker.io image by making requests to the docker.io registry. This understandably results in a 400 Bad Request from the latter. Oops.

Originally discovered in https://github.com/nextstrain/docker-base/pull/148#issuecomment-1539203279.

Expected behavior

Ideally the update code would fetch the tags available from the correct registry, but this is often complicated by forbidding of anonymous access and other authz issues.

As a temporary workaround, perhaps we could simply not support updates from other registries for now. This would involve detecting the registry correctly and excluding those which aren't docker.io from the build-* tag handling.

How to reproduce

The manylinux image here is used only as an example. Any image from a registry other than docker.io will have the same issue.

$ NEXTSTRAIN_HOME=$(mktemp -dt) NEXTSTRAIN_DOCKER_IMAGE=quay.io/pypa/manylinux_2_28_x86_64 nextstrain setup docker
Setting up docker…
Traceback (most recent call last):
  File "/home/tom/nextstrain/cli/.venv/bin/nextstrain", line 8, in <module>
    sys.exit(main())
  File "/home/tom/nextstrain/cli/nextstrain/cli/__main__.py", line 19, in main
    return cli.run( argv[1:] )
  File "/home/tom/nextstrain/cli/nextstrain/cli/__init__.py", line 36, in run
    return opts.__command__.run(opts)
  File "/home/tom/nextstrain/cli/nextstrain/cli/console.py", line 36, in decorated
    return f(*args, **kwargs)
  File "/home/tom/nextstrain/cli/nextstrain/cli/command/setup.py", line 59, in run
    setup_ok = opts.runner.setup(dry_run = opts.dry_run, force = opts.force)
  File "/home/tom/nextstrain/cli/nextstrain/cli/runner/docker.py", line 166, in setup
    if not setup_image(dry_run, force):
  File "/home/tom/nextstrain/cli/nextstrain/cli/runner/docker.py", line 186, in setup_image
    update_ok = _update(dry_run)
  File "/home/tom/nextstrain/cli/nextstrain/cli/runner/docker.py", line 308, in _update
    latest_image  = latest_build_image(current_image)
  File "/home/tom/nextstrain/cli/nextstrain/cli/runner/docker.py", line 436, in latest_build_image
    build_tags = sorted(filter(is_build_tag, tags(repository)))
  File "/home/tom/nextstrain/cli/nextstrain/cli/runner/docker.py", line 428, in tags
    "Authorization": "Bearer %s" % auth_token(repository),
  File "/home/tom/nextstrain/cli/nextstrain/cli/runner/docker.py", line 423, in auth_token
    return GET(url, params = params).json().get("token")
  File "/home/tom/nextstrain/cli/nextstrain/cli/runner/docker.py", line 415, in GET
    response.raise_for_status()
  File "/home/tom/nextstrain/cli/.venv/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://auth.docker.io/token?scope=repository%3Aquay.io%2Fpypa%2Fmanylinux_2_28_x86_64%3Apull&service=registry.docker.io
tsibley commented 1 year ago

Ah, this issue extends to any image name with an explicit registry in it, including docker.io when explicitly specified, because we treat it as part of the image name when constructing the API requests.

tsibley commented 1 year ago

The best thing to do here might be to (finally) switch to using docker-py instead of directly making the API requests (and potentially down the road, instead of invoking docker directly). The docker-py library will handle registries correctly by interfacing with the Docker daemon and existing config (I believe).

tsibley commented 1 year ago

Oh, hmm. Not quite with docker-py, as it doesn't have much in the way of support for interacting with registries. It has great support for interacting with the Docker daemon, which in turn can interact with registries, but not do things like list the tags for an image repository. On the command line, I'd use something like Skopeo, e.g.:

docker run \
  --rm \
  --interactive \
  --tty \
  --network host \
  -v "$HOME"/.docker/config.json:/config.json:ro \
    quay.io/skopeo/stable:v1 list-tags \
      --authfile /config.json \
      docker://docker.io/nextstrain/base

Ugh. I guess we could shell out to Skopeo inside a Docker container to get the tags?

tsibley commented 1 year ago

Preliminary patch for this which I'll turn into a proper PR eventually.

diff --git a/nextstrain/cli/runner/docker.py b/nextstrain/cli/runner/docker.py
index 0ffe971..b0f4733 100644
--- a/nextstrain/cli/runner/docker.py
+++ b/nextstrain/cli/runner/docker.py
@@ -409,6 +409,15 @@ def latest_build_image(image_name: str) -> str:

     >>> latest_build_image("nextstrain/base")
     'nextstrain/base:build-...'
+
+    When a registry other than docker.io (Docker Hub) is present, *image_name*
+    is passed through unchanged.
+
+    >>> latest_build_image("ghcr.io/nextstrain/base")
+    'ghcr.io/nextstrain/base'
+
+    >>> latest_build_image("ghcr.io/nextstrain/base:build-20220215T000459Z")
+    'ghcr.io/nextstrain/base:build-20220215T000459Z'
     """
     def GET(url, **kwargs):
         response = requests.get(url, **kwargs)
@@ -429,13 +438,17 @@ def latest_build_image(image_name: str) -> str:
         }
         return GET(url, headers = headers).json().get("tags", [])

-    repository, tag = split_image_name(image_name, implicit_latest = False)
+    registry, repository, tag = split_image_name(image_name, implicit_latest = False, split_registry = True)

     if not tag or is_build_tag(tag):
-        build_tags = sorted(filter(is_build_tag, tags(repository)))
+        if registry == "docker.io":
+            build_tags = sorted(filter(is_build_tag, tags(repository)))

-        if build_tags:
-            return repository + ":" + build_tags[-1]
+            if build_tags:
+                # Intentionally omit the default registry
+                return repository + ":" + build_tags[-1]
+        else:
+            warn(f"Image registry is not docker.io (Docker Hub); checking for newer 'build-…' tags is unsupported.")

     return image_name

diff --git a/nextstrain/cli/util.py b/nextstrain/cli/util.py
index 8e09c54..2b20c28 100644
--- a/nextstrain/cli/util.py
+++ b/nextstrain/cli/util.py
@@ -500,14 +500,14 @@ def byte_quantity(quantity: str) -> int:

 @overload
-def split_image_name(name: str, implicit_latest: Literal[True] = True) -> Tuple[str, str]:
+def split_image_name(name: str, implicit_latest: Literal[True] = True, split_registry: Literal[False] = False) -> Tuple[str, str]:
     ...

 @overload
-def split_image_name(name: str, implicit_latest: Literal[False]) -> Tuple[str, Optional[str]]:
+def split_image_name(name: str, implicit_latest: Literal[False], split_registry: Literal[True]) -> Tuple[str, str, Optional[str]]:
     ...

-def split_image_name(name: str, implicit_latest: bool = True) -> Tuple[str, Optional[str]]:
+def split_image_name(name: str, implicit_latest: bool = True, split_registry: bool = False) -> Union[Tuple[str, Optional[str]], Tuple[str, str, Optional[str]]]:
     """
     Split the Docker image *name* into a (repository, tag) tuple.

@@ -522,13 +522,49 @@ def split_image_name(name: str, implicit_latest: bool = True) -> Tuple[str, Opti

     >>> split_image_name("nextstrain/base:latest", implicit_latest = False)
     ('nextstrain/base', 'latest')
+
+    Or optionally into a (registry, repository, tag) tuple.
+
+    >>> split_image_name("nextstrain/base", split_registry = True)
+    ('docker.io', 'nextstrain/base', 'latest')
+
+    >>> split_image_name("ghcr.io/nextstrain/base", split_registry = True)
+    ('ghcr.io', 'nextstrain/base', 'latest')
+
+    >>> split_image_name("ghcr.io/nextstrain/base", split_registry = True, implicit_latest = False)
+    ('ghcr.io', 'nextstrain/base', None)
     """
     if ":" in name:
         repository, tag = name.split(":", maxsplit = 2)
     else:
         repository, tag = name, "latest" if implicit_latest else None

-    return (repository, tag)
+    if not split_registry:
+        return (repository, tag)
+
+    # Default registry is Docker Hub, which has an older alias.  While
+    # registry-1.docker.io and registry.hub.docker.com are also working
+    # domains, they aren't treated by Docker as equivalent to the default.
+    default_registry = "docker.io"
+    default_aliases = {"index.docker.io"}
+
+    registry = default_registry
+
+    # Rules for determining if the first part is a registry domain or not.
+    # <https://github.com/distribution/distribution/blob/f7717b78/reference/normalize.go#L123-L140>
+    if "/" in repository:
+        parts = repository.split("/", 1)
+
+        if ("." in parts[0]
+        or ":" in parts[0]
+        or parts[0] == "localhost"
+        or parts[0] != parts[0].lower()):
+            registry, repository = parts
+
+            if registry in default_aliases:
+                registry = default_registry
+
+    return (registry, repository, tag)

 def glob_matcher(patterns: Sequence[str]) -> Callable[[Union[str, Path]], bool]: