Giant Swarm Product Roadmap (giantswarm/roadmap)
https://github.com/orgs/giantswarm/projects/273

All `zot` usage scenarios #3460

Closed by piontec 1 month ago

piontec commented 3 months ago

Definition of done list

### Tasks
- [x] Make sure that the cases below work and are supported by all the involved components
- [x] Support local registry cache in `cluster` shared chart: <https://github.com/giantswarm/cluster/releases/tag/v0.32.0>
- [x] Support local registry cache in `cluster-aws` from: <https://github.com/giantswarm/cluster-aws/releases/tag/v0.79.0>
- [x] Support local registry cache in `cluster-azure` from: <https://github.com/giantswarm/cluster-azure/releases/tag/v0.13.0>
- [x] Support local registry cache in `cluster-cloud-director` from: <https://github.com/giantswarm/cluster-cloud-director/releases/tag/v0.54.0>
- [x] Support local registry cache in `cluster-vsphere` from: <https://github.com/giantswarm/cluster-vsphere/releases/tag/v0.54.0>
- [x] Authenticate our default zot in pull-through mode on MCs with `gsoci.azurecr.io`, as a defense against resource usage attacks on `gsoci` (we might need to disable anonymous access on `gsoci`) (See: https://github.com/giantswarm/roadmap/issues/3460#issuecomment-2191858665)
- [x] Create basic metrics, alerts and related ops recipes (https://github.com/giantswarm/prometheus-rules/pull/1257 + https://github.com/giantswarm/giantswarm/pull/31092, available in: https://github.com/giantswarm/prometheus-rules/releases/tag/v4.4.0)
- [x] Create/update internal docs explaining how the on-MC-pull-through `zot` works, how it interacts with upstream registries and how it's configured in the MC and in WCs (https://intranet.giantswarm.io/docs/dev-and-releng/zot/overview/)
- [x] Create public docs (https://docs.giantswarm.io/tutorials/registry/zot/) 
- [x] Figure out a strategy to upgrade existing clusters to use on-MC-pull-through zot
- [x] Discuss and decide the approach to `gsociprivate` (See: https://intranet.giantswarm.io/docs/dev-and-releng/zot/future-development/)
- [x] How to make new WCs use the on-MC-pull-through cache by default (See: https://github.com/giantswarm/roadmap/issues/3460#issuecomment-2202162469)
- [x] Support in `mc-bootstrap` to create the default setup for new CAPx MCs (https://github.com/giantswarm/mc-bootstrap/pull/939/)

Zot use cases

We have been discussing the different modes in which we want to use zot. This ticket clarifies them and separates them into use cases.

1. zot on a MC

1.1. zot on a MC as a cache for gsoci registry

We deploy an instance of zot on the MC, configured as an on-demand pull-through cache for the `gsoci.azurecr.io` registry.
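
A minimal sketch of the relevant part of zot's `config.json` for this, assuming `gsoci.azurecr.io` as the only upstream and purely on-demand syncing (the excerpt actually used during testing appears in a later comment):

```json
{
  "storage": { "rootDirectory": "/var/lib/registry" },
  "http": { "address": "0.0.0.0", "port": "5000" },
  "extensions": {
    "sync": {
      "enable": true,
      "registries": [
        {
          "urls": ["https://gsoci.azurecr.io"],
          "onDemand": true,
          "tlsVerify": true
        }
      ]
    }
  }
}
```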

To use this cache, the following configuration has to be applied (and its application made possible) via the cluster apps' containerd config.
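
A sketch of what that could look like in the cluster chart values, assuming a hypothetical MC named `mymc` on the `gaws.gigantic.io` base domain, with the upstream registry kept as a fallback:

```yaml
global:
  components:
    containerd:
      containerRegistries:
        gsoci.azurecr.io:
          - endpoint: zot.mymc.gaws.gigantic.io   # MC zot pull-through cache
          - endpoint: gsoci.azurecr.io            # upstream fallback
```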

1.2. zot on a MC as a cache for customer specific images

We deploy a second instance of zot on the MC, according to the configuration provided by the customer.

CAPI cluster apps have to support configuring multiple registries, including optional authentication, so that, for example, the following containerd configuration is possible.
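
A sketch of such values, assuming a hypothetical customer registry `registry.customer.example` served by a second zot instance on the MC; the per-endpoint credential field names below are illustrative, not the exact cluster chart schema:

```yaml
global:
  components:
    containerd:
      containerRegistries:
        gsoci.azurecr.io:
          - endpoint: zot.mymc.gaws.gigantic.io
          - endpoint: gsoci.azurecr.io
        registry.customer.example:
          - endpoint: zot-customer.mymc.gaws.gigantic.io
            # illustrative auth fields, to be aligned with the actual schema
            credentials:
              username: customer-pull
              password: <CUSTOMER_PULL_TOKEN>
```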

[!NOTE] If we find running 2 separate instances of zot on the MC hard to operate or a waste of resources, we will reconsider and run just 1 instance shared with customers.

2. zot on WC

We deploy zot on the WC itself. We consider this an opt-in solution. To make it work, we expose zot's Service as a NodePort on localhost and then point the cluster App configuration at it, using HTTP, localhost and anonymous access. This means configuring two parts of the WC deployment: the zot Service exposure (NodePort on localhost) and the cluster app's containerd registry configuration.
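
On the containerd side, a sketch of the values, assuming a hypothetical NodePort 30090 for the WC-local zot and that the endpoint value may carry an `http://` scheme for anonymous, localhost-only access:

```yaml
global:
  components:
    containerd:
      containerRegistries:
        gsoci.azurecr.io:
          - endpoint: http://127.0.0.1:30090   # WC-local zot exposed as a NodePort
          - endpoint: gsoci.azurecr.io         # upstream fallback
```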

This is a pretty flexible setup, but we already know we want to cover some more specific use cases with it.

2.1. zot as edge WC's cache

Zot has to be deployed by default. We configure it to actively replicate a selected set of images (everything needed to start a new cluster node, from both the Giant Swarm registry and the customer's) and to lazily cache every other image pulled, as sketched below.
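
A rough sketch of the sync extension for such an edge cache, assuming a hypothetical customer registry `registry.customer.example` and illustrative repository prefixes; the exact content-filter semantics should be checked against the zot sync documentation:

```json
"extensions": {
  "sync": {
    "enable": true,
    "registries": [
      {
        "urls": ["https://gsoci.azurecr.io"],
        "onDemand": true,
        "pollInterval": "6h",
        "content": [
          { "prefix": "giantswarm/kubelet" },
          { "prefix": "giantswarm/cilium" }
        ]
      },
      {
        "urls": ["https://registry.customer.example"],
        "onDemand": true
      }
    ]
  }
}
```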

2.2. zot on normal clusters

zot is opt-in and can be configured arbitrarily, but it has to have an entry for the gsoci.azurecr.io registry, pointing either to the MC's zot or directly to the upstream registry.

uvegla commented 2 months ago

Test results

See: https://github.com/giantswarm/giantswarm/issues/30596 for a lot of example manifests used for testing.

1. zot on a MC

1.1. zot on a MC as a cache for gsoci registry

Zot can be configured to have authentication towards gsoci via its secret files mechanism and config.json.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: zot-test-app-secret
  namespace: giantswarm
stringData:
  secret-values.yaml: |-
    mountSecret: true
    secretFiles:
      htpasswd: |-
        prom:<HTPASSWD_PROM_PASSWORD>
        admin:<HTPASSWD_ADMIN_PASSWORD>
        example:<HTPASSWD_EXAMPLE_PASSWORD>
      authenticatedRegistries: |-
        {
          "gsoci.azurecr.io": {
            "username": "<GSOCI_PRIVATE_USER>",
            "password": "<GSOCI_PRIVATE_TOKEN>"
          }
        }
    serviceMonitor:
      basicAuth:
        username: prom
        password: <PROM_PASSWORD>
```

And part of the config.json:

```json
# ...
"extensions": {
  "sync": {
    "enable": true,
    "credentialsFile": "/secret/authenticatedRegistries",
    "registries": [
      {
        "urls": [
          "https://gsoci.azurecr.io"
        ],
        "onDemand": true,
        "tlsVerify": true,
        "maxRetries": 3,
        "retryDelay": "5m"
      },
# ...
```

The containerd configuration should be set via the cluster-PROVIDER chart values, e.g. cluster-aws. In this case the MC Zot ingress is public, so no credentials are needed in the containerd configuration: we just set the mirror to e.g. zot.MC_NAME.gaws.gigantic.io and add gsoci itself as a fallback with no auth. This way, if we need to disable anonymous access on the registry itself, the fallback will stop working, but the primary is the MC Zot, which already connects to the registry authenticated. Rotating keys on the MC Zot is also much easier than rolling the WC nodes to change their containerd configuration.

Experienced no issues with Zot as a pull-through cache.

1.2. zot on a MC as a cache for customer specific images

Had no issues deploying a secondary Zot on the MC. Created a WC with a secondary, fully authenticated Zot as well.

There are some bottlenecks in the cluster chart (https://github.com/giantswarm/cluster) though regarding setting authenticated registries. See: https://github.com/giantswarm/roadmap/issues/3491 + https://gigantic.slack.com/archives/C0559SH3RJ4/p1716992063720939

2. Zot on WC

2.1. zot as edge WC's cache

Tested and it worked with these containerd config changes in the cluster chart: https://github.com/giantswarm/cluster/pull/178, tested together with: https://github.com/giantswarm/cluster-aws/pull/620

⚠️ To deploy it by default to new WCs, we still have to figure out the delivery mechanism. I am against tying it to cluster-PROVIDER releases, because that couples changes to Zot to cluster upgrades. This remains to be discussed at the time of writing.

2.2. zot on normal clusters

It seems to be just a different permutation of the settings above.

uvegla commented 2 months ago

About authentication

The registry must be Standard or Premium to support anonymous access: https://learn.microsoft.com/en-gb/azure/container-registry/container-registry-skus#service-tier-features-and-limits

It is not enabled by default. Enable/disable it via the CLI: https://learn.microsoft.com/en-us/azure/container-registry/anonymous-pull-access

Pull and metadata-read access is needed on the token for Zot to sync images. [Screenshot: ACR token permissions, 2024-06-26]

After that, once Zot is using credentials, anonymous access obviously no longer matters to it.

We also verified with active syncing that Zot keeps replicating other registries fine even if one of them has auth issues.

To create a token via CLI per MC, for example via mc-bootstrap, see: https://learn.microsoft.com/en-us/azure/container-registry/container-registry-repository-scoped-permissions#create-token---cli

uvegla commented 1 month ago

About the default containerd config for WCs to use the MC Zot as a pull-through cache

To roll out the containerd config to all WCs so they use the MC Zot, set it in the cluster and cluster-test catalog values, for example:

```yaml
global:
  components:
    containerd:
      containerRegistries:
        gsoci.azurecr.io:
          - endpoint: zot.<managementCluster>.<baseDomain OR global.connectivity.baseDomain>
          - endpoint: gsoci.azurecr.io
```

This means duplicating some of the values, but at least they live in one place.

Also, by its nature this is a staged rollout at the per-MC level. If it needs to happen per WC, we could add an extra config per WC and, once all WCs are done, move the setting to the catalog ConfigMap and remove the extra configs as cleanup. But that should hardly be a problem, since we have barely any production CAPx MCs/WCs, which also implies that the sooner we do this the better.

One "issue" I can think of - since we are usually afraid of it for whatever reason - is that the the MC cluster app is from these catalogs as well so the MC itself will roll too I think. Which is again probably even desirable so it uses the registry cache for itself as well.

I double checked: if a customer or we want additional containerd config, it will be merged, since .global.components.containerd.containerRegistries is a list of objects. Drawback: if you want the extra entries but do not want the MC zot cache, you cannot opt out. That is kind of how it has always been though, and it is unclear why anyone would not want it.

Another "drawback" that comes to mind is that it might be problematic to create older cluster versions if they have schema validation enabled and they do not have these values yet. The new value however is the local cache support, .global.components.containerd.containerRegistries existed for a long time if I am not mistaken.

mproffitt commented 1 month ago

This is now complete from Honeybadger's side.