hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0

gateway-resources service account missing imagePullSecrets #3862

Closed VladAlexF closed 3 months ago

VladAlexF commented 7 months ago

Overview of the Issue

consul-k8s/charts/consul/templates/gateway-resources-serviceaccount.yaml is missing imagePullSecrets, which breaks the usage of private docker registries, as the Gateway Resources Job cannot pull the consul-k8s-control-plane image from private registries without these secrets.
Note, other service accounts do include the imagePullSecrets, and therefore other pods can successfully pull from the private registry.

Reproduction Steps

  1. Run a helm install with the following values.yaml file:
    global:
      imagePullSecrets:
      - name: private-registry-pull-secret
      imageConsulDataplane: <private-dockerhub-proxy-cache>.com/dockerhub/hashicorp/consul-dataplane:latest
      imageK8S: <private-dockerhub-proxy-cache>.com/dockerhub/hashicorp/consul-k8s-control-plane:latest
      image: <private-dockerhub-proxy-cache>.com/dockerhub/hashicorp/consul:latest
  2. The <release-name>-gateway-resources job cannot launch containers, as it cannot pull the image from the private registry, due to missing imagePullSecrets on the service account the job uses.
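For reference, a minimal sketch of the service account the job would need in order to pull from the private registry. The name consul-gateway-resources is an assumption inferred from the pod name consul-gateway-resources-2fz5z; the chart may render a different name.

```yaml
# Sketch only: the ServiceAccount name is assumed from the job's pod name;
# the chart may render a different one.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: consul-gateway-resources
  namespace: consul
# Pods that run under this service account and set no pull secrets of
# their own get these added when they are created.
imagePullSecrets:
  - name: private-registry-pull-secret
```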

Logs

The container never starts, so it produces no logs. Instead, here are the Kubernetes events for the pod, taken from kubectl -n consul describe pod consul-gateway-resources-2fz5z:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  53m                    default-scheduler  Successfully assigned consul/consul-gateway-resources-2fz5z to k8s05
  Normal   Pulling    52m (x4 over 53m)      kubelet            Pulling image "<private-registry>.com/dockerhub/hashicorp/consul-k8s-control-plane:latest"
  Warning  Failed     52m (x4 over 53m)      kubelet            Failed to pull image "<private-registry>.com/dockerhub/hashicorp/consul-k8s-control-plane:latest": failed to pull and unpack image "<private-registry>.com/dockerhub/hashicorp/consul-k8s-control-plane:latest": failed to resolve reference "<private-registry>.com/dockerhub/hashicorp/consul-k8s-control-plane:latest": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
  Warning  Failed     52m (x4 over 53m)      kubelet            Error: ErrImagePull
  Warning  Failed     51m (x6 over 53m)      kubelet            Error: ImagePullBackOff
  Normal   BackOff    3m24s (x218 over 53m)  kubelet            Back-off pulling image "<private-registry>.com/dockerhub/hashicorp/consul-k8s-control-plane:latest"

Expected behavior

The helm install can successfully pull images from the private registry, and run the gateway-resources job.

Environment details

pawellegowski89 commented 6 months ago

+1

pawellegowski89 commented 6 months ago

Additionally, a similar problem occurs after adding an API Gateway custom resource (CR) when the images are in a private registry:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: api-gateway
  namespace: consul
spec:
  gatewayClassName: consul
  listeners:
  ...

Once the Gateway is added, a ServiceAccount and a Deployment are created automatically, and the API Gateway pods run under that ServiceAccount. That ServiceAccount is also missing imagePullSecrets.

As a result, the init container (consul-connect-inject-init) can't pull its image from the private registry:

Init Containers:
  consul-connect-inject-init:
    Container ID:
    Image:         <private-registry>/consul-k8s-control-plane:1.4.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
      consul-k8s-control-plane connect-init -pod-name=${POD_NAME} -pod-namespace=${POD_NAMESPACE} \
        -gateway-kind="api-gateway" \
        -log-json=false \
        -service-account-name="my-own-api-gateway" \
        -service-name="my-own-api-gateway"
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False

The ServiceAccount passed as -service-account-name="my-own-api-gateway" does not contain imagePullSecrets.

Affected version consul chart: 1.18.1

pawellegowski89 commented 3 months ago

Affected version consul chart: 1.19.0 too.

missylbytes commented 3 months ago

I was able to mostly recreate this locally. However the gateway-resources-job was able to start because other pods had already pulled the image. Are you seeing that same behavior, that the job will eventually run? Looks like this error is generally getting swallowed. Do you have a specific imagePullPolicy or something set?

~ Edited to say I think we should fix this, just want to know if it is blocking, since my cluster seems okay to come up.

Events from the gateway-resources-job:

  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  10s   default-scheduler  Successfully assigned consul/consul-gateway-resources-gsgbw to kind-control-plane
  Normal   Pulling    9s    kubelet            Pulling image "private-repo-image"
  Warning  Failed     5s    kubelet            Failed to pull image "private-repo-image": rpc error: code = Unknown desc = failed to pull and unpack image "private-repo-image": failed to resolve reference "private-repo-image": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
  Warning  Failed     5s    kubelet            Error: ErrImagePull
  Normal   Pulled     4s    kubelet            Container image "private-repo-image" already present on machine
  Normal   Created    4s    kubelet            Created container gateway-resources
  Normal   Started    3s    kubelet            Started container gateway-resources

pawellegowski89 commented 3 months ago

My case:

  1. Pull the official images with docker and add my own tags.
  2. Push the retagged images to a private registry (Azure Container Registry) that is secured, so an imagePullSecret is required.
  3. Add the secret on k8s: myregistry.azurecr.io-access (kubernetes.io/dockerconfigjson). It is valid.
  4. Download the official consul helm chart 1.5.1 (Consul 1.19.1).
  5. Set in values yaml:
    global:
      datacenter: mycenter
      name: consul
      image: myregistry.azurecr.io/repo/release/consul:1.19.1
      imageK8S: myregistry.azurecr.io/repo/release/consul-k8s-control-plane:1.5.1
      imageConsulDataplane: myregistry.azurecr.io/repo/release/consul-dataplane:1.5.1
      imagePullSecrets:
        - name: myregistry.azurecr.io-access
  6. Install this chart on k8s.
  7. Job/server-consul-gateway-resources can't pull the image from the private registry:
$ k get job -A | grep consul
tool      server-consul-gateway-resources   0/1           30m        30m
tool      server-consul-server-acl-init     1/1           2m21s      30m
$ k describe pod server-consul-gateway-resources-st4dp -n tool
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  12m                   default-scheduler  Successfully assigned common/server-consul-gateway-resources-st4dp to aks-nodepool1-mynodepool
  Warning  Failed     11m (x6 over 12m)     kubelet            Error: ImagePullBackOff
  Normal   Pulling    10m (x4 over 12m)     kubelet            Pulling image "myregistry.azurecr.io/repo/release/consul-k8s-control-plane:1.5.1"
  Warning  Failed     10m (x4 over 12m)     kubelet            Failed to pull image "myregistry.azurecr.io/repo/release/consul-k8s-control-plane:1.5.1":
 failed to pull and unpack image "myregistry.azurecr.io/repo/release/consul-k8s-control-plane:1.5.1": 
 failed to resolve reference "myregistry.azurecr.io/repo/release/consul-k8s-control-plane:1.5.1":
 failed to authorize: failed to fetch anonymous token:
 unexpected status from GET request to https://myregistry.azurecr.io/oauth2/token?scope=repository%3Arepo%2Frelease%2Fconsul-k8s-control-plane%3Apull&service=myregistry.azurecr.io: 401 Unauthorized
  Warning  Failed     10m (x4 over 12m)     kubelet            Error: ErrImagePull
  Normal   BackOff    2m11s (x45 over 12m)  kubelet            Back-off pulling image "myregistry.azurecr.io/repo/release/consul-k8s-control-plane:1.5.1"

In answer to "Do you have a specific imagePullPolicy or something set?": no.

missylbytes commented 3 months ago

Hi, one more update. I was able to recreate this issue completely by setting the imagePullPolicy to "Always". Is it possible your app/cluster sets it to "Always" by default? Until our fix is in, setting it to "IfNotPresent" may work as a workaround.
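A sketch of that workaround in values.yaml, assuming the chart version in use exposes a global.imagePullPolicy value. That key does not appear elsewhere in this thread, so check the chart's values.yaml before relying on it. The secret name is the one from the values shown earlier in the thread.

```yaml
global:
  imagePullSecrets:
    - name: myregistry.azurecr.io-access
  # Assumed key: verify it exists in your chart version's values.yaml.
  # IfNotPresent lets the kubelet reuse an image that is already on the node.
  imagePullPolicy: IfNotPresent
```

This only helps on nodes where another pod has already pulled the image; a node without a cached copy would still need working pull credentials.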

pawellegowski89 commented 1 month ago

In version 1.19.2 you fixed the bug only for pulling the consul-k8s-control-plane image, which the init container needs.

The API Gateway itself still has to pull its consul-dataplane image, and the bug remains for that pull. When a CR with kind: Gateway is deployed, a Deployment is created that correctly points to a ServiceAccount, but that ServiceAccount, named <gateway-name>-gateway, does not contain imagePullSecrets in its definition.

Command to inspect it:

kubectl get serviceAccount my-api-gateway -n ns

The resulting ServiceAccount definition, with imagePullSecrets missing:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    component: api-gateway
    gateway.consul.hashicorp.com/created: "1725890955"
    gateway.consul.hashicorp.com/managed: "true"
    gateway.consul.hashicorp.com/name: int-mesh-gateway
    gateway.consul.hashicorp.com/namespace: test
  name: my-api-gateway
  namespace: test
  ownerReferences:
  - apiVersion: gateway.networking.k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Gateway
    name: my-api-gateway   
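
For comparison, a sketch of how the same ServiceAccount would look with the pull secret attached. The secret name is taken from the values.yaml posted earlier in this thread, and labels and ownerReferences are omitted for brevity. This illustrates the expected result, not the controller's actual output.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-api-gateway
  namespace: test
# Missing from the generated object: without this list, the gateway pods
# have no credentials for the private registry.
imagePullSecrets:
  - name: myregistry.azurecr.io-access
```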

I have been waiting for this fix for a long time. It is a pity that it covers only the init container and not the API gateway container.

Logs in 1.19.2 version:

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  14m                   default-scheduler  Successfully assigned spoc/my-api-gateway-58c4f65b9c-82ktg to aks-nodepool1-myVM
  Normal   Pulled     14m                   kubelet            Pulling image "myregistry.azurecr.io/repo/release/consul-k8s-control-plane:1.5.3" 
  Normal   Created    14m                   kubelet            Created container consul-connect-inject-init
  Normal   Started    14m                   kubelet            Started container consul-connect-inject-init
  Normal   Pulling    13m (x4 over 14m)     kubelet            Pulling image "myregistry.azurecr.io/repo/release/consul-dataplane:1.5.3"
  Warning  Failed     13m (x4 over 14m)     kubelet            Failed to pull image "myregistry.azurecr.io/repo/release/consul-dataplane:1.5.3": failed to pull and unpack image "myregistry.azurecr.io/repo/release/consul-dataplane:1.5.3": failed to resolve reference "myregistry.azurecr.io/repo/release/consul-dataplane:1.5.3": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://myregistry.azurecr.io/oauth2/token?scope=repository%3Arelease%2Fconsul-dataplane%3Apull&service=myregistry.azurecr.io: 401 Unauthorized
  Warning  Failed     13m (x4 over 14m)     kubelet            Error: ErrImagePull
  Warning  Failed     13m (x5 over 14m)     kubelet            Error: ImagePullBackOff
  Normal   BackOff    4m39s (x42 over 14m)  kubelet            Back-off pulling image "myregistry.azurecr.io/repo/release/consul-dataplane:1.5.3"

Please test this in full scope: install the chart with all images hosted in a private registry that requires authorization, then apply a YAML file with a Gateway CR for the API gateway.

pawellegowski89 commented 1 month ago

I finally found the error:

https://github.com/hashicorp/consul-k8s/blob/v1.5.3/control-plane/gateways/serviceaccount.go

The ServiceAccount built there does not include the imagePullSecrets set in the Helm chart under:

global:
  image: myacr.azurecr.io/release/consul:1.19.2
  imageK8S: myacr.azurecr.io/release/consul-k8s-control-plane:1.5.3
  imageConsulDataplane: myacr.azurecr.io/release/consul-dataplane:1.5.3
  imagePullSecrets:
    - name: mspodemo.azurecr.io-access

pawellegowski89 commented 1 month ago

I have filed this as a new bug report, since this issue is already closed.