kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0
13.38k stars 1.55k forks

Local image cache broken on cluster nodes #3435

Closed toaler closed 10 months ago

toaler commented 10 months ago

I've been following the documentation here on how to make an image accessible to pods spun up on worker nodes. I'm running kind 0.20.0 on Linux 5.15.0.

Here's what I'm doing:

On my Linux host I first pull down the image (which happens to already be present locally):

❯ docker pull quay.io/argoproj/argocd:v2.9.2
v2.9.2: Pulling from argoproj/argocd
Digest: sha256:8576d347f30fa4c56a0129d1c0a0f5ed1e75662f0499f1ed7e917c405fd909dc
Status: Image is up to date for quay.io/argoproj/argocd:v2.9.2
quay.io/argoproj/argocd:v2.9.2

Then I load it to all nodes in the kind cluster (called argo).

❯ kind load docker-image "quay.io/argoproj/argocd:v2.9.2" --name argo
Image: "quay.io/argoproj/argocd:v2.9.2" with ID "sha256:8a4444aa1957ff41666caac0000f8a40c79b8df4a724b953ef00036fa32ee611" found to be already present on all nodes.

I can see the image on the worker node:

❯ docker exec -it argo-worker crictl images --no-trunc
IMAGE                                      TAG                  IMAGE ID                                                                  SIZE
docker.io/kindest/kindnetd                 v20230511-dc714da8   sha256:b0b1fa0f58c6e932b7f20bf208b2841317a1e8c88cc51b18358310bbd8ec95da   27.7MB
docker.io/kindest/local-path-helper        v20230510-486859a6   sha256:be300acfc86223548b4949398f964389b7309dfcfdcfc89125286359abb86956   3.05MB
docker.io/kindest/local-path-provisioner   v20230511-dc714da8   sha256:ce18e076e9d4b4283a79ef706170486225475fc4d64253710d94780fb6ec7627   19.4MB
quay.io/argoproj/argocd                    v2.9.2               sha256:8a4444aa1957ff41666caac0000f8a40c79b8df4a724b953ef00036fa32ee611   433MB
registry.k8s.io/coredns/coredns            v1.10.1              sha256:ead0a4a53df89fd173874b46093b6e62d8c72967bbf606d672c9e8c9b601a4fc   16.2MB
registry.k8s.io/etcd                       3.5.7-0              sha256:86b6af7dd652c1b38118be1c338e9354b33469e69a218f7e290a0ca5304ad681   102MB
registry.k8s.io/kube-apiserver             v1.27.3              sha256:c604ff157f0cff86bfa45c67c76c949deaf48d8d68560fc4c456a319af5fd8fa   83.5MB
registry.k8s.io/kube-controller-manager    v1.27.3              sha256:9f8f3a9f3e8a9706694dd6d7a62abd1590034454974c31cd0e21c85cf2d3a1d5   74.4MB
registry.k8s.io/kube-proxy                 v1.27.3              sha256:9d5429f6d7697ae3186f049e142875ba5854f674dfee916fa6c53da276808a23   72.7MB
registry.k8s.io/kube-scheduler             v1.27.3              sha256:205a4d549b94d37cc0e39e08cbf8871ffe2d7e7cbb6832e26713cd69ea1e2c4f   59.8MB
registry.k8s.io/pause                      3.7                  sha256:221177c6082a88ea4f6240ab2450d540955ac6f4d5454f0e15751b653ebda165   311kB

However, when I try and pull the image I get the following error:

❯ docker exec -it argo-worker crictl pull quay.io/argoproj/argocd:v2.9.2
E1127 22:52:04.362129    1554 remote_image.go:167] "PullImage from image service failed" err="rpc error: code = DeadlineExceeded desc = failed to pull and unpack image \"quay.io/argoproj/argocd:v2.9.2\": failed to resolve reference \"quay.io/argoproj/argocd:v2.9.2\": failed to do request: Head \"https://quay.io/v2/argoproj/argocd/manifests/v2.9.2\": dial tcp 52.206.59.27:443: i/o timeout" image="quay.io/argoproj/argocd:v2.9.2"
FATA[0030] pulling image: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "quay.io/argoproj/argocd:v2.9.2": failed to resolve reference "quay.io/argoproj/argocd:v2.9.2": failed to do request: Head "https://quay.io/v2/argoproj/argocd/manifests/v2.9.2": dial tcp 52.206.59.27:443: i/o timeout

I suspect the lookup in the local registry is broken: the error message suggests the local lookup failed, so it falls back to pulling the image from the remote image repo.

My questions are as follows:

  1. Why is it missing from the local repo? Is this by design or is this a bug?
  2. Is there any way to allow the docker node to access the remote IP?

How do I proceed here?

kind version 0.20.0

BenTheElder commented 10 months ago

Why is it missing from the local repo? Is this by design or is this a bug?

A misunderstanding, I think: it's not "in a local repo", it's in the local on-disk content store of the node's container runtime.

Loading it into the nodes makes the image available without pulling; it doesn't change where the image is pulled from when a pull is actively requested.

The result of side-loading an image is as if the image had already been pulled once and then the internet was cut off.

The image will work for pods that do not set imagePullPolicy: Always (see also the notes at https://kind.sigs.k8s.io/docs/user/quick-start/#loading-an-image-into-your-cluster). Kubernetes will see the image is already available and skip pulling it. crictl pull tells it to pull the image even if it's already available.
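For illustration (the pod name here is hypothetical, and the example just mirrors the behavior described above), a pod that uses the side-loaded image without ever contacting quay.io would look something like this:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: argocd-image-test          # hypothetical name, just for the example
spec:
  containers:
  - name: argocd
    image: quay.io/argoproj/argocd:v2.9.2
    # IfNotPresent (the default for a non-:latest tag) uses the image already
    # loaded onto the node and never forces a pull
    imagePullPolicy: IfNotPresent
    command: ["sleep", "3600"]
EOF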

If you want to use a local registry instead, see: https://kind.sigs.k8s.io/docs/user/local-registry/, which does involve the image being pulled to the nodes.

Is there any way to allow the docker node to access the remote IP?

So, kind nodes should have internet access, but there may be complications with your host, such as a proxy or VPN or a firewall, and it's difficult to say which exactly without locally debugging.
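One quick way to rule out general node egress (just a sketch; any small public image would do) is to pull something else directly on the node and see whether that also times out:

❯ docker exec -it argo-worker crictl pull registry.k8s.io/pause:3.9

If that fails with the same i/o timeout, the problem is the node's outbound connectivity in general rather than anything specific to quay.io.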

toaler commented 10 months ago

Thanks for the quick response @BenTheElder. Totally my ignorance: the Argo pods had imagePullPolicy set to Always. After patching that, the images are picked up successfully.
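For anyone else hitting this, the patch was along these lines (a sketch; the resource and container names come from the Argo CD install manifests and may differ in your setup):

kubectl -n argocd patch statefulset argocd-application-controller --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/imagePullPolicy", "value": "IfNotPresent"}]'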

I'm also experiencing a few strange problems on my Linux machine (I don't seem to have issues on OSX). Based on the worker node logs, the problems are related to various network operations timing out when executed by reflector.go. The data suggests there is a k8s/kind network infrastructure issue. I don't know enough to work out the root cause, but I'll pass on my observations.

The first problem: I deployed busybox and tried to open an interactive session with the pod.

NAMESPACE            NAME                                                READY   STATUS             RESTARTS         AGE     IP           NODE                 NOMINATED NODE   READINESS GATES
argocd               argocd-application-controller-0                     0/1     Running            0                35m     10.244.1.5   argo-worker          <none>           <none>
argocd               argocd-applicationset-controller-597bf75d56-sgzxt   0/1     CrashLoopBackOff   10 (3m32s ago)   35m     10.244.1.3   argo-worker          <none>           <none>
argocd               argocd-dex-server-55cfc8845d-hxzmf                  1/1     Running            0                35m     10.244.2.2   argo-worker2         <none>           <none>
argocd               argocd-notifications-controller-95d5754cc-lw6nz     1/1     Running            0                35m     10.244.1.4   argo-worker          <none>           <none>
argocd               argocd-redis-68654cdcdf-d8tpx                       1/1     Running            0                35m     10.244.2.3   argo-worker2         <none>           <none>
argocd               argocd-repo-server-57d74b4bf5-4jrdz                 1/1     Running            0                35m     10.244.1.2   argo-worker          <none>           <none>
argocd               argocd-server-7b78574754-m89rd                      0/1     CrashLoopBackOff   11 (4m57s ago)   35m     10.244.2.4   argo-worker2         <none>           <none>
argocd               busybox-deployment-7774b7f54c-8w28g                 1/1     Running            0                5m33s   10.244.2.5   argo-worker2         <none>           <none>
default              service-a                                           0/1     ImagePullBackOff   0                30m     10.244.1.6   argo-worker          <none>           <none>
default              service-b                                           0/1     ImagePullBackOff   0                30m     10.244.1.7   argo-worker          <none>           <none>
kube-system          coredns-5d78c9869d-hlbws                            1/1     Running            0                37m     10.244.0.4   argo-control-plane   <none>           <none>
kube-system          coredns-5d78c9869d-n99lp                            1/1     Running            0                37m     10.244.0.2   argo-control-plane   <none>           <none>
kube-system          etcd-argo-control-plane                             1/1     Running            0                38m     172.19.0.2   argo-control-plane   <none>           <none>
kube-system          kindnet-dvrlc                                       1/1     Running            0                37m     172.19.0.3   argo-worker2         <none>           <none>
kube-system          kindnet-g5pn8                                       1/1     Running            0                37m     172.19.0.2   argo-control-plane   <none>           <none>
kube-system          kindnet-skdr4                                       1/1     Running            0                37m     172.19.0.4   argo-worker          <none>           <none>
kube-system          kube-apiserver-argo-control-plane                   1/1     Running            0                38m     172.19.0.2   argo-control-plane   <none>           <none>
kube-system          kube-controller-manager-argo-control-plane          1/1     Running            3 (25m ago)      38m     172.19.0.2   argo-control-plane   <none>           <none>
kube-system          kube-proxy-mr5b2                                    1/1     Running            0                37m     172.19.0.4   argo-worker          <none>           <none>
kube-system          kube-proxy-vbntl                                    1/1     Running            0                37m     172.19.0.2   argo-control-plane   <none>           <none>
kube-system          kube-proxy-w499d                                    1/1     Running            0                37m     172.19.0.3   argo-worker2         <none>           <none>
kube-system          kube-scheduler-argo-control-plane                   1/1     Running            3 (25m ago)      38m     172.19.0.2   argo-control-plane   <none>           <none>
local-path-storage   local-path-provisioner-6bc4bddd6b-nw4c9             1/1     Running            0                37m     10.244.0.3   argo-control-plane   <none>           <none>

An attempt to connect to busybox-deployment-7774b7f54c-8w28g comes up empty-handed:

$ kubectl exec -it busybox-deployment-7774b7f54c-8w28g -n argocd -- /bin/bash
Error from server: error dialing backend: dial tcp 172.19.0.3:10250: i/o timeout
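That error is the API server (on argo-control-plane) failing to reach the kubelet on argo-worker2 at 172.19.0.3:10250. A rough way to confirm the node-to-node path is broken (a sketch using bash's /dev/tcp, assuming bash and timeout are present in the node image) is:

❯ docker exec -it argo-control-plane timeout 3 bash -c 'cat < /dev/null > /dev/tcp/172.19.0.3/10250' && echo reachable || echo unreachable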

Second, 3 pods in the argocd namespace are continuously failing, all with the same readiness/liveness probe failures due to connect: connection refused. 10.244.1.5 is the IP of the pod.

Type     Reason     Age                  From               Message
----     ------     ----                 ----               -------
Normal   Scheduled  41m                  default-scheduler  Successfully assigned argocd/argocd-application-controller-0 to argo-worker
Normal   Pulled     41m                  kubelet            Container image "quay.io/argoproj/argocd:v2.9.2" already present on machine
Normal   Created    41m                  kubelet            Created container argocd-application-controller
Normal   Started    40m                  kubelet            Started container argocd-application-controller
Warning  Unhealthy  59s (x271 over 40m)  kubelet            Readiness probe failed: Get "http://10.244.1.5:8082/healthz": dial tcp 10.244.1.5:8082: connect: connection refused

I can't connect to the pod or get logs to work out why the connection isn't being established. Looking at the logs on the worker nodes for the containers, all error messages seem to be network timeout related.

Here's logging for argocd-applicationset-controller-597bf75d56-sgzxt

root@argo-worker:/var/log/containers# cat argocd-applicationset-controller-597bf75d56-sgzxt_argocd_argocd-applicationset-controller-6d283c514b03e97141458a12b193add4e5014ee598044d002861828675765840.log
2023-11-28T07:39:11.30002586Z stderr F time="2023-11-28T07:39:11Z" level=info msg="ArgoCD ApplicationSet Controller is starting" built="2023-11-20T17:18:26Z" commit=c5ea5c4df52943a6fff6c0be181fde5358970304 namespace=argocd version=v2.9.2+c5ea5c4
2023-11-28T07:39:41.302273455Z stderr F time="2023-11-28T07:39:41Z" level=error msg="Get \"https://10.96.0.1:443/api?timeout=32s\": dial tcp 10.96.0.1:443: i/o timeoutunable to start manager"

and here is a snippet of logging for argocd-application-controller-0

2023-11-28T06:52:09.050807483Z stderr F W1128 06:52:09.050375      13 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.Secret: Get "https://10.96.0.1:443/api/v1/namespaces/argocd/secrets?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
2023-11-28T06:52:09.050855657Z stderr F W1128 06:52:09.050420      13 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.ConfigMap: Get "https://10.96.0.1:443/api/v1/namespaces/argocd/configmaps?labelSelector=app.kubernetes.io%2Fpart-of%3Dargocd&limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
2023-11-28T06:52:09.050863479Z stderr F W1128 06:52:09.050387      13 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.Deployment: Get "https://10.96.0.1:443/apis/apps/v1/namespaces/argocd/deployments?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
2023-11-28T06:52:09.050869499Z stderr F W1128 06:52:09.050553      13 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1alpha1.Application: Get "https://10.96.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argocd/applications?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout

The other thing I can see that's different is the IP CIDR block used for kube-system vs argocd. argocd is using 10.244.0.0/24, whereas the kube-system pods are all in the 172.19.0.0 range. Why is the caller of reflector.go using the IP 10.96.0.1? I'm assuming it's trying to call the k8s API server in the cluster, which has the address 172.19.0.2?

A later K9s view of the same cluster (Cluster: kind-argo, K8s Rev: v1.27.3, Pods(all)[23]) shows:

NAMESPACE            NAME                                              READY  RESTARTS  STATUS            IP           NODE                 AGE
argocd               argocd-dex-server-55cfc8845d-hxzmf                1/1    0         Running           10.244.2.2   argo-worker2         69m
argocd               argocd-notifications-controller-95d5754cc-lw6nz   1/1    0         Running           10.244.1.4   argo-worker          69m
argocd               argocd-redis-68654cdcdf-d8tpx                     1/1    0         Running           10.244.2.3   argo-worker2         69m
argocd               argocd-repo-server-57d74b4bf5-4jrdz               1/1    0         Running           10.244.1.2   argo-worker          69m
argocd               argocd-server-7b78574754-m89rd                    0/1    21        Running           10.244.2.4   argo-worker2         69m
argocd               busybox-deployment-7774b7f54c-8w28g               1/1    0         Running           10.244.2.5   argo-worker2         39m
default              service-a                                         0/1    0         ImagePullBackOff  10.244.1.6   argo-worker          63m
default              service-b                                         0/1    0         ImagePullBackOff  10.244.1.7   argo-worker          63m
kube-system          coredns-5d78c9869d-hlbws                          1/1    0         Running           10.244.0.4   argo-control-plane   71m
kube-system          coredns-5d78c9869d-n99lp                          1/1    0         Running           10.244.0.2   argo-control-plane   71m
kube-system          etcd-argo-control-plane                           1/1    0         Running           172.19.0.2   argo-control-plane   71m
kube-system          kindnet-dvrlc                                     1/1    0         Running           172.19.0.3   argo-worker2         71m
kube-system          kindnet-g5pn8                                     1/1    0         Running           172.19.0.2   argo-control-plane   71m
kube-system          kindnet-skdr4                                     1/1    0         Running           172.19.0.4   argo-worker          71m
kube-system          kube-apiserver-argo-control-plane                 1/1    0         Running           172.19.0.2   argo-control-plane   71m
kube-system          kube-controller-manager-argo-control-plane        1/1    3         Running           172.19.0.2   argo-control-plane   71m
kube-system          kube-proxy-mr5b2                                  1/1    0         Running           172.19.0.4   argo-worker          71m
kube-system          kube-proxy-vbntl                                  1/1    0         Running           172.19.0.2   argo-control-plane   71m
kube-system          kube-proxy-w499d                                  1/1    0         Running           172.19.0.3   argo-worker2         71m
kube-system          kube-scheduler-argo-control-plane                 1/1    3         Running           172.19.0.2   argo-control-plane   71m
local-path-storage   local-path-provisioner-6bc4bddd6b-nw4c9           1/1    0         Running           10.244.0.3   argo-control-plane   71m
BenTheElder commented 10 months ago

10.96.0.1 is a very common k8s address: it's the first IP in the default ClusterIP range for Services of type ClusterIP (roughly, virtual in-cluster IPs) implemented by kube-proxy, and it's the API server's in-cluster address.
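You can see this on the cluster itself; the default kubernetes Service in the default namespace owns that IP (output will look roughly like this):

❯ kubectl get svc kubernetes -n default
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   71m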

As for the networking issues on the Linux host, again it's hard to say; it could be something with the docker network, or a VPN, or a firewall, or ...

toaler commented 10 months ago

After talking to @aojea, it turned out the issue was related to the bridge-nf-call-* OS parameters being set to 1. Setting them to 0 resolves this.

https://github.com/kubernetes-sigs/kind/issues/2886#issuecomment-1219158523

sysctl net.bridge.bridge-nf-call-iptables=0
sysctl net.bridge.bridge-nf-call-arptables=0
sysctl net.bridge.bridge-nf-call-ip6tables=0
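These settings don't survive a reboot. To make them persistent, one option (a sketch; the file name is arbitrary) is a drop-in under /etc/sysctl.d:

cat <<'EOF' | sudo tee /etc/sysctl.d/99-kind-bridge.conf
# keep bridged traffic out of the host's iptables so the kind node containers can reach each other
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
EOF
sudo sysctl --system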

I'm thinking it would be useful for kind cluster create (at least for multi-node clusters) to check these properties, give the operator feedback when any of them is set to 1, and provide direction on how to resolve it.

aojea commented 10 months ago

I'm thinking it would be useful for kind cluster create (at least for multi-node clusters) to check these properties, give the operator feedback when any of them is set to 1, and provide direction on how to resolve it.

Docker communication between containers will not work; this is not really a kind issue, it's a problem with the OS setup or the docker setup. kind is just a victim here ;)

toaler commented 10 months ago

Thanks for the clarification @aojea .

BenTheElder commented 10 months ago

I'm thinking it would be useful for kind cluster create (at least for multi-node clusters) to check these properties, give the operator feedback when any of them is set to 1, and provide direction on how to resolve it.

kind may be talking to a remote daemon, so this is not easy (setting aside whether we should even require this, purely from a feasibility perspective). This is common enough with e.g. Docker Desktop that we unfortunately cannot assume the local kind process is inspecting the environment in which the containers run. Any checks we would run from the containers themselves are already in the entrypoint, etc.
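If you're not sure where your daemon actually runs (a sketch; these are standard docker CLI commands), something like:

❯ docker context ls
❯ docker info --format '{{.OperatingSystem}} ({{.OSType}}, kernel {{.KernelVersion}})'

will show whether you're talking to a plain local Linux daemon or to something like Docker Desktop's VM; the bridge-nf-call-* sysctls have to be changed on the machine where the daemon (and therefore the kind node containers) actually runs.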