Closed: toaler closed this issue 10 months ago.
Why is it missing on the local repo? Is this by design or is this a bug?
Misunderstanding, I think: it's not "in a local repo," it is in the local on-disk content store of the node's container runtime.
Loading it into the nodes makes the image available without pulling; it doesn't change where the image is pulled from when a pull is actively requested.
The result of side-loading an image is as if the image had already been pulled once before and then the internet was cut off.
The image will work for pods that do not set imagePullPolicy: Always (see also the notes at https://kind.sigs.k8s.io/docs/user/quick-start/#loading-an-image-into-your-cluster). Kubernetes will see the image is already available and skip pulling it. crictl pull explicitly tells the runtime to pull the image, even if it's already available.
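To make the distinction concrete, here is a minimal pod spec that will run from a side-loaded image (a sketch; the pod name and sleep command are placeholders I've chosen, not anything from this thread):

```yaml
# Hypothetical pod using a side-loaded image. With IfNotPresent (also the
# default for tagged images), kubelet uses the copy already in the node's
# content store; imagePullPolicy: Always would force a registry pull.
apiVersion: v1
kind: Pod
metadata:
  name: argocd-smoke-test
spec:
  containers:
    - name: argocd
      image: quay.io/argoproj/argocd:v2.9.2
      imagePullPolicy: IfNotPresent
      command: ["sleep", "3600"]
```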
If you want to use a local registry instead, see: https://kind.sigs.k8s.io/docs/user/local-registry/, which does involve the image being pulled to the nodes.
Is there any way to allow the docker node to be able to access the remote IP?
So, kind nodes should have internet access, but there may be complications with your host, such as a proxy, VPN, or firewall, and it's difficult to say which exactly without debugging locally.
Thanks for the quick response @BenTheElder. Totally my ignorance. Argo pod's had imagePullPolicy set to Always. After patching that images are acquired successfully.
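For reference, a patch of the kind that flips the policy might look like the following strategic-merge fragment (a sketch; the container name here is an assumption, and the real Argo CD manifests may differ):

```yaml
# Hypothetical strategic-merge patch: switch a container to IfNotPresent
# so the side-loaded image is used instead of forcing a registry pull.
spec:
  template:
    spec:
      containers:
        - name: argocd-server            # container name is an assumption
          imagePullPolicy: IfNotPresent
```

This could be applied with something like kubectl -n argocd patch deployment argocd-server --patch-file patch.yaml.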
I'm also experiencing a few strange problems on my Linux machine (I don't seem to have issues on OSX). The problems, based on the worker node logs, are various network operations timing out when executed by reflector.go. The data suggests there is a k8s/kind network infrastructure issue. I don't know enough to work out the root cause, but I'll pass on my observations.
First problem: I deployed busybox and tried to open an interactive shell in the pod.
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
argocd argocd-application-controller-0 0/1 Running 0 35m 10.244.1.5 argo-worker <none> <none>
argocd argocd-applicationset-controller-597bf75d56-sgzxt 0/1 CrashLoopBackOff 10 (3m32s ago) 35m 10.244.1.3 argo-worker <none> <none>
argocd argocd-dex-server-55cfc8845d-hxzmf 1/1 Running 0 35m 10.244.2.2 argo-worker2 <none> <none>
argocd argocd-notifications-controller-95d5754cc-lw6nz 1/1 Running 0 35m 10.244.1.4 argo-worker <none> <none>
argocd argocd-redis-68654cdcdf-d8tpx 1/1 Running 0 35m 10.244.2.3 argo-worker2 <none> <none>
argocd argocd-repo-server-57d74b4bf5-4jrdz 1/1 Running 0 35m 10.244.1.2 argo-worker <none> <none>
argocd argocd-server-7b78574754-m89rd 0/1 CrashLoopBackOff 11 (4m57s ago) 35m 10.244.2.4 argo-worker2 <none> <none>
argocd busybox-deployment-7774b7f54c-8w28g 1/1 Running 0 5m33s 10.244.2.5 argo-worker2 <none> <none>
default service-a 0/1 ImagePullBackOff 0 30m 10.244.1.6 argo-worker <none> <none>
default service-b 0/1 ImagePullBackOff 0 30m 10.244.1.7 argo-worker <none> <none>
kube-system coredns-5d78c9869d-hlbws 1/1 Running 0 37m 10.244.0.4 argo-control-plane <none> <none>
kube-system coredns-5d78c9869d-n99lp 1/1 Running 0 37m 10.244.0.2 argo-control-plane <none> <none>
kube-system etcd-argo-control-plane 1/1 Running 0 38m 172.19.0.2 argo-control-plane <none> <none>
kube-system kindnet-dvrlc 1/1 Running 0 37m 172.19.0.3 argo-worker2 <none> <none>
kube-system kindnet-g5pn8 1/1 Running 0 37m 172.19.0.2 argo-control-plane <none> <none>
kube-system kindnet-skdr4 1/1 Running 0 37m 172.19.0.4 argo-worker <none> <none>
kube-system kube-apiserver-argo-control-plane 1/1 Running 0 38m 172.19.0.2 argo-control-plane <none> <none>
kube-system kube-controller-manager-argo-control-plane 1/1 Running 3 (25m ago) 38m 172.19.0.2 argo-control-plane <none> <none>
kube-system kube-proxy-mr5b2 1/1 Running 0 37m 172.19.0.4 argo-worker <none> <none>
kube-system kube-proxy-vbntl 1/1 Running 0 37m 172.19.0.2 argo-control-plane <none> <none>
kube-system kube-proxy-w499d 1/1 Running 0 37m 172.19.0.3 argo-worker2 <none> <none>
kube-system kube-scheduler-argo-control-plane 1/1 Running 3 (25m ago) 38m 172.19.0.2 argo-control-plane <none> <none>
local-path-storage local-path-provisioner-6bc4bddd6b-nw4c9 1/1 Running 0 37m 10.244.0.3 argo-control-plane <none> <none>
An attempt to connect to busybox-deployment-7774b7f54c-8w28g fails:
$ kubectl exec -it busybox-deployment-7774b7f54c-8w28g -n argocd -- /bin/bash
Error from server: error dialing backend: dial tcp 172.19.0.3:10250: i/o timeout
Secondly, 3 pods in the argocd namespace are continuously failing, all with the same readiness/liveness probe failures due to connect: connection refused. 10.244.1.5 is the IP of the pod.
│ Type Reason Age From Message │
│ ---- ------ ---- ---- ------- │
│ Normal Scheduled 41m default-scheduler Successfully assigned argocd/argocd-application-controller-0 to argo-worker │
│ Normal Pulled 41m kubelet Container image "quay.io/argoproj/argocd:v2.9.2" already present on machine │
│ Normal Created 41m kubelet Created container argocd-application-controller │
│ Normal Started 40m kubelet Started container argocd-application-controller │
│ Warning Unhealthy 59s (x271 over 40m) kubelet Readiness probe failed: Get "http://10.244.1.5:8082/healthz": dial tcp 10.244.1.5:8082: connect: connection refused
I can't connect to the pod or get logs to work out why the connection isn't being established. Looking at the logs on the worker nodes for the containers, all error messages seem to be network timeout related.
Here's logging for argocd-applicationset-controller-597bf75d56-sgzxt
root@argo-worker:/var/log/containers# cat argocd-applicationset-controller-597bf75d56-sgzxt_argocd_argocd-applicationset-controller-6d283c514b03e97141458a12b193add4e5014ee598044d002861828675765840.log
2023-11-28T07:39:11.30002586Z stderr F time="2023-11-28T07:39:11Z" level=info msg="ArgoCD ApplicationSet Controller is starting" built="2023-11-20T17:18:26Z" commit=c5ea5c4df52943a6fff6c0be181fde5358970304 namespace=argocd version=v2.9.2+c5ea5c4
2023-11-28T07:39:41.302273455Z stderr F time="2023-11-28T07:39:41Z" level=error msg="Get \"https://10.96.0.1:443/api?timeout=32s\": dial tcp 10.96.0.1:443: i/o timeoutunable to start manager"
and here is a snippet of logging for argocd-application-controller-0
2023-11-28T06:52:09.050807483Z stderr F W1128 06:52:09.050375 13 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.Secret: Get "https://10.96.0.1:443/api/v1/namespaces/argocd/secrets?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
2023-11-28T06:52:09.050855657Z stderr F W1128 06:52:09.050420 13 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.ConfigMap: Get "https://10.96.0.1:443/api/v1/namespaces/argocd/configmaps?labelSelector=app.kubernetes.io%2Fpart-of%3Dargocd&limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
2023-11-28T06:52:09.050863479Z stderr F W1128 06:52:09.050387 13 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.Deployment: Get "https://10.96.0.1:443/apis/apps/v1/namespaces/argocd/deployments?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
2023-11-28T06:52:09.050869499Z stderr F W1128 06:52:09.050553 13 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1alpha1.Application: Get "https://10.96.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argocd/applications?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
Also, the other thing I can see that's different is the IP CIDR block used for kube-system vs argocd: argocd is using the 10.244.0.0/24 range, whereas the kube-system pods are all in the 172.19.0.0 range. Why is the caller of reflector.go using the IP 10.96.0.1? I'm assuming it's trying to call the k8s API server in the cluster, which has the address 172.19.0.2?
K9s view of the cluster (cluster kind-argo, K9s Rev v0.25.18, K8s Rev v1.27.3):
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Pods(all)[23] ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ NAMESPACE↑ NAME PF READY RESTARTS STATUS IP NODE AGE │
│ argocd argocd-dex-server-55cfc8845d-hxzmf ● 1/1 0 Running 10.244.2.2 argo-worker2 69m │
│ argocd argocd-notifications-controller-95d5754cc-lw6nz ● 1/1 0 Running 10.244.1.4 argo-worker 69m │
│ argocd argocd-redis-68654cdcdf-d8tpx ● 1/1 0 Running 10.244.2.3 argo-worker2 69m │
│ argocd argocd-repo-server-57d74b4bf5-4jrdz ● 1/1 0 Running 10.244.1.2 argo-worker 69m │
│ argocd argocd-server-7b78574754-m89rd ● 0/1 21 Running 10.244.2.4 argo-worker2 69m │
│ argocd busybox-deployment-7774b7f54c-8w28g ● 1/1 0 Running 10.244.2.5 argo-worker2 39m │
│ default service-a ● 0/1 0 ImagePullBackOff 10.244.1.6 argo-worker 63m │
│ default service-b ● 0/1 0 ImagePullBackOff 10.244.1.7 argo-worker 63m │
│ kube-system coredns-5d78c9869d-hlbws ● 1/1 0 Running 10.244.0.4 argo-control-plane 71m │
│ kube-system coredns-5d78c9869d-n99lp ● 1/1 0 Running 10.244.0.2 argo-control-plane 71m │
│ kube-system etcd-argo-control-plane ● 1/1 0 Running 172.19.0.2 argo-control-plane 71m │
│ kube-system kindnet-dvrlc ● 1/1 0 Running 172.19.0.3 argo-worker2 71m │
│ kube-system kindnet-g5pn8 ● 1/1 0 Running 172.19.0.2 argo-control-plane 71m │
│ kube-system kindnet-skdr4 ● 1/1 0 Running 172.19.0.4 argo-worker 71m │
│ kube-system kube-apiserver-argo-control-plane ● 1/1 0 Running 172.19.0.2 argo-control-plane 71m │
│ kube-system kube-controller-manager-argo-control-plane ● 1/1 3 Running 172.19.0.2 argo-control-plane 71m │
│ kube-system kube-proxy-mr5b2 ● 1/1 0 Running 172.19.0.4 argo-worker 71m │
│ kube-system kube-proxy-vbntl ● 1/1 0 Running 172.19.0.2 argo-control-plane 71m │
│ kube-system kube-proxy-w499d ● 1/1 0 Running 172.19.0.3 argo-worker2 71m │
│ kube-system kube-scheduler-argo-control-plane ● 1/1 3 Running 172.19.0.2 argo-control-plane 71m │
│ local-path-storage local-path-provisioner-6bc4bddd6b-nw4c9 ● 1/1 0 Running 10.244.0.3 argo-control-plane 71m │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
10.96.0.1 is a very common k8s address: it's the first IP in the default ClusterIP range for Services of type ClusterIP (~~virtual in-cluster IPs) implemented by kube-proxy. It's the API server's in-cluster address.
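The "first IP in the range" arithmetic above can be sketched directly (a sketch assuming kind's default Service CIDR of 10.96.0.0/12; the variable names are mine):

```shell
# Sketch: where 10.96.0.1 comes from. kind's default Service CIDR is
# 10.96.0.0/12, and the "kubernetes" Service gets the first usable
# address in that range (the network address plus one).
cidr="10.96.0.0/12"
base="${cidr%/*}"                       # strip the prefix length -> 10.96.0.0
IFS=. read -r a b c d <<EOF
$base
EOF
apiserver_clusterip="$a.$b.$c.$((d + 1))"
echo "$apiserver_clusterip"             # prints 10.96.0.1
```

On a live cluster, kubectl get svc kubernetes -n default should show this ClusterIP, and the endpoints behind it resolve to the control-plane node address (172.19.0.2 here), which is why traffic to 10.96.0.1 ultimately reaches the API server.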
As for networking issues on the Linux host, again it's hard to say; it could be something with the Docker network, a VPN, a firewall, or ...
After talking to @aojea, the issue was related to the bridge-nf-call-* OS parameters being set to 1. Setting them to 0 resolves this.
https://github.com/kubernetes-sigs/kind/issues/2886#issuecomment-1219158523
sysctl net.bridge.bridge-nf-call-iptables=0
sysctl net.bridge.bridge-nf-call-arptables=0
sysctl net.bridge.bridge-nf-call-ip6tables=0
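A preflight check of this kind could be sketched as a small shell function (a sketch, not part of kind; the optional directory argument is mine, added so the logic can be exercised against fixture files):

```shell
# check_bridge_sysctls [DIR] - print any bridge-nf-call-* sysctl that is
# set to 1 (the value that breaks traffic between kind's node containers).
# Missing files just mean the br_netfilter module is not loaded, which is
# fine. DIR defaults to the real /proc location.
check_bridge_sysctls() {
  dir="${1:-/proc/sys/net/bridge}"
  for name in bridge-nf-call-iptables bridge-nf-call-arptables bridge-nf-call-ip6tables; do
    f="$dir/$name"
    [ -r "$f" ] || continue
    if [ "$(cat "$f")" = "1" ]; then
      echo "net.bridge.$name=1 (consider: sysctl net.bridge.$name=0)"
    fi
  done
}

# Report any problematic values on this host (prints nothing when clean
# or when br_netfilter is not loaded).
check_bridge_sysctls
```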
I'm thinking it would be useful, on kind multi-node cluster create, to check these properties and give the operator feedback when any one of them is set to 1, along with direction on how to resolve it.
When these are set, Docker communication between containers will not work. It's not really a kind issue; it's a problem with the OS or Docker setup. kind is just a victim here ;)
Thanks for the clarification @aojea .
I'm thinking it would be useful, on kind multi-node cluster create, to check these properties and give the operator feedback when any one of them is set to 1, along with direction on how to resolve it.
kind may be talking to a remote daemon, so this is not easy (setting aside whether we even should require this, purely from a feasibility perspective). This is common enough with e.g. Docker Desktop that we unfortunately cannot assume the local kind process is inspecting the environment in which the containers run. Any checks we would run from the containers themselves are already in the entrypoint, etc.
I've been following the documentation here on how to make an image accessible to pods spun up on worker nodes. I'm running kind version 0.20.0 on Linux 5.15.0.
Here's what I'm doing:
On my Linux host I first pull down the image (which happens to already be present locally):
❯ docker pull quay.io/argoproj/argocd:v2.9.2
v2.9.2: Pulling from argoproj/argocd
Digest: sha256:8576d347f30fa4c56a0129d1c0a0f5ed1e75662f0499f1ed7e917c405fd909dc
Status: Image is up to date for quay.io/argoproj/argocd:v2.9.2
quay.io/argoproj/argocd:v2.9.2
Then I load it to all nodes in the kind cluster (called argo).
❯ kind load docker-image "quay.io/argoproj/argocd:v2.9.2" --name argo
Image: "quay.io/argoproj/argocd:v2.9.2" with ID "sha256:8a4444aa1957ff41666caac0000f8a40c79b8df4a724b953ef00036fa32ee611" found to be already present on all nodes.
I can see the image on the worker node:
❯ docker exec -it argo-worker crictl images --no-trunc
IMAGE                                       TAG                  IMAGE ID                                                                  SIZE
docker.io/kindest/kindnetd                  v20230511-dc714da8   sha256:b0b1fa0f58c6e932b7f20bf208b2841317a1e8c88cc51b18358310bbd8ec95da   27.7MB
docker.io/kindest/local-path-helper         v20230510-486859a6   sha256:be300acfc86223548b4949398f964389b7309dfcfdcfc89125286359abb86956   3.05MB
docker.io/kindest/local-path-provisioner    v20230511-dc714da8   sha256:ce18e076e9d4b4283a79ef706170486225475fc4d64253710d94780fb6ec7627   19.4MB
quay.io/argoproj/argocd                     v2.9.2               sha256:8a4444aa1957ff41666caac0000f8a40c79b8df4a724b953ef00036fa32ee611   433MB
registry.k8s.io/coredns/coredns             v1.10.1              sha256:ead0a4a53df89fd173874b46093b6e62d8c72967bbf606d672c9e8c9b601a4fc   16.2MB
registry.k8s.io/etcd                        3.5.7-0              sha256:86b6af7dd652c1b38118be1c338e9354b33469e69a218f7e290a0ca5304ad681   102MB
registry.k8s.io/kube-apiserver              v1.27.3              sha256:c604ff157f0cff86bfa45c67c76c949deaf48d8d68560fc4c456a319af5fd8fa   83.5MB
registry.k8s.io/kube-controller-manager     v1.27.3              sha256:9f8f3a9f3e8a9706694dd6d7a62abd1590034454974c31cd0e21c85cf2d3a1d5   74.4MB
registry.k8s.io/kube-proxy                  v1.27.3              sha256:9d5429f6d7697ae3186f049e142875ba5854f674dfee916fa6c53da276808a23   72.7MB
registry.k8s.io/kube-scheduler              v1.27.3              sha256:205a4d549b94d37cc0e39e08cbf8871ffe2d7e7cbb6832e26713cd69ea1e2c4f   59.8MB
registry.k8s.io/pause                       3.7                  sha256:221177c6082a88ea4f6240ab2450d540955ac6f4d5454f0e15751b653ebda165   311kB
However, when I try and pull the image I get the following error:
❯ docker exec -it argo-worker crictl pull quay.io/argoproj/argocd:v2.9.2
E1127 22:52:04.362129 1554 remote_image.go:167] "PullImage from image service failed" err="rpc error: code = DeadlineExceeded desc = failed to pull and unpack image \"quay.io/argoproj/argocd:v2.9.2\": failed to resolve reference \"quay.io/argoproj/argocd:v2.9.2\": failed to do request: Head \"https://quay.io/v2/argoproj/argocd/manifests/v2.9.2\": dial tcp 52.206.59.27:443: i/o timeout" image="quay.io/argoproj/argocd:v2.9.2"
FATA[0030] pulling image: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "quay.io/argoproj/argocd:v2.9.2": failed to resolve reference "quay.io/argoproj/argocd:v2.9.2": failed to do request: Head "https://quay.io/v2/argoproj/argocd/manifests/v2.9.2": dial tcp 52.206.59.27:443: i/o timeout
I suspect that the lookup in the local registry is broken: the error message leads me to believe the lookup failed locally, so it's falling back to pulling the image from the remote image repo.
My questions are as follows:
How do I proceed here?
kind version 0.20.0