rancher / rke2

https://docs.rke2.io/
Apache License 2.0

[Backport release-1.23] Tarball airgap install fails on current RC #3114

Closed: rancherbot closed this issue 2 years ago

rancherbot commented 2 years ago

This is a backport issue for https://github.com/rancher/rke2/issues/3113, automatically created via rancherbot by @rancher-max

Original issue description:

Environmental Info:

RKE2 Version:

v1.24.2-rc1+rke2r1

Node(s) CPU architecture, OS, and Version:

ubuntu 20.04 LTS

Cluster Configuration:

1 server 1 agent

Describe the bug:

Cannot install rke2 in an airgapped environment using the tarball method

Steps To Reproduce:

Expected behavior:

rke2 should be running successfully and the cluster should be up and running

Actual behavior:

rke2 fails to run, giving a fatal error (see below)

Additional context / logs:

The log below was taken from a node that was also attempting to run with calico, so multiple tar files were present, but the same behavior occurred with the minimal reproduction steps.

$ sudo rke2 server --write-kubeconfig-mode 644 --cni=calico --debug
WARN[0000] not running in CIS mode                      
INFO[0000] Starting rke2 v1.24.2-rc1+rke2r1 (c943ed52e7fb92fb2da27608bb5848e34a807ac9) 
INFO[0000] Managed etcd cluster initializing            
INFO[0000] Starting etcd for new cluster                
INFO[0000] Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,rke2 --authorization-mode=Node,RBAC --bind-address=0.0.0.0 --cert-dir=/var/lib/rancher/rke2/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/rke2/server/tls/client-ca.crt --egress-selector-config-file=/var/lib/rancher/rke2/server/etc/egress-selector-config.yaml --enable-admission-plugins=NodeRestriction,PodSecurityPolicy --enable-aggregator-routing=true --encryption-provider-config=/var/lib/rancher/rke2/server/cred/encryption-config.json --etcd-cafile=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/rke2/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/rke2/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --feature-gates=JobTrackingWithFinalizers=true --kubelet-certificate-authority=/var/lib/rancher/rke2/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/rke2/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/rke2/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/rke2/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/var/lib/rancher/rke2/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/rke2/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/rke2/server/tls/serving-kube-apiserver.key 
INFO[0000] Running kube-scheduler --authentication-kubeconfig=/var/lib/rancher/rke2/server/cred/scheduler.kubeconfig --authorization-kubeconfig=/var/lib/rancher/rke2/server/cred/scheduler.kubeconfig --bind-address=127.0.0.1 --kubeconfig=/var/lib/rancher/rke2/server/cred/scheduler.kubeconfig --profiling=false --secure-port=10259 
INFO[0000] Running kube-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/var/lib/rancher/rke2/server/cred/controller.kubeconfig --authorization-kubeconfig=/var/lib/rancher/rke2/server/cred/controller.kubeconfig --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-kube-apiserver-client-cert-file=/var/lib/rancher/rke2/server/tls/client-ca.crt --cluster-signing-kube-apiserver-client-key-file=/var/lib/rancher/rke2/server/tls/client-ca.key --cluster-signing-kubelet-client-cert-file=/var/lib/rancher/rke2/server/tls/client-ca.crt --cluster-signing-kubelet-client-key-file=/var/lib/rancher/rke2/server/tls/client-ca.key --cluster-signing-kubelet-serving-cert-file=/var/lib/rancher/rke2/server/tls/server-ca.crt --cluster-signing-kubelet-serving-key-file=/var/lib/rancher/rke2/server/tls/server-ca.key --cluster-signing-legacy-unknown-cert-file=/var/lib/rancher/rke2/server/tls/server-ca.crt --cluster-signing-legacy-unknown-key-file=/var/lib/rancher/rke2/server/tls/server-ca.key --configure-cloud-routes=false --controllers=*,-service,-route,-cloud-node-lifecycle --feature-gates=JobTrackingWithFinalizers=true --kubeconfig=/var/lib/rancher/rke2/server/cred/controller.kubeconfig --profiling=false --root-ca-file=/var/lib/rancher/rke2/server/tls/server-ca.crt --secure-port=10257 --service-account-private-key-file=/var/lib/rancher/rke2/server/tls/service.key --use-service-account-credentials=true 
INFO[0000] Running cloud-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/var/lib/rancher/rke2/server/cred/cloud-controller.kubeconfig --authorization-kubeconfig=/var/lib/rancher/rke2/server/cred/cloud-controller.kubeconfig --bind-address=127.0.0.1 --cloud-provider=rke2 --cluster-cidr=10.42.0.0/16 --configure-cloud-routes=false --kubeconfig=/var/lib/rancher/rke2/server/cred/cloud-controller.kubeconfig --node-status-update-frequency=1m0s --profiling=false 
INFO[0000] Node token is available at /var/lib/rancher/rke2/server/token 
W0629 19:34:26.669529    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
INFO[0000] To join node to cluster: rke2 agent -s https://172.31.13.29:9345 -t ${NODE_TOKEN} 
INFO[0000] Tunnel server egress proxy mode: disabled    
INFO[0000] Wrote kubeconfig /etc/rancher/rke2/rke2.yaml 
INFO[0000] Run: rke2 kubectl                            
W0629 19:34:26.673214    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0629 19:34:26.673876    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock /run/k3s/containerd/containerd.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory". Reconnecting...
INFO[0000] Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory" 
INFO[0000] Cluster-Http-Server 2022/06/29 19:34:26 http: TLS handshake error from 127.0.0.1:44422: remote error: tls: bad certificate 
INFO[0000] Cluster-Http-Server 2022/06/29 19:34:26 http: TLS handshake error from 127.0.0.1:44428: remote error: tls: bad certificate 
DEBU[0000] Password verified locally for node 'ip-172-31-13-29' 
INFO[0000] certificate CN=ip-172-31-13-29 signed by CN=rke2-server-ca@1656527873: notBefore=2022-06-29 18:37:53 +0000 UTC notAfter=2023-06-29 19:34:26 +0000 UTC 
INFO[0000] certificate CN=system:node:ip-172-31-13-29,O=system:nodes signed by CN=rke2-client-ca@1656527873: notBefore=2022-06-29 18:37:53 +0000 UTC notAfter=2023-06-29 19:34:26 +0000 UTC 
INFO[0000] Module overlay was already loaded            
INFO[0000] Module nf_conntrack was already loaded       
INFO[0000] Module br_netfilter was already loaded       
INFO[0000] Module iptable_nat was already loaded        
DEBU[0000] getConntrackMax: using conntrack-min         
INFO[0000] Checking local image archives in /var/lib/rancher/rke2/agent/images for index.docker.io/rancher/rke2-runtime:v1.24.2-rc1-rke2r1 
W0629 19:34:27.675628    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0629 19:34:27.675886    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0629 19:34:27.736286    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock /run/k3s/containerd/containerd.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory". Reconnecting...
INFO[0001] Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory" 
W0629 19:34:29.308837    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0629 19:34:29.500468    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0629 19:34:29.780059    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock /run/k3s/containerd/containerd.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory". Reconnecting...
INFO[0003] Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory" 
INFO[0004] Cluster-Http-Server 2022/06/29 19:34:31 http: TLS handshake error from 172.31.11.21:55128: remote error: tls: bad certificate 
ERRO[0005] runtime core not ready                       
W0629 19:34:31.854530    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0629 19:34:32.489901    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0629 19:34:34.123563    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock /run/k3s/containerd/containerd.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory". Reconnecting...
INFO[0007] Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory" 
W0629 19:34:36.104690    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
INFO[0009] Cluster-Http-Server 2022/06/29 19:34:36 http: TLS handshake error from 172.31.11.21:55156: remote error: tls: bad certificate 
ERRO[0010] runtime core not ready                       
W0629 19:34:36.502778    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
{"level":"warn","ts":"2022-06-29T19:34:36.669Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000b52a80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
INFO[0010] Failed to test data store connection: context deadline exceeded 
INFO[0014] Cluster-Http-Server 2022/06/29 19:34:41 http: TLS handshake error from 172.31.11.21:55180: remote error: tls: bad certificate 
ERRO[0015] runtime core not ready                       
W0629 19:34:41.673055    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0629 19:34:42.288397    4959 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock /run/k3s/containerd/containerd.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory". Reconnecting...
INFO[0015] Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory" 
FATA[0015] failed to setup cri connection: timed out waiting for the condition 
$ sudo ls /var/lib/rancher/rke2/agent/images/
rke2-airgap-images-calico.tar.gz  rke2-airgap-images.tar.gz
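
For anyone triaging a similar failure, the two conditions visible in the log above (image archives present in the agent images directory, containerd socket never created) can be checked with a small shell helper. This is only a minimal sketch and not part of rke2: the function name `check_airgap_state` is hypothetical, and the default paths are simply the ones that appear in the log. It reports state only and does not diagnose the root cause.

```shell
#!/bin/sh
# Hypothetical helper (not an rke2 command): report the two conditions
# seen in the log above. Default paths are copied from that log.
check_airgap_state() {
    images_dir="${1:-/var/lib/rancher/rke2/agent/images}"
    cri_sock="${2:-/run/k3s/containerd/containerd.sock}"

    # Count regular files whose names contain ".tar" (e.g. *.tar.gz).
    # If nothing matches, the glob stays literal and the -f test fails.
    count=0
    for f in "$images_dir"/*.tar*; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "image archives in $images_dir: $count"

    # -S is true only if the path exists and is a socket.
    if [ -S "$cri_sock" ]; then
        echo "containerd socket present: $cri_sock"
    else
        echo "containerd socket missing: $cri_sock"
    fi
}

# Usage (likely needs root, as the report itself uses sudo for ls):
#   check_airgap_state
#   check_airgap_state /path/to/images /path/to/containerd.sock
```

On the node from this report, such a check would be expected to find two archives and a missing socket, matching the `ls` output and the repeated "no such file or directory" warnings.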

rancher-max commented 2 years ago

This has been validated as working using rc2.