Closed juadk closed 1 year ago
OK, I think this was introduced by the new Rancher Manager...
Rancher Manager Version: docker.io/rancher/rancher:v2.7.2-rc1
I just fixed the issue with .xterm-cursor-layer,
but another issue shows up right after it, when we try to open a socket to execute a kubectl
command.
After another try, the run went even further but eventually failed.
https://github.com/rancher/elemental/actions/runs/3846364790
TBH, I think Rancher 2.7.2 rc1 is not stable enough to test Elemental.
The same test works perfectly with latest stable Rancher (2.7.0) https://github.com/rancher/elemental/actions/runs/3847135559/jobs/6553244147
OK, I have an issue when I upgrade an Elemental node; it looks like a DNS issue. In any case, cluster-agent is in a very bad state:
cattle-system cattle-cluster-agent-6bbb68cdf9-hcfp5 1/1 Running 73 (62s ago) 58m
Logs:
m-bc43d202-906b-4eea-b515-66b6ae62b143:~ # kubectl logs cattle-cluster-agent-6996cb4568-dzflf -n cattle-system
dig: couldn't get address for 'resolver1.opendns.com': failure
INFO: Environment: CATTLE_ADDRESS=10.42.0.26 CATTLE_CA_CHECKSUM=f54f4e5487f67c0dfc32b80d5df207a65a3a2c46d75cb2e7de2d8ceab8c3fe5f CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.26.178:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.26.178:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.26.178 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.26.178:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.26.178 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.26.178 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=bb2cb535-b886-4589-9fdb-5a28746a1ddb CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-6996cb4568-dzflf CATTLE_SERVER=https://rancher2.adamek.ovh/ CATTLE_SERVER_VERSION=v2.7.2-rc1
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local nameserver 10.43.0.10 options ndots:5
Some PRs are in progress, so @aalves08 and I agreed to wait for them to be merged before getting back to the debugging.
Even if I'm pushing things for our automated tests, I will test the full stack manually first with RM 2.7.2 rc1.
Stable CI is green; we still have to fix the one with Latest.
Latest run is still failing in Upgrade https://github.com/rancher/elemental/actions/runs/3984636470/jobs/6831082377
Upgrade is triggered correctly but cluster status never comes back active...
I managed to reproduce the issue manually; a simple restart is enough to trigger it:
m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3:~ # kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-fleet-system fleet-agent-869fccfd44-j9d8x 1/1 Running 1 (79m ago) 95m
cattle-system cattle-cluster-agent-545cd48dfc-pk64t 0/1 Error 121 (64s ago) 95m
cattle-system helm-operation-6j4cz 0/2 Completed 0 95m
cattle-system rancher-webhook-9b59dd58b-zz779 1/1 Running 1 (79m ago) 95m
cattle-system system-upgrade-controller-79fc9c84b7-fpn45 1/1 Running 1 (79m ago) 95m
kube-system coredns-7b5bbc6644-gjrbw 1/1 Running 1 (79m ago) 96m
kube-system helm-install-traefik-6xdcp 0/1 Completed 1 96m
kube-system helm-install-traefik-crd-xp2vc 0/1 Completed 0 96m
kube-system local-path-provisioner-687d6d7765-xfdjb 1/1 Running 1 (79m ago) 96m
kube-system metrics-server-84f8d4c4fc-ffmq7 1/1 Running 1 (79m ago) 96m
kube-system svclb-traefik-29d816d6-xkk8s 2/2 Running 2 (79m ago) 96m
kube-system traefik-6b8f69d897-d2zpq 1/1 Running 1 (79m ago) 96m
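For anyone triaging a similar run, here is a minimal sketch (not part of the Elemental CI) that flags suspicious pods in pasted `kubectl get pods -A` output. It assumes the default column layout shown above; the function name `flag_unhealthy` and the `max_restarts` threshold are hypothetical choices for illustration.

```python
# Sketch only: parses pasted `kubectl get pods -A` text, assuming the default
# columns NAMESPACE NAME READY STATUS RESTARTS [(last restart)] AGE.
# `flag_unhealthy` and `max_restarts` are hypothetical names, not a Rancher API.

SAMPLE = """\
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-fleet-system fleet-agent-869fccfd44-j9d8x 1/1 Running 1 (79m ago) 95m
cattle-system cattle-cluster-agent-545cd48dfc-pk64t 0/1 Error 121 (64s ago) 95m
cattle-system helm-operation-6j4cz 0/2 Completed 0 95m
kube-system coredns-7b5bbc6644-gjrbw 1/1 Running 1 (79m ago) 96m
"""


def flag_unhealthy(kubectl_output, max_restarts=10):
    """Return (namespace, name, status, restarts) for pods that look unhealthy."""
    flagged = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a pod row.
        if len(fields) < 5 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _ready, status = fields[:4]
        restarts = int(fields[4])
        # A pod is suspicious if it is neither Running nor Completed,
        # or if it keeps restarting even though it currently reports Running.
        if status not in ("Running", "Completed") or restarts > max_restarts:
            flagged.append((namespace, name, status, restarts))
    return flagged


if __name__ == "__main__":
    for pod in flag_unhealthy(SAMPLE):
        print(pod)
```

With the default threshold this would also catch the earlier case where the agent reported `Running` but had already restarted 73 times.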
Cluster agent pod log:
m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3:~ # kubectl logs -f cattle-cluster-agent-545cd48dfc-pk64t -n cattle-system
dig: couldn't get address for 'resolver1.opendns.com': failure
INFO: Environment: CATTLE_ADDRESS=10.42.0.20 CATTLE_CA_CHECKSUM=654b3cdcc64e2aec4eec29ea83f93891d5da423b863472b36d4211aa2ebd7867 CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.150.205:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.150.205:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.150.205 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.150.205:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.150.205 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.150.205 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=0ef5fbf5-af0d-40bf-8f64-b46fc79b05d7 CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-545cd48dfc-pk64t CATTLE_SERVER=https://rancher3.adamek.ovh CATTLE_SERVER_VERSION=v2.7.2-rc1
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local nameserver 10.43.0.10 options ndots:5
Looks like k3s is crashing:
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 k3s[17228]: W0123 15:56:20.441572 17228 dispatcher.go:153] Failed calling webhook, failing closed rancher.cattle.io.namespaces: failed calling webhook "rancher.cattle.io.namespaces": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.11:9443, code 503: 503 Service Unavailable
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 k3s[17228]: I0123 15:56:20.441683 17228 trace.go:205] Trace[1716781708]: "Create" url:/api/v1/namespaces,user-agent:k3s/v1.24.8+k3s1 (linux/amd64) kubernetes/648004e/service-controller,audit-id:c98afa47-0b80-4fa6-a895-5c939b61f82f,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (23-Jan-2023 15:56:17.387) (total time: 3054ms):
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 k3s[17228]: Trace[1716781708]: [3.054124477s] [3.054124477s] END
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 k3s[17228]: time="2023-01-23T15:56:20Z" level=fatal msg="Failed to register service-controller handlers: Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": proxy error from 127.0.0.1:6443 while dialing 10.42.0.11:9443, code 503: 503 Service Unavailable"
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: k3s.service: Failed with result 'exit-code'.
Jan 23 15:56:22 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: cri-containerd-b9d1d439663cb2eeb7926f553fb2149b047593e9429d4da4a40bb0a0a160a5f2.scope: Deactivated successfully.
Jan 23 15:56:25 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 140.
Jan 23 15:56:25 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: Stopped Lightweight Kubernetes.
Jan 23 15:56:25 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: Starting Lightweight Kubernetes...
Jan 23 15:56:25 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 sh[17551]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
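To make this kind of crash loop easier to spot in a long journal, here is a rough sketch that pulls out the systemd restart counter and the name of the webhook behind each fatal k3s message. It assumes the line formats seen in the excerpt above; the `summarize` helper and the abridged sample lines are illustrative only.

```python
import re

# Sketch only: the patterns below are derived from the journal excerpt in this
# thread, not from any documented k3s or systemd log format guarantee.
RESTART_RE = re.compile(r"restart counter is at (\d+)")
# The fatal message embeds the webhook name in escaped quotes (\"name\").
WEBHOOK_RE = re.compile(r'level=fatal.*?failed calling webhook \\?"([^"\\]+)')

# Abridged copies of two lines from the log above (hostname and tail shortened).
SAMPLE = [
    r'Jan 23 15:56:20 host k3s[17228]: time="2023-01-23T15:56:20Z" level=fatal '
    r'msg="Failed to register service-controller handlers: Internal error occurred: '
    r'failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook"',
    "Jan 23 15:56:25 host systemd[1]: k3s.service: Scheduled restart job, "
    "restart counter is at 140.",
]


def summarize(lines):
    """Return the highest restart counter seen and the webhooks named in fatal lines."""
    restart_counter = 0
    webhooks = set()
    for line in lines:
        restart = RESTART_RE.search(line)
        if restart:
            restart_counter = max(restart_counter, int(restart.group(1)))
        webhook = WEBHOOK_RE.search(line)
        if webhook:
            webhooks.add(webhook.group(1))
    return {"restart_counter": restart_counter, "failing_webhooks": sorted(webhooks)}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On a real node the input could come from saved `journalctl -u k3s` output, read line by line.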
I just tried downgrading my Rancher Manager to 2.7.0 and it worked well.
I will check whether I hit the same issue with RKE2.
We don't have the same issue with RKE2! RKE2 test passed https://github.com/rancher/elemental/actions/runs/3994321457
The CI has been failing since yesterday because of a missing element: Cypress does not see the xterm cursor.
Failure: https://github.com/rancher/elemental/actions/runs/3835177495/jobs/6531093559