rancher / elemental

Elemental is a software stack enabling centralized, full cloud-native OS management with Kubernetes.
https://elemental.docs.rancher.com/
Apache License 2.0

e2e: Adapt Cypress code to latest dev Rancher (2.7.2) #605

Closed · juadk closed this 1 year ago

juadk commented 1 year ago

The CI has been failing since yesterday because of a missing element: Cypress does not see the xterm cursor.

[screenshot]

Failure: https://github.com/rancher/elemental/actions/runs/3835177495/jobs/6531093559

juadk commented 1 year ago

OK, I think it is caused by the new Rancher Manager... Rancher Manager version: docker.io/rancher/rancher:v2.7.2-rc1

juadk commented 1 year ago

I just fixed the issue with .xterm-cursor-layer, and there is another issue right after, when we try to open a socket to execute a kubectl command.

[screenshot]

juadk commented 1 year ago

And after another try, it even went further but eventually failed: https://github.com/rancher/elemental/actions/runs/3846364790. TBH, I think Rancher 2.7.2-rc1 is not stable enough to test Elemental.

juadk commented 1 year ago

The same test works perfectly with the latest stable Rancher (2.7.0): https://github.com/rancher/elemental/actions/runs/3847135559/jobs/6553244147

juadk commented 1 year ago

OK, I have an issue when I upgrade an Elemental node; it looks like a DNS issue. Anyway, the cluster-agent is in a very bad state:

cattle-system         cattle-cluster-agent-6bbb68cdf9-hcfp5                             1/1     Running     73 (62s ago)   58m

Logs:

m-bc43d202-906b-4eea-b515-66b6ae62b143:~ # kubectl logs cattle-cluster-agent-6996cb4568-dzflf -n cattle-system
dig: couldn't get address for 'resolver1.opendns.com': failure
INFO: Environment: CATTLE_ADDRESS=10.42.0.26 CATTLE_CA_CHECKSUM=f54f4e5487f67c0dfc32b80d5df207a65a3a2c46d75cb2e7de2d8ceab8c3fe5f CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.26.178:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.26.178:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.26.178 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.26.178:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.26.178 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.26.178 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=bb2cb535-b886-4589-9fdb-5a28746a1ddb CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-6996cb4568-dzflf CATTLE_SERVER=https://rancher2.adamek.ovh/ CATTLE_SERVER_VERSION=v2.7.2-rc1
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local nameserver 10.43.0.10 options ndots:5
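
The first log line suggests in-cluster name resolution is broken: the agent's entrypoint cannot even resolve resolver1.opendns.com. For reference, a minimal sketch of how one could confirm that from the downstream cluster; these commands are not taken from the issue, and the busybox image tag is an assumption:

# Resolve an external name from a throwaway pod using cluster DNS.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- nslookup resolver1.opendns.com
# Check that CoreDNS is running and look at its recent logs (k3s labels it k8s-app=kube-dns).
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50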

Some PRs are in progress, so we agreed with @aalves08 to wait for them to be merged before getting back to the debugging.

juadk commented 1 year ago

Even though I'm pushing things for our automated tests, I will first test the full stack manually with RM 2.7.2-rc1.

[screenshot]

The Stable CI is green; we have to fix the one with Latest.

juadk commented 1 year ago

The Latest run is still failing in the Upgrade test: https://github.com/rancher/elemental/actions/runs/3984636470/jobs/6831082377

The upgrade is triggered correctly, but the cluster status never comes back to Active...

[screenshot]

juadk commented 1 year ago

I managed to reproduce the issue manually; a simple restart is enough to trigger it:

m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3:~ # kubectl get pods -A
NAMESPACE             NAME                                         READY   STATUS      RESTARTS        AGE
cattle-fleet-system   fleet-agent-869fccfd44-j9d8x                 1/1     Running     1 (79m ago)     95m
cattle-system         cattle-cluster-agent-545cd48dfc-pk64t        0/1     Error       121 (64s ago)   95m
cattle-system         helm-operation-6j4cz                         0/2     Completed   0               95m
cattle-system         rancher-webhook-9b59dd58b-zz779              1/1     Running     1 (79m ago)     95m
cattle-system         system-upgrade-controller-79fc9c84b7-fpn45   1/1     Running     1 (79m ago)     95m
kube-system           coredns-7b5bbc6644-gjrbw                     1/1     Running     1 (79m ago)     96m
kube-system           helm-install-traefik-6xdcp                   0/1     Completed   1               96m
kube-system           helm-install-traefik-crd-xp2vc               0/1     Completed   0               96m
kube-system           local-path-provisioner-687d6d7765-xfdjb      1/1     Running     1 (79m ago)     96m
kube-system           metrics-server-84f8d4c4fc-ffmq7              1/1     Running     1 (79m ago)     96m
kube-system           svclb-traefik-29d816d6-xkk8s                 2/2     Running     2 (79m ago)     96m
kube-system           traefik-6b8f69d897-d2zpq                     1/1     Running     1 (79m ago)     96m

[screenshot]

Cluster agent pod log:

m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3:~ # kubectl logs -f cattle-cluster-agent-545cd48dfc-pk64t -n cattle-system
dig: couldn't get address for 'resolver1.opendns.com': failure
INFO: Environment: CATTLE_ADDRESS=10.42.0.20 CATTLE_CA_CHECKSUM=654b3cdcc64e2aec4eec29ea83f93891d5da423b863472b36d4211aa2ebd7867 CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.150.205:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.150.205:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.150.205 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.150.205:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.150.205 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.150.205 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=0ef5fbf5-af0d-40bf-8f64-b46fc79b05d7 CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-545cd48dfc-pk64t CATTLE_SERVER=https://rancher3.adamek.ovh CATTLE_SERVER_VERSION=v2.7.2-rc1
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local nameserver 10.43.0.10 options ndots:5
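
The agent log alone does not show why the container keeps exiting. A hedged sketch of how one could dig further (these commands are not from the issue; the pod name is the one listed above):

# Events and the last state (exit code, reason) of the crashing container.
kubectl -n cattle-system describe pod cattle-cluster-agent-545cd48dfc-pk64t
# Logs of the previous, crashed instance rather than the current one.
kubectl -n cattle-system logs cattle-cluster-agent-545cd48dfc-pk64t --previous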

Looks like k3s is crashing:

Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 k3s[17228]: W0123 15:56:20.441572   17228 dispatcher.go:153] Failed calling webhook, failing closed rancher.cattle.io.namespaces: failed calling webhook "rancher.cattle.io.namespaces": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.11:9443, code 503: 503 Service Unavailable
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 k3s[17228]: I0123 15:56:20.441683   17228 trace.go:205] Trace[1716781708]: "Create" url:/api/v1/namespaces,user-agent:k3s/v1.24.8+k3s1 (linux/amd64) kubernetes/648004e/service-controller,audit-id:c98afa47-0b80-4fa6-a895-5c939b61f82f,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (23-Jan-2023 15:56:17.387) (total time: 3054ms):
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 k3s[17228]: Trace[1716781708]: [3.054124477s] [3.054124477s] END
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 k3s[17228]: time="2023-01-23T15:56:20Z" level=fatal msg="Failed to register service-controller handlers: Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": proxy error from 127.0.0.1:6443 while dialing 10.42.0.11:9443, code 503: 503 Service Unavailable"
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 15:56:20 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: k3s.service: Failed with result 'exit-code'.

Jan 23 15:56:22 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: cri-containerd-b9d1d439663cb2eeb7926f553fb2149b047593e9429d4da4a40bb0a0a160a5f2.scope: Deactivated successfully.
Jan 23 15:56:25 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 140.
Jan 23 15:56:25 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: Stopped Lightweight Kubernetes.
Jan 23 15:56:25 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 systemd[1]: Starting Lightweight Kubernetes...
Jan 23 15:56:25 m-3f594f94-50a1-4fb3-a93b-df6bc26e2bd3 sh[17551]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
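
The fatal line shows k3s failing closed on the rancher.cattle.io.namespaces validating webhook because its backend at 10.42.0.11:9443 answers 503. For reference, a minimal sketch of how to check whether that webhook has a healthy backend; the commands are not from the issue, only the object names from the log are reused:

# Is the rancher-webhook pod running, and does its service have endpoints?
kubectl -n cattle-system get pods | grep rancher-webhook
kubectl -n cattle-system get endpoints rancher-webhook
# List the registered validating webhooks (the failing one is rancher.cattle.io.namespaces).
kubectl get validatingwebhookconfigurations
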
juadk commented 1 year ago

I just tried to downgrade my Rancher Manager to 2.7.0 and it worked well.
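
For context, a hedged sketch of what such a downgrade typically looks like when Rancher Manager is installed with Helm; the release name, chart repository alias, and flags are assumptions, not the exact commands used here:

# Roll the rancher release back to 2.7.0, keeping the existing values (repo alias assumed).
helm upgrade rancher rancher-latest/rancher --namespace cattle-system --version 2.7.0 --reuse-values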

I will check if I hit the same issue with RKE2.

juadk commented 1 year ago

We don't have the same issue with RKE2! The RKE2 test passed: https://github.com/rancher/elemental/actions/runs/3994321457