rancher / elemental-operator

The Elemental operator is responsible for managing the OS versions and maintaining a machine inventory to assist with edge or baremetal installations.

Issue when upgrading elemental OS from 2.0.2 to 2.0.4 #798

Closed. juadk closed this issue 1 month ago.

juadk commented 1 month ago

I hit this bug while validating the elemental-operator 1.5.4 chart in the Rancher 2.9.0-alpha7 marketplace.

Main error message in the UI after creating the upgrade group (screenshot):

ErrApplied(1) [Cluster fleet-default/mycluster: unable to build kubernetes objects from release manifest: resource mapping not found for name: "os-upgrader-myupgrade" namespace: "cattle-system" from "": no matches for kind "Plan" in version "upgrade.cattle.io/v1" ensure CRDs are installed first]
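
For anyone triaging this: the error means the downstream cluster is missing the Plan CRD that ships with the System Upgrade Controller. A quick check on the downstream cluster, sketched with plain kubectl (plans.upgrade.cattle.io is the CRD name SUC registers):

kubectl get crd plans.upgrade.cattle.io
kubectl api-resources --api-group=upgrade.cattle.io

If both come back empty, any Plan manifest will fail exactly as above.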

Env

RM 2.9.0-alpha7 hosted in Digital Ocean
Elemental nodes are physical NUC systems at home
Elemental UI dev version: 1.3.1-rc7

How I reproduce it

I want to try reproducing it with Rancher 2.9-head and Rancher 2.8.5 stable; I will do so and report the outputs here.

Additional logs

From fleet-controller:

{"level":"error","ts":"2024-07-18T07:02:09Z","logger":"controller-runtime.source.EventHandler","msg":"if kind is a CRD, it should be installed before calling Start","kind":"ImageScan.fleet.cattle.io","error":"no matches for kind \"ImageScan\" in version \"fleet.cattle.io/v1alpha1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/source/kind.go:63\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.29.4/pkg/util/wait/loop.go:87\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.29.4/pkg/util/wait/loop.go:88\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.29.4/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/source/kind.go:56"}
time="2024-07-18T07:02:20Z" level=info msg="Skipping bundledeployment with empty namespace \"fleet-agent-local\""
time="2024-07-18T07:02:20Z" level=info msg="Skipping bundledeployment with empty namespace \"fleet-agent-local\""

Bundle:

root@elemental-ja-wlfvts-rancher-server:~# kubectl get bundle -A
NAMESPACE       NAME                             BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local     fleet-agent-local                1/1
fleet-default   mycluster-managed-system-agent   0/1                       ErrApplied(1) [Cluster fleet-default/mycluster: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "system-agent-upgrader" namespace: "cattle-system" from "": no matches for kind "Plan" in version "upgrade.cattle.io/v1"...
fleet-default   fleet-agent-mycluster            1/1
fleet-default   mos-myupgrade                    0/1                       ErrApplied(1) [Cluster fleet-default/mycluster: unable to build kubernetes objects from release manifest: resource mapping not found for name: "os-upgrader-myupgrade" namespace: "cattle-system" from "": no matches for kind "Plan" in version "upgrade.cattle.io/v1"...
root@elemental-ja-wlfvts-rancher-server:~# kubectl get bundledeployment -A
NAMESPACE                                      NAME                             DEPLOYED                                                                                                                                                                                                                             MONITORED   STATUS
cluster-fleet-local-local-1a3d67d0a899         fleet-agent-local                True                                                                                                                                                                                                                                 True
cluster-fleet-default-mycluster-2faea9154081   mycluster-managed-system-agent   Error: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "system-agent-upgrader" namespace: "cattle-system" from "": no matches for kind "Plan" in version "upgrade.cattle.io/v1"...
cluster-fleet-default-mycluster-2faea9154081   fleet-agent-mycluster            True                                                                                                                                                                                                                                 True
cluster-fleet-default-mycluster-2faea9154081   mos-myupgrade                    Error: unable to build kubernetes objects from release manifest: resource mapping not found for name: "os-upgrader-myupgrade" namespace: "cattle-system" from "": no matches for kind "Plan" in version "upgrade.cattle.io/v1"...
juadk commented 1 month ago

Attempts with other versions:

juadk commented 1 month ago

The same issue occurs on GCP and DO, with k3s 1.30 or k3s 1.27 for the downstream cluster. I wonder whether the issue is that Rancher Manager cannot reach the VMs in the libvirt network.

juadk commented 1 month ago

I also reproduced the issue with Rancher 2.9-head (with OS versions 2.0.2 and 2.0.4).

juadk commented 1 month ago

Yesterday I was not able to reproduce the bug with the following stack:

RM 2.8.5 in Digital Ocean
Operator 1.5.3 from the marketplace
Elemental UI 1.3.0
juadk commented 1 month ago

I know what the root cause is... and I understand why we do not see it in CI. It comes down to this line:

machineName: ${System Data/Runtime/Hostname}

In CI, Elemental nodes are named node01, node02, node03 (or something like that), but when I test with my NUC at home I remove the machineName line. I then get a very long generated name starting with m-, and it looks like that name is what prevents system-upgrade-controller from being deployed. I just tried at home with machineName: node01 and now everything works as expected (see the MachineRegistration sketch after the pod listing below). Inside my downstream Elemental node (NUC):

node01:~ # kubectl get pods -A
NAMESPACE             NAME                                                              READY   STATUS      RESTARTS   AGE
cattle-fleet-system   fleet-agent-0                                                     2/2     Running     0          2m14s
cattle-system         apply-system-agent-upgrader-on-node01-with-dd6ea7d86da501-n7bb7   0/1     Completed   0          75s
cattle-system         apply-system-agent-upgrader-on-node02-with-dd6ea7d86da501-rgxbt   0/1     Completed   0          114s
cattle-system         apply-system-agent-upgrader-on-node03-with-dd6ea7d86da501-g7rt2   0/1     Completed   0          2m1s
cattle-system         cattle-cluster-agent-85bf7b64c8-9n29c                             1/1     Running     0          74s
cattle-system         cattle-cluster-agent-85bf7b64c8-gfzfn                             1/1     Running     0          113s
cattle-system         helm-operation-gq5cx                                              1/2     NotReady    0          49s
cattle-system         rancher-webhook-6bc575f5d4-vzpzz                                  1/1     Running     0          39s
cattle-system         system-upgrade-controller-795b6fdf6-bm9zh                         1/1     Running     0          2m7s
kube-system           coredns-6799fbcd5-bsfz6                                           1/1     Running     0          3m23s
kube-system           helm-install-traefik-crd-96vjj                                    0/1     Completed   1          75s
kube-system           helm-install-traefik-vf4jx                                        0/1     Completed   1          75s
kube-system           local-path-provisioner-6f5d79df6-hwscq                            1/1     Running     0          3m23s
kube-system           metrics-server-54fd9b65b-4t4t2                                    1/1     Running     0          3m23s
kube-system           svclb-traefik-034a9b7f-64qrl                                      2/2     Running     0          114s
kube-system           svclb-traefik-034a9b7f-bkqw6                                      2/2     Running     0          74s
kube-system           svclb-traefik-034a9b7f-bsx97                                      2/2     Running     0          3m13s
kube-system           traefik-7d5f6474df-kkt4g                                          1/1     Running     0          3m13s
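
For context, here is where that machineName line lives: a minimal MachineRegistration sketch, assuming the stock elemental.cattle.io/v1beta1 schema and a hypothetical registration named my-nodes (the template is the one quoted above):

apiVersion: elemental.cattle.io/v1beta1
kind: MachineRegistration
metadata:
  name: my-nodes
  namespace: fleet-default
spec:
  # Without machineName the operator falls back to a generated m-<uuid>
  # name, which is the long name that correlated with the failed SUC deployment.
  machineName: ${System Data/Runtime/Hostname}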
juadk commented 1 month ago

Well, I tried to reproduce it twice and it worked with the default machine name (screenshot attached).

It looks like a race condition; I will try again tomorrow.

juadk commented 1 month ago

Yesterday I could not reproduce it with the latest Rancher 2.9.0-rc4, so we do not know what the issue is/was. Let's keep this issue under monitoring for a few weeks.

fgiudici commented 1 month ago

Additional test results, all using Rancher 2.9.0-alpha7:

  1. no issue with a libvirt-installed Rancher (tried with child cluster machines both in the same network and in a remote network behind NAT)
  2. issue spotted with Rancher installed on a hyperscaler and remote machines for the child clusters
  3. no issue after some time, with Rancher installed on a hyperscaler and remote machines for the child clusters

Investigating case 2, the issue was that the System Upgrade Controller (SUC) was not installed by Rancher provisioning, which broke OS upgrades. That seems solved now, as verified in scenario 3 (and in all of Julien's most recent tests): SUC is correctly provisioned (a quick check is sketched below).
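
A quick way to confirm SUC landed on the downstream cluster (plain kubectl; the deployment name and namespace match the pod listing earlier in this thread):

kubectl -n cattle-system get deployment system-upgrade-controller
kubectl -n cattle-system rollout status deployment/system-upgrade-controller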

Tested the OS upgrade in scenario 3: it worked as expected. So the issue now looks solved: likely a transient bug in a component of an in-development Rancher version, since fixed. Closing.