kubermatic / operating-system-manager

Operating System Manager is responsible for creating and managing the configurations that are needed to configure worker nodes
Apache License 2.0

fix(controller): handle error for invalid osp #396

Closed oliverbaehler closed 4 months ago

oliverbaehler commented 4 months ago

What this PR does / why we need it:

I have created my own CustomOperatingSystemProfile and attempted to use it in my clusters. However, whenever a MachineDeployment was associated with the profile, the OSM controller started crashing because of an uncaught nil pointer dereference:

$ kubectl logs -f operating-system-manager-f849789b8-hgk7w -n cluster-n57gh8qjcd
Defaulted container "operating-system-manager" out of: operating-system-manager, copy-http-prober (init)
{"level":"info","time":"2024-06-10T17:43:43.487Z","logger":"http-prober","caller":"http-prober/main.go:137","msg":"Probing","attempt":1,"max-attempts":100,"target":"https://apiserver-external.cluster-n57gh8qjcd.svc.cluster.local./healthz"}
{"level":"info","time":"2024-06-10T17:43:43.491Z","logger":"http-prober","caller":"http-prober/main.go:126","msg":"Hostname resolved","hostname":"apiserver-external.cluster-n57gh8qjcd.svc.cluster.local.","address":"10.100.3.234:443"}
{"level":"info","time":"2024-06-10T17:43:43.494Z","logger":"http-prober","caller":"http-prober/main.go:150","msg":"Endpoint is available"}
{"level":"info","time":"2024-06-10T17:43:43.518Z","caller":"osm-controller/main.go:309","msg":"starting manager"}
{"level":"info","time":"2024-06-10T17:43:43.519Z","logger":"controller-runtime.metrics","caller":"manager/runnable_group.go:223","msg":"Starting metrics server"}
{"level":"info","time":"2024-06-10T17:43:43.519Z","logger":"controller-runtime.metrics","caller":"manager/runnable_group.go:223","msg":"Serving metrics server","bindAddress":"0.0.0.0:8080","secure":false}
{"level":"info","time":"2024-06-10T17:43:43.519Z","caller":"manager/runnable_group.go:223","msg":"starting server","kind":"health probe","addr":"[::]:8085"}
I0610 17:43:43.519278       1 leaderelection.go:250] attempting to acquire leader lease kube-system/operating-system-manager...
I0610 17:45:13.387812       1 leaderelection.go:260] successfully acquired lease kube-system/operating-system-manager
{"level":"info","time":"2024-06-10T17:45:13.388Z","caller":"controller/controller.go:234","msg":"Starting EventSource","controller":"operating-system-config-controller","source":"kind source: *v1alpha1.MachineDeployment"}
{"level":"info","time":"2024-06-10T17:45:13.388Z","caller":"controller/controller.go:234","msg":"Starting EventSource","controller":"OperatingSystemProfileController","source":"kind source: *v1.Deployment"}
{"level":"info","time":"2024-06-10T17:45:13.388Z","caller":"controller/controller.go:234","msg":"Starting Controller","controller":"operating-system-config-controller"}
{"level":"info","time":"2024-06-10T17:45:13.388Z","caller":"controller/controller.go:234","msg":"Starting Controller","controller":"OperatingSystemProfileController"}
{"level":"info","time":"2024-06-10T17:45:13.490Z","caller":"controller/controller.go:234","msg":"Starting workers","controller":"operating-system-config-controller","worker count":10}
{"level":"info","time":"2024-06-10T17:45:13.491Z","caller":"osc/osc_controller.go:138","msg":"Reconciling OSC resource..","request":"kube-system/practical-blackwell"}
{"level":"info","time":"2024-06-10T17:45:13.593Z","caller":"controller/controller.go:234","msg":"Starting workers","controller":"OperatingSystemProfileController","worker count":10}
{"level":"info","time":"2024-06-10T17:45:13.594Z","caller":"osp/osp_controller.go:105","msg":"Reconciling default OSP resource.."}
{"level":"info","time":"2024-06-10T17:45:13.594Z","caller":"osp/osp_controller.go:105","msg":"Reconciling default OSP resource.."}
{"level":"info","time":"2024-06-10T17:45:13.594Z","caller":"osp/osp_controller.go:105","msg":"Reconciling default OSP resource.."}
{"level":"info","time":"2024-06-10T17:45:13.594Z","caller":"osp/osp_controller.go:105","msg":"Reconciling default OSP resource.."}
{"level":"info","time":"2024-06-10T17:45:13.690Z","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-flatcar"}
{"level":"info","time":"2024-06-10T17:45:13.707Z","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-amzn2"}
{"level":"info","time":"2024-06-10T17:45:13.797Z","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-rockylinux"}
{"level":"info","time":"2024-06-10T17:45:13.899Z","caller":"runtime/panic.go:770","msg":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"operating-system-config-controller","object":{"name":"practical-blackwell","namespace":"kube-system"},"namespace":"kube-system","name":"practical-blackwell","reconcileID":"2b09108a-c84b-4114-b88e-5aad7b00b559"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x130 pc=0x15f6c40]

goroutine 305 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x177e000?, 0x28da590?})
    runtime/panic.go:770 +0x132
k8c.io/operating-system-manager/pkg/controllers/osc.(*Reconciler).reconcileOperatingSystemConfigs(0xc00034e680, {0x1c665b8, 0xc000711bf0}, 0xc00035f408)
    k8c.io/operating-system-manager/pkg/controllers/osc/osc_controller.go:272 +0x8e0
k8c.io/operating-system-manager/pkg/controllers/osc.(*Reconciler).reconcile(0xc00034e680, {0x1c665b8, 0xc000711bf0}, 0xc00035f408)
    k8c.io/operating-system-manager/pkg/controllers/osc/osc_controller.go:184 +0xdd
k8c.io/operating-system-manager/pkg/controllers/osc.(*Reconciler).Reconcile(0xc00034e680, {0x1c665b8, 0xc000711bf0}, {{{0xc000a02180, 0xb}, {0xc00079e5a0, 0x13}}})
    k8c.io/operating-system-manager/pkg/controllers/osc/osc_controller.go:166 +0x405
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1c6a578?, {0x1c665b8?, 0xc000711bf0?}, {{{0xc000a02180?, 0xb?}, {0xc00079e5a0?, 0x0?}}})
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00072b680, {0x1c665f0, 0xc0005df630}, {0x17fe720, 0xc000020860})
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316 +0x3bc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00072b680, {0x1c665f0, 0xc0005df630})
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266 +0x1be
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 278
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:223 +0x50c

And the controller pod ended up in CrashLoopBackOff:

operating-system-manager-f849789b8-hgk7w             0/1     CrashLoopBackOff   233 (3m3s ago)   20h

This makes it impossible to lifecycle other nodes in that cluster. This change catches the failure and returns a proper error instead of panicking. With that in place, it also becomes immediately clear that I am just too dumb to write profiles:

...
vel":"info","time":"2024-06-10T19:45:31.402+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-rockylinux"}
{"level":"info","time":"2024-06-10T19:45:31.403+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-rhel"}
{"level":"info","time":"2024-06-10T19:45:31.492+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-ubuntu"}
{"level":"info","time":"2024-06-10T19:45:31.492+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-amzn2"}
{"level":"info","time":"2024-06-10T19:45:31.593+0200","caller":"reconciling/ensure.go:165","msg":"updated resource","kind":"v1alpha1.OperatingSystemProfile","namespace":"kube-system","name":"osp-flatcar"}
{"level":"error","time":"2024-06-10T19:45:31.648+0200","caller":"osc/osc_controller.go:167","msg":"Reconciling failed","error":"failed to reconcile operating system config: failed to generate OSC: failed to render bootstrapping file templates: failed to populate OSP file template: failed to parse OSP file [/opt/bin/node-start.sh] template: template: /opt/bin/node-start.sh:3: unexpected \"\\\\\" in template clause"}
{"level":"error","time":"2024-06-10T19:45:31.648+0200","caller":"controller/controller.go:261","msg":"Reconciler error","controller":"operating-system-config-controller","controllerGroup":"cluster.k8s.io","controllerKind":"MachineDeployment","MachineDeployment":{"name":"practical-blackwell","namespace":"kube-system"},"namespace":"kube-system","name":"practical-blackwell","reconcileID":"b6c9d8c9-e19f-4949-b94b-a308ea27cbf8","error":"failed to reconcile operating system config: failed to generate OSC: failed to render bootstrapping file templates: failed to populate OSP file template: failed to parse OSP file [/opt/bin/node-start.sh] template: template: /opt/bin/node-start.sh:3: unexpected \"\\\\\" in template clause"}

However, since there is no upfront validation for the profile (or anything like that), and since it is replicated directly to all seed clusters, where it may cause partial degradation of Kubermatic components (OSM), we should probably just handle that error.
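For illustration, here is a minimal, self-contained Go sketch of the pattern this change moves toward: wrap template parse and render failures and return them to the caller, so the reconciler logs an error and requeues instead of dereferencing a nil result and panicking. The function and names below are hypothetical, not the actual OSM code:

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderProfileFile parses and renders a single OSP file template.
// Any parse or render failure is wrapped and returned to the caller
// (the reconciler), which can log it and requeue, instead of the
// controller dereferencing a nil result and panicking.
// Hypothetical sketch; not the actual OSM implementation.
func renderProfileFile(path, content string, data any) (string, error) {
	tmpl, err := template.New(path).Parse(content)
	if err != nil {
		return "", fmt.Errorf("failed to parse OSP file [%s] template: %w", path, err)
	}

	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		return "", fmt.Errorf("failed to render OSP file [%s] template: %w", path, err)
	}
	return out.String(), nil
}

func main() {
	// An invalid template, loosely modeled on the broken clause in this report.
	_, err := renderProfileFile("/opt/bin/node-start.sh", `{{ if \ }}`, nil)
	fmt.Println(err) // prints a readable parse error instead of a SIGSEGV
}
```

With an error path like this, an invalid profile surfaces as a `Reconciler error` log line (as in the output above) and the MachineDeployment is simply requeued, rather than taking down the whole controller.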

Which issue(s) this PR fixes:

Fixes #

What type of PR is this?

Special notes for your reviewer:

Does this PR introduce a user-facing change? Then add your Release Note here:

NONE

Documentation:

NONE
kubermatic-bot commented 4 months ago

Hi @oliverbaehler. Thanks for your PR.

I'm waiting for a kubermatic member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
kubermatic-bot commented 4 months ago

LGTM label has been added.

Git tree hash: 07334420f5313de51ec7b2d78a195fb8a93d8442

kubermatic-bot commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahmedwaleedmalik

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubermatic/operating-system-manager/blob/main/OWNERS)~~ [ahmedwaleedmalik]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment
ahmedwaleedmalik commented 4 months ago

/cherry-pick release/v1.5

kubermatic-bot commented 4 months ago

@ahmedwaleedmalik: new pull request created: #397

In response to [this](https://github.com/kubermatic/operating-system-manager/pull/396#issuecomment-2160014302):

> /cherry-pick release/v1.5

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.