openshift / windows-machine-config-operator

Windows MCO for OpenShift that handles addition of Windows nodes to the cluster
Apache License 2.0

Community Operator fails install in OKD-4.14 #1954

Closed kaolaaz163 closed 5 months ago

kaolaaz163 commented 11 months ago

Version

OKD

Cluster Version

4.14.0-0.okd-2023-11-14-101924

Platform

Platform agnostic (type=none)

Proxy

No

WMCO Version

6.0.0

Windows version

2019

What happened?

When installing WMCO through OperatorHub on an OKD 4.14 cluster, only WMCO version 6.0.0 is offered. After installing it, the WMCO pod reports the following error on startup.

failed to validate required cluster configuration {"error": "error validating k8s version: Unsupported server version: v1.27.1-3351+b49f9d1356bca4-dirty. Supported versions are v1.24.x to v1.25.x", "errorVerbose": "Unsupported server version: v1.27.1-3351+b49f9d1356bca4-dirty. Supported versions are v1.24.x to v1.25.x\ngithub.com/openshift/windows-machine-config-operator/pkg/cluster.(*config).validateK8sVersion\n\t/build/windows-machine-config-operator/pkg/cluster/config.go:141\ngithub.com/openshift/windows-machine-config-operator/pkg/cluster.(*config).Validate\n\t/build/windows-machine-config-operator/pkg/cluster/config.go:148\nmain.main\n\t/build/windows-machine-config-operator/cmd/operator/main.go:102\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571\nerror validating k8s version\ngithub.com/openshift/windows-machine-config-operator/pkg/cluster.(*config).Validate\n\t/build/windows-machine-config-operator/pkg/cluster/config.go:150\nmain.main\n\t/build/windows-machine-config-operator/cmd/operator/main.go:102\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"}
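
The error says the operator accepts only Kubernetes minors 24 and 25, while the cluster reports v1.27.1. A minimal sketch of that kind of range check follows. The function names `minor_of` and `is_supported` are hypothetical and this is not WMCO's code, which per the stack trace lives in Go in pkg/cluster/config.go. It only mirrors the comparison the message describes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating a minor-version range check.
# Not taken from WMCO; names and structure are assumptions.

# minor_of VERSION: print the minor component of a version string such as
# "v1.27.1-3351+b49f9d1356bca4-dirty" (prints 27 for that input).
minor_of() {
  local v="${1#v}"           # drop the leading "v"
  v="${v#*.}"                # drop the major component and its dot
  printf '%s\n' "${v%%.*}"   # keep everything before the next dot
}

# is_supported VERSION MIN_MINOR MAX_MINOR: succeed if the minor is in range.
is_supported() {
  local minor
  minor="$(minor_of "$1")"
  [ "$minor" -ge "$2" ] && [ "$minor" -le "$3" ]
}

# The server version and the supported range quoted in the error above.
if is_supported "v1.27.1-3351+b49f9d1356bca4-dirty" 24 25; then
  echo "supported"
else
  echo "unsupported"
fi
```

On a live cluster, `oc version` prints the server's Kubernetes version, which can be compared against the range in the error message before installing.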

What did you expect to happen?

WMCO starts and runs successfully.

Steps to reproduce the issue

Install OKD version 4.14 and deploy WMCO through OperatorHub

Do you have a workaround for this issue?

No response

trevor-dolby-at-ibm-com commented 11 months ago

@kaolaaz163 I hit the same issue with my 4.14.2 single-node cluster and was able to get the operator working by using the release-4.14 branch from this repo and deploying using the development instructions. I was trying to follow the guide at https://www.opensourcerers.org/2023/08/21/quick-start-to-smallest-openshift-cluster-for-windows-workload/ when I hit the issue you describe.

Approximate summary of steps on Ubuntu 22 WSL2 with Docker:

# Clone the release-4.14 branch together with its submodules
git clone --recurse-submodules -b release-4.14 https://github.com/openshift/windows-machine-config-operator.git
export OPENSHIFT_CI=false
export KUBE_SSH_KEY_PATH=/somewhere/windowskey
export OPERATOR_IMAGE=quay.io/some_user/wmco:test-4.14-1
cd windows-machine-config-operator
# The build scripts call podman; on a Docker-only host, expose docker under that name
sudo ln -s /usr/bin/docker /usr/local/bin/podman
make base-img
make wmco-img IMG="$OPERATOR_IMAGE"
# Push the image that was just built so the cluster can pull it
docker push "$OPERATOR_IMAGE"
hack/olm.sh run -k "$KUBE_SSH_KEY_PATH"

The scripts use podman, but docker works fine too, and the install is very easy once you have the right source level. I used the current release-4.14 branch; the exact git commit was:

commit 9456533c6d17c741e3f79c32613d4a1cdad6cf74 (HEAD -> release-4.14, origin/release-4.14)

The blog post linked above describes all the necessary steps, and I was very impressed by the ease of use for this operator, both in terms of running it and also building it. Thank you to the people who wrote up the developer docs!

mansikulkarni96 commented 11 months ago

@kaolaaz163 thanks for opening up the issue. Releasing a community 9.0.0/4.14 WMCO is in the pipeline.

kaolaaz163 commented 11 months ago

Thanks @trevor-dolby-at-ibm-com. Following your steps, WMCO installed successfully. However, I hit the following error when bootstrapping the Windows node, reporting that the windows-instance-config-daemon service could not be found. Can anyone help me take a look?

2023-12-05T07:27:56Z ERROR Reconciler error {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "ConfigMap": {"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"}, "namespace": "openshift-windows-machine-config-operator", "name": "windows-instances", "reconcileID": "4e0d7a24-64e3-4bfb-8aba-40463e363657", "error": "error configuring host with address 192.168.3.156: bootstrapping the Windows instance failed: unable to cleanup the Windows instance: error ensuring windows-instance-config-daemon Windows service is removed: error checking if windows-instance-config-daemon Windows service exists: error running sc.exe qc windows-instance-config-daemon: Process exited with status 1060"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226
2023-12-05T07:27:56Z DEBUG events error configuring host with address 192.168.3.156: bootstrapping the Windows instance failed: unable to cleanup the Windows instance: error ensuring windows-instance-config-daemon Windows service is removed: error checking if windows-instance-config-daemon Windows service exists: error running sc.exe qc windows-instance-config-daemon: Process exited with status 1060 {"type": "Warning", "object": {"kind":"ConfigMap","namespace":"openshift-windows-machine-config-operator","name":"windows-instances","uid":"3443a130-f3ca-4835-8860-b310f97b3bce","apiVersion":"v1","resourceVersion":"13302253"}, "reason": "InstanceSetupFailure"}

The errors above occurred with Windows Server 2019. After switching to Windows Server 2022, it worked.
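
For context on the log above: status 1060 is the Windows error code ERROR_SERVICE_DOES_NOT_EXIST, so `sc.exe qc` exiting with 1060 means the queried service is not installed. A minimal sketch of how a caller might map that exit status to an "exists" answer follows. The function name `classify_sc_status` is hypothetical, and this is an illustration only, not how WMCO handles the result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: interpret the exit status of `sc.exe qc <service>`.
# An assumption-laden illustration, not WMCO's logic.
classify_sc_status() {
  case "$1" in
    0)    echo "exists" ;;
    1060) echo "missing" ;;   # ERROR_SERVICE_DOES_NOT_EXIST
    *)    echo "error" ;;     # any other status is a genuine failure
  esac
}

# The status reported in the log above.
classify_sc_status 1060
```

Under this reading, a cleanup step that only needs the service to be absent could treat "missing" as success instead of failing the bootstrap.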

trevor-dolby-at-ibm-com commented 11 months ago

@kaolaaz163 I was going to suggest the Windows version might be the problem (I'm running 10.0.20348.1) and it sounds like upgrading has indeed helped; glad to hear you're up and running now.

kaolaaz163 commented 11 months ago

It now works normally, but kubelet still seems to have some problems. On the OpenShift targets page, the related endpoints report unauthorized errors.

[Screenshot: windows]

At the same time, the kubelet log reports the following error.

[Screenshot: windows2]

tdolby-at-uk-ibm-com commented 11 months ago

I see the same errors in the "Metrics targets" view, and my kubelet worked for a day or so and is now showing errors like yours. Seems as if certificate rotation might have gone wrong in some way?
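
One way to start checking the certificate-rotation suspicion is to look for certificate signing requests that are still waiting for approval. The sketch below filters the output of `oc get csr --no-headers` for rows whose last column is `Pending`. The function name `pending_csrs` and the sample rows are made up for illustration, and the thread does not confirm that pending requests are the cause here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `oc get csr --no-headers` output on stdin and
# print the names of requests whose last column (CONDITION) is "Pending".
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Made-up sample rows in the column layout of `oc get csr --no-headers`.
pending_csrs <<'EOF'
csr-aaaaa   5m   kubernetes.io/kubelet-serving   system:node:win-node   <none>   Pending
csr-bbbbb   2h   kubernetes.io/kubelet-serving   system:node:win-node   <none>   Approved,Issued
EOF
```

Against a live cluster the same filter would be used as `oc get csr --no-headers | pending_csrs`. A request can be approved with `oc adm certificate approve <name>`, but only after confirming it really comes from the expected node.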

kaolaaz163 commented 11 months ago

Can anyone help me find out exactly what is causing the kubelet error?

jrvaldes commented 10 months ago

WINC-1175 tracks the community 4.14 WMCO operator release.

openshift-bot commented 7 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 6 months ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented 5 months ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 5 months ago

@openshift-bot: Closing this issue.
