rancher / rancher

rke2-worker-plan fail on windows server 2022 node #45912

Open Ducatel opened 2 weeks ago

Ducatel commented 2 weeks ago

Environmental Info:

RKE2 Version:

On windows nodes:

rke2.exe version v1.27.12+rke2r1 (25b27b4e4709a2ac4c550609ad730a9e172d110a)
go version go1.21.8

On linux node:

rke2 version v1.27.12+rke2r1 (25b27b4e4709a2ac4c550609ad730a9e172d110a)
go version go1.21.8 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

On windows nodes:

windows server 2022 21H2 Build 20348.2527

On linux node:

Linux  5.14.0-427.18.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Mon May 13 10:47:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

Describe the bug:

I repeatedly observe a failing pod named apply-rke2-worker-plan* on the Windows node, due to ImagePullBackOff (Back-off pulling image "rancher/rke2-upgrade:v1.27.12-rke2r1").

Pods scheduled on the Linux node seem to work properly.

When I tried to manually pull the image from a windows node:

> crictl pull rancher/rke2-upgrade:v1.27.12-rke2r1
E0621 14:41:14.505286    9092 remote_image.go:167] "PullImage from image service failed" err="rpc error: code = NotFound desc = failed to pull and unpack image \"docker.io/rancher/rke2-upgrade:v1.27.12-rke2r1\": no match for platform in manifest: not found" image="rancher/rke2-upgrade:v1.27.12-rke2r1"
time="2024-06-21T14:41:14+02:00" level=fatal msg="pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \"docker.io/rancher/rke2-upgrade:v1.27.12-rke2r1\": no match for platform in manifest: not found"

So there seems to be no Windows-compatible build of this image.
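The "no match for platform in manifest" error means the image's manifest list has no entry for the platform of the node that is pulling. A minimal sketch of that check follows, assuming a simplified platform match on OS and architecture only (a real runtime may also compare fields such as the OS version). `SAMPLE_INDEX`, its digests, and the helper names are hypothetical and are not the actual manifest of rancher/rke2-upgrade.

```python
# Hypothetical image index; entries are illustrative only and do NOT
# reproduce the real manifest list of rancher/rke2-upgrade.
SAMPLE_INDEX = {
    "manifests": [
        {"digest": "sha256:example-linux-amd64",
         "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:example-linux-arm64",
         "platform": {"os": "linux", "architecture": "arm64"}},
    ]
}


def find_platform(index, os_name, arch):
    """Return the first manifest entry matching os/arch, or None."""
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return entry
    return None


def explain_pull(index, os_name, arch):
    """Describe what a pull for the given platform would do."""
    match = find_platform(index, os_name, arch)
    if match is None:
        return f"no match for platform {os_name}/{arch} in manifest"
    return f"would pull {match['digest']}"


if __name__ == "__main__":
    print(explain_pull(SAMPLE_INDEX, "linux", "amd64"))
    print(explain_pull(SAMPLE_INDEX, "windows", "amd64"))
```

With an index that lists only Linux entries, a Linux node resolves a digest while a Windows node gets no match, which is consistent with the behaviour reported above.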

Steps To Reproduce:

Expected behavior:

The upgrade should not fail on the Windows node. (I don't really know what this upgrade does.)

Actual behavior:

The upgrade fails on the Windows node.

Thanks in advance for your help

manuelbuil commented 2 weeks ago

Automatic upgrading does not work on Windows nodes. But I think we should mention that in the docs or show a useful log message to avoid confusion. Would you mind explaining how you triggered the upgrade process? Thanks!

Ducatel commented 2 weeks ago

Hi,

As I said, I just followed the steps from the Rancher Quick Start documentation, and this upgrade plan is not mentioned anywhere in it. So I don't know how to stop this plan from being scheduled on the Windows node. After some searching, it seems many people face this issue (on the Rancher forum, Stack Overflow, etc.) without a solution.

I can probably update the node selector in the rke2-worker-plan to avoid that, but I don't know whether that is safe.

manuelbuil commented 2 weeks ago

The RKE2 docs explain that plan: https://docs.rke2.io/upgrade/automated_upgrade

Could you be more specific about the Rancher Quick Start documentation? Which exact doc are you looking at? Unfortunately, I don't see anything describing upgrades in the Quick Start Guide.

I think we should create some anti-affinity with Windows nodes and warn users that they need to do that upgrade manually.

Ducatel commented 2 weeks ago

I followed :

And yes, there is nothing about upgrades in this documentation.

So I tried to update the plan with:

    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: DoesNotExist
    - key: kubernetes.io/os
      operator: NotIn
      values:
        - windows

But the change does not seem to be applied.

manuelbuil commented 2 weeks ago

Try with this item in the matchExpressions ==> - {key: beta.kubernetes.io/os, operator: In, values: ["linux"]}
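To see what these matchExpressions are meant to achieve, here is a minimal sketch of how they evaluate against node labels, assuming the plan's nodeSelector follows standard Kubernetes label-selector semantics (In, NotIn, Exists, DoesNotExist, all expressions ANDed). The node label sets and the `matches` helper are hypothetical illustrations, not labels read from a real cluster.

```python
def matches(labels, expressions):
    """Return True if a node's labels satisfy every match expression."""
    for expr in expressions:
        key = expr["key"]
        op = expr["operator"]
        values = expr.get("values", [])
        present = key in labels
        if op == "In":
            ok = present and labels[key] in values
        elif op == "NotIn":
            # NotIn also matches nodes that lack the key entirely.
            ok = (not present) or labels[key] not in values
        elif op == "Exists":
            ok = present
        elif op == "DoesNotExist":
            ok = not present
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True


# Expressions taken from the thread: skip server nodes, keep Linux only.
PLAN_SELECTOR = [
    {"key": "node-role.kubernetes.io/master", "operator": "DoesNotExist"},
    {"key": "beta.kubernetes.io/os", "operator": "In", "values": ["linux"]},
]

# Hypothetical label sets, for illustration only.
LINUX_WORKER = {"kubernetes.io/os": "linux", "beta.kubernetes.io/os": "linux"}
WINDOWS_WORKER = {"kubernetes.io/os": "windows", "beta.kubernetes.io/os": "windows"}
LINUX_SERVER = {"beta.kubernetes.io/os": "linux",
                "node-role.kubernetes.io/master": "true"}

if __name__ == "__main__":
    for name, labels in [("linux worker", LINUX_WORKER),
                         ("windows worker", WINDOWS_WORKER),
                         ("linux server", LINUX_SERVER)]:
        print(name, matches(labels, PLAN_SELECTOR))
```

Under these assumptions, only the Linux worker is selected, so the upgrade pod would no longer be scheduled on the Windows node. Note that the `In ["linux"]` form requires the label to be present, whereas `NotIn ["windows"]` would also select nodes that carry no OS label at all.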

manuelbuil commented 2 weeks ago

I'll probably update the docs warning about windows not being supported in the SUC and add that matchExpression in the example, so that it is less likely that people get confused

Ducatel commented 2 weeks ago

> Try with this item in the matchExpressions ==> - {key: beta.kubernetes.io/os, operator: In, values: ["linux"]}

It's not a matter of how I write the matchExpressions. When I edit the plan through the Rancher UI or with kubectl edit, my change does not seem to be persisted. When I fetch the YAML config again, it is unchanged.

> I'll probably update the docs warning about windows not being supported in the SUC and add that matchExpression in the example, so that it is less likely that people get confused

The thing is, I didn't create this plan myself. So even with properly updated documentation, some people will still face the issue.

manuelbuil commented 2 weeks ago

Ok, I thought this was a pure RKE2 issue but I now understand it's a Rancher issue. I'll have a look at how RM is generating that plan. I think updating the docs will also help users that are doing the upgrade by following the RKE2 docs https://docs.rke2.io/upgrade/automated_upgrade

brandond commented 2 weeks ago

Yeah, if this is the Rancher-managed SUC deployment and plan, then this is a Rancher issue, not RKE2.

I don't know that Rancher currently supports imported clusters with Windows nodes; I suspect it only properly handles Windows clusters that are provisioned via Rancher. I would defer to the support matrix as to whether or not this is something that is supposed to work.