rancher / dashboard

The Rancher UI
https://rancher.com
Apache License 2.0
442 stars 243 forks source link

Rancher server logs are spammed with logs when one of the node of a downstream RKE2 cluster is in `Reconciling` state (UI sends null) #8480

Open dasarinaidu opened 1 year ago

dasarinaidu commented 1 year ago

Rancher Server Setup

Information about the Cluster

User Information

Describe the bug Sometimes when performing an HA Upgrade to Rancher v2.7.0 TO v2.7.2-rc6, while having an AWS RKE2 custom cluster - the cluster will go into Updating state and the etcd nodes will go into a Reconciling state and the INFO logs are spammed and keep coming the same message.

To Reproduce

  1. Create a HA Rancher serve on v2.7.0
  2. Provision a downstream AWS RKE2 custom cluster(hardened) with k8 version : v1.23.14+rke2r1
  3. Perform an HA Upgrade on Rancher v2.7-head (v2.7.2-rc6)
  4. Make sure the upgrade success and downstream cluster came back active
  5. Upgrade downstream cluster kubarnetes version to to v1.25.7+rke2r1 and SAVE
  6. Check for the system behavior, Logs and cluster state

Result a. Downstream cluster initially came back active and then it went to update state with an error and it never came back to Active b. Rancher logs are spammed with INFO, these logs are keep coming for every second c. One of the node (etcd) is in Reconciling state and never came back with message on UI (Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet)

Expected Result a. Rancher logs should not be spammed with the logs every second.

LOGS 2023/03/17 18:36:58 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:36:58 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:03 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:03 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:08 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:08 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:13 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:13 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:18 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:18 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:23 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:23 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:28 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:28 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:33 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:33 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:38 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:38 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:43 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:43 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:48 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:48 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet 2023/03/17 18:37:51 [ERROR] error syncing 'fleet-default/custom-5e4c4afa8ddd': handler unmanaged-machine: the server was unable to return a response in the time allotted, but may still be processing the request (get nodes ip-172-31-2-93), requeuing

Oats87 commented 1 year ago

This is a UI bug.

The root cause of the problem is the fact that we are delivering an invalid configuration to RKE2, namely specifying profile: null in machineGlobalConfig.

This is very easily reproducible -- create a new v2prov cluster in the UI, click "Edit as YAML", and note that profile: null under machineGlobalConfig.

Screenshot 2023-03-20 at 5 44 45 PM

See attached screenshot.

snasovich commented 1 year ago

@nwmac @gaktive , this is likely a side-effect of recent changes for profile behavior to accommodate k8s 1.25 changes so possibly a regression we don't want to happen in the upcoming 2.7.2 release. Marking as regression and release-blocker.

nwmac commented 1 year ago

@snasovich Backend should ignore the null profile. If the configuration is invalid, backend should reject the request.

nwmac commented 1 year ago

/fordwardport v2.7.next2

snasovich commented 1 year ago

@snasovich Backend should ignore the null profile. If the configuration is invalid, backend should reject the request.

@nwmac , I agree with this in principle, but it's still a change of the behavior on UI's part that we want to be reverted. I see it's already in test, thank you for a quick turn-around.

Sahota1225 commented 1 year ago

https://github.com/rancher/rancher/issues/40942 - backend issue

jameson-mcghee commented 1 year ago

Setting this ticket to Reopened until a Test Template can be provided.

cc @nwmac

jameson-mcghee commented 1 year ago

It was determined by management that this ticket does not need a Test Template to complete testing. Therefore I am setting this ticket back to QA Working.

jameson-mcghee commented 1 year ago

During testing upgrading from v2.7.0 to v2.7-head (Commit ID: 41b2cf3), I did not observe the issue with the spammed logs when following the Steps to Reproduce. Therefore I am closing this ticket as Done.

Test Steps:

  1. Create a HA Rancher serve on v2.7.0
  2. Provision a Hardened downstream RKE2 custom cluster with K8s v1.23.14+rke2r1 (depreciated)
  3. Perform an HA Upgrade on Rancher v2.7-head
  4. Make sure the upgrade success and downstream cluster came back active
  5. Upgrade downstream cluster Kubernetes version to to v1.25.7+rke2r1 and SAVE
  6. Check for the system behavior, Logs and cluster state

Screenshots: image image image image

Rancher Logs: Rancher Logs

sowmyav27 commented 1 year ago

Reopen for some more validations.

Josh-Diamond commented 1 year ago

Ticket #8480 - Test Results - ✅

With Docker on a single-node instance of Rancher v2.7.0 => v2.7.2-rc6:

❗ Unable to reproduce ❗

Scenario Test Case Result
1. Provision downstream rke2 w/ k8s 1.23 on 2.7.0
2. Upgrade Rancher from 2.7.0 => 2.7.2-rc6, then upgrade cluster's k8s version to 1.25; observe profile field of machineGlobalConfig is left out of request
3. Provision downstream rke2 w/ k8s 1.25 on v2.7.2-rc6

Scenario 1:

  1. Fresh install of rancher v2.7.0
  2. Provision a downstream rke2 k8s 1.23 cluster
  3. Verified - cluster successfully provisions; logs not flooded

Scenario 2:

  1. Pick up where scenario 1 ^^ left off - upgrade rancher to v2.7.2-rc6
  2. Once upgraded to v2.7.2-rc6, upgrade rke2 cluster k8s version to 1.25, opening the network tab and observe profile field of machineGlobalConfig is left out of request (since it is set to null)
  3. Verified - cluster successfully upgrades; logs not flooded; request accurately matches UI changes

Scenario 3:

  1. Fresh install of rancher v2.7.2-rc6
  2. Provision a downstream rke2 k8s 1.25 cluster
  3. Verified - cluster successfully provisions; logs not flooded
Josh-Diamond commented 1 year ago

Re-opening due to the following concerns:

The fix that went in, which did resolve the flooded logs symptom the issue reported, seems to have removed profile: null from the request body sent from the UI. The field still persists and is seen in View YAML from rancher UI. This is unexpected. Re-opening to request this field gets hidden behind #

Oats87 commented 1 year ago

To clarify here, I would expect that I can copy-paste the YAML that comes out of the UI's Edit as YAML (or View as YAML) interface and directly kubectl apply|create it in my cluster and end up with an identical result.

If the UI is selectively ignoring fields when marshalling the object and sending it over the wire, that is not ideal, although it does I suppose provide a resolution to the originally reported bug.

The behavior I would expect to see is

Scenario A: If the user does not define a profile for their cluster and edits the cluster as YAML, there should be no profile key at all in the machineGlobalConfig, or if there is, it should be commented out like we do the rest of the fields that do not have explicitly set values, for example see bucket in the screenshot here: https://github.com/rancher/dashboard/issues/8480#issuecomment-1477180460

Scenario B: If the user defines a profile for their cluster, I would expect to see the profile: cis-1.5 entry under machineGlobalConfig

There is no case in which null is a valid value for profile, but profile is a valid configuration parameter for RKE2. We do not have a mechanism to perform value validation at this point for RKE2/K3s (that is a feature enhancement on the K3s/RKE2 side), and thus on the backend side would have to hard code some type of validation to ensure that null is not being delivered.

gaktive commented 1 year ago

This isn't a blocker for 2.7.2, which is now in the 11th hour of its release, and the related backend ticket is also pushed out. Moving to Q2.

nwmac commented 1 year ago

The remaining work is to comment out the profile key when editing as yaml - this is currently driven off of data that comes from the backend via the schema.

Bumping out of 2.7.next2 - tis requires for investigation and will require test automation

gaktive commented 1 year ago

@Oats87 we should coordinate with you and/or your team about how UI & backend should handle this.

gaktive commented 1 year ago

cc @Sahota1225

gaktive commented 10 months ago

This may be something the UI is doing when trying to populate fields we expect, so we should scrub upon save.