longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Warning events are being spammed by Longhorn - CRD #7290

Closed larivierec closed 8 months ago

larivierec commented 9 months ago

Describe the bug (🐛 if you encounter this issue)

Warning events are being spammed by Longhorn because the Longhorn Node CRD doesn't seem to recognize a new message in the latest k3s release.

"Unknown condition true of Kubernetes node : condition is of the EtcdIsVoter reason is, MemberNotLearner, message is Node is a voting member of etcd cluster."

To Reproduce

Nodes that are ready will broadcast a new condition not known by the Node CRD v1beta2 and spam warning events in longhorn-system namespace.

Expected behavior

No events or logs should be present.

Environment

Longhorn: 1.5.3
K3s: 1.28.4-k3s2

3 master nodes 5 workers

Ubuntu 22.04 baremetal.

Additional context

m-ildefons commented 9 months ago

This warning is caused by K3s setting a new node condition. This was introduced with c5cd7b3d6543ef782c84651b5c46e904ca83828b, after an issue and an ADR. Various K8s implementations or cloud providers are allowed to set their own node conditions, and the set provided in the K8s API is just a minimal set guaranteed to be there: https://pkg.go.dev/k8s.io/api/core/v1#NodeConditionType

There is nothing to worry about with this particular node condition. It only reports which role each node's etcd instance plays, so that nodes can be added to or removed from a K3s cluster without jeopardizing the quorum of the etcd cluster.

Longhorn exposes all unknown (to longhorn) node conditions as K8s events. This is a mechanism that only has informational value and is used as such in Longhorn as well, as it only serves to relay information to the UI. The warnings are harmless in this case.

An argument can be made to ignore all unknown node conditions by default, or alternatively just ignore this particular one. What do you think @innobead , should Longhorn assume that an unknown node condition is worthy of being a warning, just as it is now? It should be noted here that events are only emitted for node conditions that are "true", but not for ones that are "false". However there is no convention or guidance if "true" or "false" should indicate an error condition or healthiness. I think the best course of action would be to just ignore non-standard node conditions except when we can derive value from them.
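The behavior described above can be modelled in a few lines. This is a minimal Python sketch, not Longhorn's actual Go code: the `KNOWN_CONDITIONS` set, the `NodeCondition` class, and the `events_for_node` function are hypothetical names chosen for illustration. Only the message format is taken from the event output in this thread. The `warn_on_unknown` flag stands in for the proposed change of ignoring conditions Longhorn does not use.

```python
from dataclasses import dataclass

# Standard node condition types from the Kubernetes core/v1 API.
# Assumption: treated here as the set "known" to the controller.
KNOWN_CONDITIONS = {
    "Ready",
    "MemoryPressure",
    "DiskPressure",
    "PIDPressure",
    "NetworkUnavailable",
}


@dataclass
class NodeCondition:
    type: str
    status: str  # "True", "False", or "Unknown"
    reason: str = ""
    message: str = ""


def events_for_node(node_name, conditions, warn_on_unknown=True):
    """Return the warning messages this model would emit for one node."""
    events = []
    for cond in conditions:
        if cond.type in KNOWN_CONDITIONS:
            continue  # known conditions are handled elsewhere
        # Unknown conditions only produce a warning when their status is "True".
        if warn_on_unknown and cond.status == "True":
            events.append(
                f"Unknown condition true of kubernetes node {node_name}: "
                f"condition type is {cond.type}, reason is {cond.reason}, "
                f"message is {cond.message}"
            )
    return events


conditions = [
    NodeCondition("Ready", "True", "KubeletReady", "kubelet is posting ready status"),
    NodeCondition("EtcdIsVoter", "True", "MemberNotLearner",
                  "Node is a voting member of the etcd cluster"),
]

print(events_for_node("fluffy", conditions))                         # list with one warning message
print(events_for_node("fluffy", conditions, warn_on_unknown=False))  # []
```

With `warn_on_unknown=False`, the `EtcdIsVoter` condition is skipped silently, which is the "ignore non-standard node conditions" option.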

larivierec commented 9 months ago

Understood.

Yeah, I saw the commit in the K3s repository yesterday and noticed it's a new addition from 3 weeks ago.

The main reason for this issue is that we should have a little bit of control over what gets logged, especially when it's harmless.

(⎈|k3s:default)➜  home-cluster git:(main) ✗ k get events -n longhorn-system
LAST SEEN   TYPE      REASON                     OBJECT           MESSAGE
4m9s        Warning   UnknownNodeConditionTrue   node/fluffy      Unknown condition true of kubernetes node fluffy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
4m9s        Warning   UnknownNodeConditionTrue   node/fluffy      Unknown condition true of kubernetes node fluffy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
4m9s        Warning   UnknownNodeConditionTrue   node/fluffy      Unknown condition true of kubernetes node fluffy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
4m12s       Warning   UnknownNodeConditionTrue   node/fluffy      Unknown condition true of kubernetes node fluffy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
4m9s        Warning   UnknownNodeConditionTrue   node/fluffy      Unknown condition true of kubernetes node fluffy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
4m9s        Warning   UnknownNodeConditionTrue   node/fluffy      Unknown condition true of kubernetes node fluffy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
4m9s        Warning   UnknownNodeConditionTrue   node/fluffy      Unknown condition true of kubernetes node fluffy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
2m38s       Warning   UnknownNodeConditionTrue   node/fluffy      Unknown condition true of kubernetes node fluffy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
3m15s       Warning   UnknownNodeConditionTrue   node/frenzy      Unknown condition true of kubernetes node frenzy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
3m15s       Warning   UnknownNodeConditionTrue   node/frenzy      Unknown condition true of kubernetes node frenzy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
3m15s       Warning   UnknownNodeConditionTrue   node/frenzy      Unknown condition true of kubernetes node frenzy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
3m15s       Warning   UnknownNodeConditionTrue   node/frenzy      Unknown condition true of kubernetes node frenzy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
3m15s       Warning   UnknownNodeConditionTrue   node/frenzy      Unknown condition true of kubernetes node frenzy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
3m15s       Warning   UnknownNodeConditionTrue   node/frenzy      Unknown condition true of kubernetes node frenzy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
3m15s       Warning   UnknownNodeConditionTrue   node/frenzy      Unknown condition true of kubernetes node frenzy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
2m38s       Warning   UnknownNodeConditionTrue   node/frenzy      Unknown condition true of kubernetes node frenzy: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
2m9s        Warning   UnknownNodeConditionTrue   node/whirlwind   Unknown condition true of kubernetes node whirlwind: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
2m9s        Warning   UnknownNodeConditionTrue   node/whirlwind   Unknown condition true of kubernetes node whirlwind: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
2m9s        Warning   UnknownNodeConditionTrue   node/whirlwind   Unknown condition true of kubernetes node whirlwind: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
2m9s        Warning   UnknownNodeConditionTrue   node/whirlwind   Unknown condition true of kubernetes node whirlwind: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
2m9s        Warning   UnknownNodeConditionTrue   node/whirlwind   Unknown condition true of kubernetes node whirlwind: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
2m9s        Warning   UnknownNodeConditionTrue   node/whirlwind   Unknown condition true of kubernetes node whirlwind: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
2m9s        Warning   UnknownNodeConditionTrue   node/whirlwind   Unknown condition true of kubernetes node whirlwind: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
3m7s        Warning   UnknownNodeConditionTrue   node/whirlwind   Unknown condition true of kubernetes node whirlwind: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster

Right now, for each of my master nodes, it seems to be emitting 7 messages / 60 seconds. Could unknown conditions be ignored through a chart option to the longhorn application perhaps?

aumer-amr commented 9 months ago

I second this. After upgrading, it's spamming the events quite a lot. Ignoring these unknown conditions by default, or providing an option to do so, would be highly appreciated.

joryirving commented 9 months ago

FYI for context, I don't get these messages, however my control planes are tainted as NoSchedule, so Longhorn never runs on my control plane nodes.

brandond commented 9 months ago

This comes from here: https://github.com/longhorn/longhorn-manager/blob/268b1d63b9570e750bc13e011f475a17e6878041/controller/node_controller.go#L455-L458

See https://kubernetes.io/docs/reference/node/node-status/#condition: Some conditions are errors and status: True is bad, for others like “Ready”, status: True is good. LH seems to have assumed that all unknown conditions will be errors, and warns if they are true. This is an incorrect assumption.
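To make the polarity point concrete, here is a small Python sketch. The table covers only the five standard condition types from the Kubernetes node-status documentation. The `HEALTHY_STATUS` and `classify` names are hypothetical, and the "undetermined" label is an assumption of this sketch, not something Kubernetes or Longhorn defines.

```python
# The status that means "healthy" differs per condition type.
HEALTHY_STATUS = {
    "Ready": "True",                # True is good
    "MemoryPressure": "False",      # True is bad
    "DiskPressure": "False",        # True is bad
    "PIDPressure": "False",         # True is bad
    "NetworkUnavailable": "False",  # True is bad
}


def classify(cond_type, status):
    """Return "healthy", "unhealthy", or "undetermined" for a condition."""
    expected = HEALTHY_STATUS.get(cond_type)
    if expected is None:
        # Unknown type: nothing says whether "True" is good or bad.
        return "undetermined"
    return "healthy" if status == expected else "unhealthy"


print(classify("Ready", "True"))         # healthy
print(classify("DiskPressure", "True"))  # unhealthy
print(classify("EtcdIsVoter", "True"))   # undetermined
```

For a type like `EtcdIsVoter`, the status alone gives no basis for a warning.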

PhanLe1010 commented 9 months ago

An argument can be made to ignore all unknown node conditions by default, or alternatively just ignore this particular one. What do you think @innobead , should Longhorn assume that an unknown node condition is worthy of being a warning, just as it is now? It should be noted here that events are only emitted for node conditions that are "true", but not for ones that are "false". However there is no convention or guidance if "true" or "false" should indicate an error condition or healthiness. I think the best course of action would be to just ignore non-standard node conditions except when we can derive value from them.

Ignoring unknown conditions looks reasonable to me. The other approach is that we should at least not repeatedly spam the API server, by emitting each unknown condition only once?
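The "emit only once" alternative could look roughly like the following Python sketch. The `OnceEmitter` class is a hypothetical name for illustration, and keeping the seen-set in memory is an assumption: a real controller would also have to decide what happens after a restart or when the condition changes.

```python
class OnceEmitter:
    """Emit a message only the first time a (node, type, status) key is seen."""

    def __init__(self):
        self._seen = set()
        self.emitted = []

    def emit(self, node, cond_type, status, message):
        key = (node, cond_type, status)
        if key in self._seen:
            return False  # already reported, stay quiet
        self._seen.add(key)
        self.emitted.append((node, message))
        return True


emitter = OnceEmitter()
# Seven reconcile loops see the same condition on the same node.
for _ in range(7):
    emitter.emit("fluffy", "EtcdIsVoter", "True",
                 "Node is a voting member of the etcd cluster")

print(len(emitter.emitted))  # 1
```

Because the status is part of the key, a transition from "True" to "False" and back would not be reported again, which may or may not be desirable.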

brandond commented 9 months ago

My personal preference would be to have LH ignore any conditions that it doesn't specifically derive information from. If users want to be alerted about specific conditions on their nodes, there are tools other than LH that will handle this.

innobead commented 9 months ago

My personal preference would be to have LH ignore any conditions that it doesn't specifically derive information from. If users want to be alerted about specific conditions on their nodes, there are tools other than LH that will handle this.

Agreed. Should just do it explicitly by watching what conditions we care about. @m-ildefons Please continue working on it. Thanks.

ijorjadze commented 8 months ago

I found those spammy events today after upgrading RKE2 from 1.25 to 1.26. I have Longhorn v1.5.1 installed, so will the fix be in v1.6.0?

larivierec commented 8 months ago

any updates on this?

innobead commented 8 months ago

This will be fixed in the upcoming 1.6.0 and backported to 1.5.4 & 1.4.5.

@m-ildefons you forgot to update the status of the issue. I just moved it to ready-for-testing, and QA can follow the reproduction steps to check whether the unknown node condition is still watched and re-emitted as Longhorn events.

roger-ryao commented 8 months ago

Verified on master-head 20231225

The test steps

https://github.com/longhorn/longhorn/issues/7290#issue-2031804175

  1. Install k3s Server and Initialize the Cluster
    NOTE: For this test case, node taints node-role.kubernetes.io/master=true:NoExecute & node-role.kubernetes.io/master=true:NoSchedule on the control plane node have NOT been added.
    curl -sfL https://get.k3s.io | K3S_TOKEN=SECRET INSTALL_K3S_EXEC="server --cluster-init" sh -
  2. Check Control Node's Conditions
    # Replace CONTROL_NODE_NAME with the actual name of your control node
    kubectl get node $CONTROL_NODE_NAME -o=jsonpath='Node Name: {.metadata.name}{"\n"}Conditions:{"\n"}{range .status.conditions[*]}- Type: {.type}{"\n"}  Status: {.status}{"\n"}  LastHeartbeatTime: {.lastHeartbeatTime}{"\n"}  LastTransitionTime: {.lastTransitionTime}{"\n"}  Reason: {.reason}{"\n"}  Message: {.message}{"\n\n"}{end}'
  3. Install Longhorn
  4. Check events in the longhorn-system namespace
    kubectl -n longhorn-system get events
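As an alternative to the jsonpath template in step 2, the same check can be done on the JSON form of the node (`kubectl get node $CONTROL_NODE_NAME -o json`). The Python sketch below is illustrative only: the `sample` document is a hand-written, trimmed-down stand-in for real kubectl output, and `non_standard_conditions` is a hypothetical helper name.

```python
import json

# Hand-written sample shaped like `kubectl get node <name> -o json` output.
sample = json.loads("""
{
  "metadata": {"name": "ip-10-0-1-57"},
  "status": {"conditions": [
    {"type": "Ready", "status": "True",
     "reason": "KubeletReady", "message": "kubelet is posting ready status"},
    {"type": "EtcdIsVoter", "status": "True",
     "reason": "MemberNotLearner",
     "message": "Node is a voting member of the etcd cluster"}
  ]}
}
""")

# Standard node condition types from the Kubernetes core/v1 API.
STANDARD = {"Ready", "MemoryPressure", "DiskPressure",
            "PIDPressure", "NetworkUnavailable"}


def non_standard_conditions(node):
    """Return the conditions whose type is not one of the standard types."""
    conditions = node.get("status", {}).get("conditions", [])
    return [c for c in conditions if c["type"] not in STANDARD]


for cond in non_standard_conditions(sample):
    print(f'{sample["metadata"]["name"]}: {cond["type"]}={cond["status"]} ({cond["reason"]})')
```

Against a live cluster, `json.load(sys.stdin)` fed from the kubectl command would replace the embedded sample.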

Result Passed

  1. Installed Longhorn v1.5.3 on k3s cluster and observed warnings in longhorn-system namespace:
65s         Warning   UnknownNodeConditionTrue   node/ip-10-0-1-57   Unknown condition true of kubernetes node ip-10-0-1-57: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
39s         Warning   UnknownNodeConditionTrue   node/ip-10-0-1-57   Unknown condition true of kubernetes node ip-10-0-1-57: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
38s         Warning   UnknownNodeConditionTrue   node/ip-10-0-1-57   Unknown condition true of kubernetes node ip-10-0-1-57: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
24s         Warning   UnknownNodeConditionTrue   node/ip-10-0-1-57   Unknown condition true of kubernetes node ip-10-0-1-57: condition type is EtcdIsVoter, reason is MemberNotLearner, message is Node is a voting member of the etcd cluster
  2. Inspected node conditions and found new condition EtcdIsVoter:
- Type: EtcdIsVoter
  Status: True
  LastHeartbeatTime: 2023-12-25T14:16:26Z
  LastTransitionTime: 2023-12-25T14:16:26Z
  Reason: MemberNotLearner
  Message: Node is a voting member of the etcd cluster
  3. Checked control plane node events in Longhorn master-head and did not observe EtcdIsVoter warnings.

Ashkaan commented 8 months ago

I tried Longhorn and it didn't work for my need. I removed it from my cluster, but I still have the warnings (in Lens). How do I clear those warnings?

Sierra1011 commented 8 months ago

Any idea when we can expect 1.4.5? I accidentally encountered this bug while on 1.4 right before Christmas (idle hands make for poor decisions :grin: ) and this has crippled my homelab.

Ashkaan commented 8 months ago

Have you figured out how to clear the warnings? There must be some way to edit ETCD (or wherever that's stored).

omidraha commented 8 months ago

I also received this warning when I installed longhorn on the specific node. https://github.com/longhorn/longhorn/issues/7407#issuecomment-1881838982

PhanLe1010 commented 8 months ago

I tried Longhorn and it didn't work for my need. I removed it from my cluster, but I still have the warnings (in Lens). How do I clear those warnings?

Hi @Ashkaan What warning are you still seeing after uninstalling Longhorn? Have you followed the instructions to uninstall Longhorn https://longhorn.io/docs/1.5.3/deploy/uninstall/ ?

PhanLe1010 commented 8 months ago

Any idea when we can expect 1.4.5? I accidentally encountered this bug while on 1.4 right before Christmas (idle hands make for poor decisions 😁 ) and this has crippled my homelab.

Hi @Sierra1011 Do you mean the cluster is frozen because of the warning events?

brandond commented 8 months ago

LH has always had this behavior, you can't downgrade LH to resolve it. Until the new release of LH is available, the only way to stop getting the events is to downgrade to an older release of K3s/RKE2.

Ashkaan commented 8 months ago

I tried Longhorn and it didn't work for my need. I removed it from my cluster, but I still have the warnings (in Lens). How do I clear those warnings?

Hi @Ashkaan What warning are you still seeing after uninstalling Longhorn? Have you followed the instructions to uninstall Longhorn https://longhorn.io/docs/1.5.3/deploy/uninstall/ ?

Yes, I followed this: helm uninstall longhorn -n longhorn-system

I don't have anything longhorn installed anywhere (so far as I can tell) and I get this:

[image]

brandond commented 8 months ago

You get that where? Those events are 23 days old. Normally events don't hang around that long.

Ashkaan commented 8 months ago

Correct. These have been lingering and driving our team crazy. We can see them in Lens. It happened as soon as we tested longhorn for a project.

brandond commented 8 months ago

Have you tried deleting the events?

Has someone customized your cluster configuration to retain events for longer than usual? The default event ttl is a couple hours, if I remember correctly.

Ashkaan commented 8 months ago

Sadly, I haven't yet learned how to delete events or customize the TTL. If you have any docs or advice, I'd really appreciate it.

Also, all other warnings go away within a couple of hours. These always stay.
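For reference, events are ordinary namespaced API objects, so they can be deleted with kubectl and filtered by a field selector on `reason`. The Python sketch below only builds and prints such a command, it does not run it. The `delete_events_cmd` helper is a hypothetical name, and the namespace and reason are the ones shown in the event output earlier in this thread. Retention is governed by the kube-apiserver `--event-ttl` flag, whose upstream default is one hour. How that flag is passed depends on the distribution, so that part is left out here.

```python
import shlex


def delete_events_cmd(namespace, reason):
    """Build a kubectl command that deletes events with a given reason."""
    return [
        "kubectl", "delete", "events",
        "-n", namespace,
        "--field-selector", f"reason={reason}",
    ]


cmd = delete_events_cmd("longhorn-system", "UnknownNodeConditionTrue")
print(shlex.join(cmd))
# kubectl delete events -n longhorn-system --field-selector reason=UnknownNodeConditionTrue
```

This removes Event objects only. A condition stored on the Node object itself is not an event and is not affected by it.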

Sierra1011 commented 8 months ago

@PhanLe1010

Any idea when we can expect 1.4.5? I accidentally encountered this bug while on 1.4 right before Christmas (idle hands make for poor decisions 😁 ) and this has crippled my homelab.

Hi @Sierra1011 Do you mean the cluster is frozen because of the warning events?

TL;DR I am keen to know in what sort of timeframe a backport will become available.

Even after downgrading and reverting K3s versions etc, I continued getting these alerts preventing volumes being mounted. I tried removing taints by hand and deleting all the events, but no cigar. So I can't upgrade K3s version above a certain point without stopping volumes working.

It actually started/aligned with the start of a whole saga of storage issues for me that took a lot of fixing, and while I'm not certain exactly of the root cause, it's a lot of coincidence.

I'm not trying to be petulant, apologies if I'm coming across that way; I'm reluctant to do a knee-jerk upgrade from V1 to V2 when there's only one supported release that won't break with any vaguely recent version of K3s. TIA

brandond commented 8 months ago

@Sierra1011 I continued getting these alerts preventing volumes being mounted.

These events are purely informational and do not prevent volumes from being mounted. You can see by examining the PR that the only change to "fix" this is to stop emitting the events.

If you are having problems with your LH volumes, look further for the actual root cause. This is not the source of your issue.

jLemmings commented 7 months ago

Is there any workaround for this issue? I'm also getting flooded with multiple 100'00 errors... Kubernetes: v1.26.11 rke2r1 Longhorn: v1.5.3

Sierra1011 commented 7 months ago

I left it another few weeks and tried the upgrade again, prompted by Flux 2.2 requiring K8s API version >=1.26.

Still on LH 1.4.4. As soon as I upgraded to 1.26(.13.k3s2), it would not provision new volumes in response to PVCs, and the logs are full of the same EtcdIsVoter event. I can create volumes manually and link them up, so it seems likely to be unrelated. I will investigate further, but consider me out of this particular issue for now.

Thanks!

cs-shadowbq commented 5 months ago

Rancher UI + Longhorn is still only listed as 103.2.1+up1.5.3 ... https://github.com/rancher/charts/tree/release-v2.9/charts/longhorn .. still waiting for 1.5.4 or 1.6.0 to drop

brandond commented 5 months ago

Rancher UI + Longhorn is still only listed as 103.2.1+up1.5.3

That is a Rancher issue. Not Longhorn, and not handled by this team.

Delta1977 commented 5 months ago

No, the root cause belongs to Longhorn and is solved in Longhorn 1.5.4 and 1.6.0. Longhorn 1.5.4 is shortly before release in the Rancher app store.

Longhorn captures these "unknown" events from Rancher and reports them to Rancher.

brandond commented 5 months ago

LH doesn't report anything directly TO rancher. It just creates Kubernetes events, same as many other components. Rancher shows events in the UI. This is all off-topic; the LH chart for Rancher will be updated when the Rancher team gets to it.

Delta1977 commented 5 months ago

It's already released, and LH 1.5.4 solved the log spamming! https://github.com/rancher/charts/commit/159d5114c62dd567723fc749318cad71f86906e7

vember31 commented 4 months ago

@Ashkaan did you ever figure out how to remove those 'Node is a voting member of the etcd cluster' warning messages from Lens? I know they aren't specifically events, they're node conditions... the events themselves are no longer being spammed by Longhorn thanks to this fix here. But I also haven't figured out how to stop them from showing on the Cluster page in Lens.

Maybe it's Lens that needs to be updated?

Ashkaan commented 4 months ago

No, I deleted the cluster over it. Unreal. I’ll never try Longhorn again.
