Taints are not removed from nodes

DipanshuSehjal commented 2 years ago

Version - HA controller 1.1.0

Many times we have seen that taints are not removed from nodes so, pods are not scheduled. Moreover, the taints come back on nodes as soon as you remove them manually. Taints mostly occur when nodes are in stage of rebooting for example, during node upgrade and reboot. Additionally, both replicas of 2 resources also went into Outdated state.

For instance,

[~]# kubectl describe node | grep -i taint
Taints:             drbd.linbit.com/lost-quorum:NoSchedule
Taints:             drbd.linbit.com/force-io-error:NoSchedule
Taints:             drbd.linbit.com/lost-quorum:NoSchedule

| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode1.galwayan.com | 7000 | Unused | Ok                                                                                                      |   Outdated | 2022-09-12 14:20:06 |
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode2.galwayan.com | 7000 | Unused | Ok                                                                                                      |   Outdated | 2022-09-12 14:20:02 |
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode3.galwayan.com | 7000 | Unused | Ok                                                                                                      |   Diskless | 2022-09-12 14:20:04 |

Settings defined in storage class as per HA-controller requirements -

  DrbdOptions/auto-quorum: suspend-io
  DrbdOptions/Resource/on-no-data-accessible: suspend-io
  DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
  DrbdOptions/Net/rr-conflict: retry-connect

WanzenBug commented 2 years ago

Taints mostly occur when nodes are in stage of rebooting for example, during node upgrade and reboot.

That is expected, as the taints (at least the drbd.linbit.com/lost-quorum ones) are added when one node looks unreachable from another. During a reboot this is obviously the case. The question is why the taint is not removed after the node is back online and the satellite + DRBD is running again.

In the above case, its probably related to the outdated pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 resource. Could you please collect kernel logs on all 3 nodes for that resource: journalctl -t kernel --grep pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1.

Lastly, there is one drbd.linbit.com/force-io-error taint, which would indicate that one of the nodes has the drbd device open, but is currently trying to become secondary. Could you check which node has that taint and see what's up with that resource? The output of drbdsetup status -v on that node should also show force-io-error:yes