krkn-chaos / cerberus

Guardian of Kubernetes clusters. Tool to monitor cluster health and signal/alert on failures.
Apache License 2.0

Not catching when nodes are Ready,SchedulingDisabled #203

Open paigerube14 opened 1 year ago

paigerube14 commented 1 year ago

The Cerberus run passes even though 2 nodes are in the Ready,SchedulingDisabled state. Cerberus should either catch this as a failure or offer an option to do so.

Node has taint: node.kubernetes.io/unschedulable:NoSchedule

07-12 15:50:03.255  
07-12 15:50:03.255                 _                         
07-12 15:50:03.255    ___ ___ _ __| |__   ___ _ __ _   _ ___ 
07-12 15:50:03.255   / __/ _ \ '__| '_ \ / _ \ '__| | | / __|
07-12 15:50:03.255  | (_|  __/ |  | |_) |  __/ |  | |_| \__ \
07-12 15:50:03.255   \___\___|_|  |_.__/ \___|_|   \__,_|___/
07-12 15:50:03.255                                           
07-12 15:50:03.255  
07-12 15:50:03.839  Error: unknown flag: --duration
07-12 15:50:03.839  See 'oc create --help' for usage.
07-12 15:50:04.397  2023-07-12 19:50:04,235 [WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2633)'))': /api/v1/namespaces/openshift-kube-apiserver-operator/pods?pretty=True&limit=100
07-12 15:50:04.397  2023-07-12 19:50:04,237 [WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /api/v1/namespaces/openshift-user-workload-monitoring/pods?pretty=True&limit=100
07-12 15:50:04.397  2023-07-12 19:50:04,237 [WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2633)'))': /api/v1/namespaces/openshift-machine-api/pods?pretty=True&limit=100
07-12 15:50:11.055  2023-07-12 19:50:09,944 [INFO] Iteration 1: No Terminating Namespaces status: True
07-12 15:50:11.055  2023-07-12 19:50:09,966 [INFO] Iteration 1: Node status: True
07-12 15:50:11.055  2023-07-12 19:50:10,174 [INFO] Iteration 1: Cluster Operator status: True
07-12 15:50:11.055  2023-07-12 19:50:10,206 [INFO] Iteration 1: openshift-user-workload-monitoring: True
07-12 15:50:11.055  2023-07-12 19:50:10,207 [INFO] Iteration 1: openshift-ovirt-infra: True
07-12 15:50:11.055  2023-07-12 19:50:10,228 [INFO] Iteration 1: openshift-host-network: True
07-12 15:50:11.055  2023-07-12 19:50:10,230 [INFO] Iteration 1: openshift-kube-apiserver-operator: True
07-12 15:50:11.055  2023-07-12 19:50:10,253 [INFO] Iteration 1: openshift-insights: True
07-12 15:50:11.055  2023-07-12 19:50:10,263 [INFO] Iteration 1: openshift-machine-api: True
07-12 15:50:11.055  2023-07-12 19:50:10,271 [INFO] Iteration 1: openshift-cluster-machine-approver: True
07-12 15:50:11.055  2023-07-12 19:50:10,280 [INFO] Iteration 1: openshift-cluster-version: True
07-12 15:50:11.055  2023-07-12 19:50:10,293 [INFO] Iteration 1: openshift-service-ca-operator: True
07-12 15:50:11.055  2023-07-12 19:50:10,296 [INFO] Iteration 1: openshift-kni-infra: True
07-12 15:50:11.055  2023-07-12 19:50:10,311 [INFO] Iteration 1: openshift-apiserver-operator: True
07-12 15:50:11.055  2023-07-12 19:50:10,318 [INFO] Iteration 1: openshift-infra: True
07-12 15:50:11.055  2023-07-12 19:50:10,343 [INFO] Iteration 1: openshift-kube-storage-version-migrator: True
07-12 15:50:11.055  2023-07-12 19:50:10,347 [INFO] Iteration 1: openshift-cloud-credential-operator: True
07-12 15:50:11.055  2023-07-12 19:50:10,354 [INFO] Iteration 1: openshift-ingress: True
07-12 15:50:11.055  2023-07-12 19:50:10,383 [INFO] Iteration 1: openshift-cluster-samples-operator: True
07-12 15:50:11.055  2023-07-12 19:50:10,388 [INFO] Iteration 1: openshift-console: True
07-12 15:50:11.055  2023-07-12 19:50:10,413 [INFO] Iteration 1: openshift-node: True
07-12 15:50:11.055  2023-07-12 19:50:10,445 [INFO] Iteration 1: openshift-kube-controller-manager-operator: True
07-12 15:50:11.055  2023-07-12 19:50:10,457 [INFO] Iteration 1: openshift-oauth-apiserver: True
07-12 15:50:11.055  2023-07-12 19:50:10,480 [INFO] Iteration 1: openshift-openstack-infra: True
07-12 15:50:11.055  2023-07-12 19:50:10,506 [INFO] Iteration 1: openshift-config-managed: True
07-12 15:50:11.055  2023-07-12 19:50:10,649 [INFO] Iteration 1: openshift-operator-lifecycle-manager: True
07-12 15:50:11.055  2023-07-12 19:50:10,742 [INFO] Iteration 1: openshift-dns-operator: True
07-12 15:50:11.055  2023-07-12 19:50:10,766 [INFO] Iteration 1: kube-node-lease: True
07-12 15:50:11.055  2023-07-12 19:50:10,865 [INFO] Iteration 1: default: True
07-12 15:50:11.643  2023-07-12 19:50:11,353 [INFO] Iteration 1: openshift-machine-config-operator: True
07-12 15:50:11.643  2023-07-12 19:50:11,549 [INFO] Iteration 1: openshift-kube-apiserver: True
07-12 15:50:11.940  2023-07-12 19:50:11,643 [INFO] Iteration 1: openshift-console-operator: True
07-12 15:50:11.940  2023-07-12 19:50:11,646 [INFO] Iteration 1: openshift-image-registry: True
07-12 15:50:11.940  2023-07-12 19:50:11,679 [INFO] Iteration 1: openshift-vsphere-infra: True
07-12 15:50:11.940  2023-07-12 19:50:11,742 [INFO] Iteration 1: openshift-kube-scheduler-operator: True
07-12 15:50:11.940  2023-07-12 19:50:11,767 [INFO] Iteration 1: kube-public: True
07-12 15:50:11.940  2023-07-12 19:50:11,857 [INFO] Iteration 1: openshift-marketplace: True
07-12 15:50:12.195  2023-07-12 19:50:11,962 [INFO] Iteration 1: openshift-apiserver: True
07-12 15:50:12.195  2023-07-12 19:50:12,064 [INFO] Iteration 1: openshift: True
07-12 15:50:12.195  2023-07-12 19:50:12,069 [INFO] Iteration 1: openshift-monitoring: True
07-12 15:50:12.195  2023-07-12 19:50:12,142 [INFO] Iteration 1: openshift-ingress-operator: True
07-12 15:50:12.195  2023-07-12 19:50:12,171 [INFO] Iteration 1: openshift-config: True
07-12 15:50:12.491  2023-07-12 19:50:12,249 [INFO] Iteration 1: openshift-authentication: True
07-12 15:50:12.491  2023-07-12 19:50:12,251 [INFO] Iteration 1: openshift-controller-manager: True
07-12 15:50:12.491  2023-07-12 19:50:12,273 [INFO] Iteration 1: kube-system: True
07-12 15:50:12.491  2023-07-12 19:50:12,342 [INFO] Iteration 1: openshift-etcd-operator: True
07-12 15:50:12.839  2023-07-12 19:50:12,543 [INFO] Iteration 1: openshift-kube-controller-manager: True
07-12 15:50:12.839  2023-07-12 19:50:12,638 [INFO] Iteration 1: openshift-operators: True
07-12 15:50:12.839  2023-07-12 19:50:12,646 [INFO] Iteration 1: openshift-etcd: True
07-12 15:50:12.839  2023-07-12 19:50:12,741 [INFO] Iteration 1: openshift-kube-storage-version-migrator-operator: True
07-12 15:50:12.839  2023-07-12 19:50:12,742 [INFO] Iteration 1: openshift-network-operator: True
07-12 15:50:12.839  2023-07-12 19:50:12,775 [INFO] Iteration 1: openshift-authentication-operator: True
07-12 15:50:13.094  2023-07-12 19:50:12,841 [INFO] Iteration 1: openshift-service-ca: True
07-12 15:50:13.094  2023-07-12 19:50:12,865 [INFO] Iteration 1: openshift-controller-manager-operator: True
07-12 15:50:13.094  2023-07-12 19:50:12,875 [INFO] Iteration 1: openshift-cluster-storage-operator: True
07-12 15:50:13.094  2023-07-12 19:50:12,943 [INFO] Iteration 1: openshift-cluster-node-tuning-operator: True
07-12 15:50:13.094  2023-07-12 19:50:12,950 [INFO] Iteration 1: openshift-config-operator: True
07-12 15:50:13.094  2023-07-12 19:50:12,984 [INFO] Iteration 1: openshift-multus: True
07-12 15:50:14.040  2023-07-12 19:50:13,858 [INFO] Iteration 1: openshift-cluster-csi-drivers: True
07-12 15:50:14.294  2023-07-12 19:50:14,046 [INFO] Iteration 1: openshift-dns: True
07-12 15:50:14.853  2023-07-12 19:50:14,639 [INFO] Iteration 1: openshift-ovn-kubernetes: True
07-12 15:50:14.853  2023-07-12 19:50:14,735 [INFO] Iteration 1: openshift-kube-scheduler: True
07-12 15:50:14.853  2023-07-12 19:50:14,737 [INFO] HTTP requests served: 0 
07-12 15:50:14.853  
07-12 15:50:15.411  2023-07-12 19:50:15,193 [INFO] []
07-12 15:50:15.411  
07-12 15:50:15.411  2023-07-12 19:50:15,193 [INFO] Sleeping for the specified duration: 3

% oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
........
ip-10-0-214-151.us-east-2.compute.internal   Ready                      worker   4h41m   v1.19.16+8203b20
ip-10-0-215-198.us-east-2.compute.internal   Ready                      worker   4h41m   v1.19.16+8203b20
ip-10-0-215-4.us-east-2.compute.internal     Ready,SchedulingDisabled   worker   4h41m   v1.19.16+8203b20
ip-10-0-218-208.us-east-2.compute.internal   Ready                      worker   4h41m   v1.19.16+8203b20
ip-10-0-220-10.us-east-2.compute.internal    Ready                      worker   4h41m   v1.19.16+8203b20
ip-10-0-221-107.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   5h1m    v1.19.16+8203b20
ip-10-0-221-75.us-east-2.compute.internal    Ready                      master   5h12m   v1.19.16+8203b20

chaitanyaenr commented 1 year ago

If I understand correctly, a node enters the Ready,SchedulingDisabled state when a user intentionally cordons it, which puts it in maintenance mode until the user uncordons it. In that case we can skip reporting it, since the state is intentional from the user's perspective and Cerberus already tracks whether the node is Ready. Thoughts?
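
To make the two options concrete, here is a minimal sketch of an opt-in check. It is not Cerberus's actual code. It assumes node objects are available as plain dicts shaped like the Kubernetes Node JSON, where `spec.unschedulable`, `spec.taints`, and `status.conditions` are real API fields. The function names and the `fail_on_cordoned` flag are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Taint key reported in the issue for the cordoned nodes.
UNSCHEDULABLE_TAINT = "node.kubernetes.io/unschedulable"


def is_ready(node: dict) -> bool:
    """Return True when the node's Ready condition has status "True"."""
    conditions = node.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def is_cordoned(node: dict) -> bool:
    """Return True when the node is marked unschedulable or carries the taint."""
    spec = node.get("spec", {})
    if spec.get("unschedulable"):
        return True
    taints = spec.get("taints") or []
    return any(t.get("key") == UNSCHEDULABLE_TAINT for t in taints)


def check_nodes(
    nodes: List[dict], fail_on_cordoned: bool = False
) -> Tuple[bool, List[str], List[str]]:
    """Return (healthy, not_ready_names, cordoned_names).

    fail_on_cordoned is a hypothetical option: when False, cordoned nodes
    are only listed; when True, they also make the check fail.
    """
    not_ready: List[str] = []
    cordoned: List[str] = []
    for node in nodes:
        name = node["metadata"]["name"]
        if not is_ready(node):
            not_ready.append(name)
        if is_cordoned(node):
            cordoned.append(name)
    healthy = not not_ready and not (fail_on_cordoned and cordoned)
    return healthy, not_ready, cordoned
```

With the flag off, the current behavior is preserved and cordoned nodes are only listed. With it on, a Ready,SchedulingDisabled node fails the run, which is what the original report asks for.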