orange-cloudfoundry / k3s-wrapper-boshrelease

k3s wrapper scripts bosh release
Apache License 2.0

post-start should check Node status.Conditions (expecting Status=false) #17

Open poblin-orange opened 12 months ago

poblin-orange commented 12 months ago

To benefit from the bosh canary / max-in-flight mechanism, the bosh release should check every k8s node status.conditions entry during bosh post-start. The expected state is Status=False. Status=True should make post-start fail, which stops the deployment before it impacts the following instance groups.

eg: kubectl wait --for=condition=Ready node/agents-concourse-r1-z1-0 --timeout=10s

  conditions:
  - lastHeartbeatTime: "2023-08-29T17:04:07Z"
    lastTransitionTime: "2023-08-29T17:04:07Z"
    message: Cilium is running on this node
    reason: CiliumIsUp
    status: "False"
    type: NetworkUnavailable

  - lastHeartbeatTime: "2023-09-20T15:35:02Z"
    lastTransitionTime: "2023-09-09T23:53:50Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready

Note that Ready has the opposite polarity: for this condition, Status=True is the expected (healthy) state.

https://kubernetes.io/docs/reference/node/node-status/#condition

| Node Condition | Description |
| -- | -- |
| Ready | True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds) |
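A minimal sketch of what such a post-start check could look like, assuming `kubectl` is on the PATH and already configured for the cluster. The function names `check_conditions` and `node_conditions` are hypothetical and not part of the release. The evaluation rule follows the description above: Ready must be True, every other condition must be False, and anything else (including Unknown) is treated as a failure.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the bosh release.

# Reads "Type=Status" lines on stdin. Returns 0 only if Ready=True was seen
# and every other condition is False; returns 1 otherwise.
check_conditions() {
  local rc=0 seen_ready=0 line type status expected
  while IFS= read -r line || [ -n "$line" ]; do
    if [ -z "$line" ]; then continue; fi
    type="${line%%=*}"
    status="${line#*=}"
    if [ "$type" = "Ready" ]; then
      expected="True"   # Ready is the one condition with positive polarity
      seen_ready=1
    else
      expected="False"  # e.g. NetworkUnavailable, KernelDeadlock
    fi
    if [ "$status" != "$expected" ]; then
      echo "condition ${type} is ${status}, expected ${expected}" >&2
      rc=1
    fi
  done
  if [ "$seen_ready" -ne 1 ]; then
    echo "no Ready condition reported" >&2
    rc=1
  fi
  return "$rc"
}

# Prints the conditions of one node as "Type=Status" lines.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

In a post-start script this could be called as `node_conditions agents-concourse-r1-z1-0 | check_conditions || exit 1`. Because a missing Ready condition is treated as a failure, an empty result from a failed `kubectl` call also fails the check.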

The standard node conditions are documented at https://kubernetes.io/docs/reference/node/node-status/#condition. Third-party components such as node-problem-detector can set additional node conditions; see https://kubernetes.io/docs/tasks/debug/debug-cluster/monitor-node-health/#exporter

https://github.com/kubernetes/node-problem-detector/blob/ed94dff2cd827764dc43a9c90b0b3af773457dbd/config/kernel-monitor.json#L67-L70

"condition": "KernelDeadlock",