jhunt / k8s-boshrelease

A BOSH Release for deploying Kubernetes clusters
MIT License

Rescaling the deployment causes problems with not-ready nodes still present #70

Open obeyler opened 4 years ago

obeyler commented 4 years ago

When you scale down the number of nodes in the BOSH deployment YAML, the deleted nodes are still seen as non-responding. If you then have to redeploy, the control plane considers the deployment failed because some nodes are not in the Ready state.

To solve this, we need to run kubectl delete node xxxx to remove the node record from etcd. I think we can add this command to the drain script.
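
A minimal sketch of what that drain addition could look like; the <vm-id>.k8s node-name pattern is taken from the output below, and the rest is an assumption, not the release's actual drain script:

#!/bin/bash
# hypothetical drain addition: remove this node's record from the API server.
# Assumes kubectl is on PATH with a kubeconfig allowed to delete nodes.
NODE_NAME="$(hostname).k8s"   # node names below look like <bosh-vm-id>.k8s
kubectl delete node "$NODE_NAME" || true   # tolerate a record that is already gone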

After scaling down, here is what we see for the deleted VM:

kubectl describe node 0b577d71-ea0d-4f7b-947c-c8a13275b82b.k8s
Name:               0b577d71-ea0d-4f7b-947c-c8a13275b82b.k8s
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=0b577d71-ea0d-4f7b-947c-c8a13275b82b.k8s
                    kubernetes.io/os=linux
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 25 Aug 2020 15:27:16 +0000
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
                    node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  0b577d71-ea0d-4f7b-947c-c8a13275b82b.k8s
  AcquireTime:     <unset>
  RenewTime:       Wed, 26 Aug 2020 15:25:19 +0000
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------    -----------------                 ------------------                ------              -------
  NetworkUnavailable   False     Tue, 25 Aug 2020 15:39:42 +0000   Tue, 25 Aug 2020 15:39:42 +0000   WeaveIsUp           Weave pod has set this
  MemoryPressure       Unknown   Wed, 26 Aug 2020 15:25:01 +0000   Wed, 26 Aug 2020 15:26:01 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Wed, 26 Aug 2020 15:25:01 +0000   Wed, 26 Aug 2020 15:26:01 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Wed, 26 Aug 2020 15:25:01 +0000   Wed, 26 Aug 2020 15:26:01 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Wed, 26 Aug 2020 15:25:01 +0000   Wed, 26 Aug 2020 15:26:01 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
jhunt commented 4 years ago

Question: is there a problem with treating "liveness" of the kubelet process as the indicator for "liveness" of the etcd node record? Put another way, does it make sense to delete the node every time the kubelet process is stopped via monit, and add it back in on monit start?

obeyler commented 4 years ago

In fact, when kubelet is down, the node is down. When kubelet is up, the node can be Ready or NotReady depending on the network layer. So I think it's a good idea to delete the node when kubelet dies.

jhunt commented 4 years ago

To clarify, I'm talking specifically about monit stop and monit start events. If the kubelet process crashes out-of-band, the node record should (and will!) remain in the etcd store, so that you can see and troubleshoot that: "hey, that third node is NotReady!" vs. "my 2-node cluster is healthy, I don't see what ----- hey wait a minute, where's the third node?!"

That neatly sidesteps the scenario where people are skipping drains, or haven't enabled pod drainage in the deployment.

obeyler commented 4 years ago

Maybe I can put this inside the pre-stop script and test:

BOSH_VM_NEXT_STATE = delete
BOSH_INSTANCE_NEXT_STATE = delete
BOSH_DEPLOYMENT_NEXT_STATE = keep

which means the instance is being scaled down (https://www.bosh.io/docs/pre-stop/#environment-variables).
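
A minimal sketch of that pre-stop check, assuming the flag is communicated via a file on disk (the flag path is illustrative):

#!/bin/bash
# hypothetical pre-stop sketch: per https://bosh.io/docs/pre-stop/#environment-variables,
# this combination of next-states indicates the instance is being scaled down
if [[ "${BOSH_VM_NEXT_STATE}" == "delete" && \
      "${BOSH_INSTANCE_NEXT_STATE}" == "delete" && \
      "${BOSH_DEPLOYMENT_NEXT_STATE}" == "keep" ]]; then
  # leave a flag on disk for the drain script to pick up (path is an assumption)
  touch /var/vcap/data/kubelet/scale-down-in-progress
fi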

obeyler commented 4 years ago

In fact, no: we can detect inside pre-stop that the number of node instances is being scaled down, and set a flag telling the drain script that it needs to delete the node. That way the node is deleted only in the scale-down case; playing with monit start/stop on kubelet doesn't delete the node. I propose this as PR #72, sketched below.
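
A sketch of the drain-side counterpart under the same assumptions (the flag path and node-name derivation are illustrative, not necessarily what PR #72 does):

#!/bin/bash
# hypothetical drain sketch: only delete the node record when pre-stop flagged a scale-down
FLAG=/var/vcap/data/kubelet/scale-down-in-progress
if [[ -f "$FLAG" ]]; then
  kubectl delete node "$(hostname).k8s" || true
  rm -f "$FLAG"
fi
echo 0   # a BOSH drain script must print an integer: seconds to wait, 0 = done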

jhunt commented 4 years ago

Yeah, I figured pre-stop would have to use disk to communicate to a future post-deployment hook.

Is there a reason you didn't want monit stop to handle this?

obeyler commented 4 years ago

I thought that was what you wanted, since you said:

To clarify, I'm talking specifically about monit stop and monit start events. If the kubelet process crashes out-of-band, the node record should (and will!) remain in the etcd store, so that you can see and troubleshoot that: "hey, that third node is NotReady!" vs. "my 2-node cluster is healthy, I don't see what ----- hey wait a minute, where's the third node?!"

jhunt commented 4 years ago

I was saying I wanted the node records to persist if kubelet stops for any reason other than a monit stop.