sapcc / kubernikus

Kubernetes as a Service for Openstack
Apache License 2.0

cluster termination stuck due to pvc/pv cleanup #603

Closed databus23 closed 3 years ago

databus23 commented 3 years ago

We regularly have clusters stuck in our e2e pipeline, clogging up the soak test. We should look into fixing those automatically instead of cleaning them up manually in all regions.

Just today I looked into a stuck e2e cluster and I observed the following situation:

  1. The pvc was stuck in Terminating with no events. I fixed that by removing the pvc-protection finalizer. There were no pods left in the API, so it was not clear to me why the finalizer had not been removed.

  2. After the pvc was removed, the pv was stuck in Terminating and events were popping up:

     Warning VolumeFailedDelete 3m27s (x9 over 7m42s) cinder.csi.openstack.org_e2e-tz49n-4364e670df6e4e1ea1a355b80e6275d7-csi-7c949dc4d-m6j4w_fe2712f2-5e49-491f-8586-9a32d87c9083 persistentvolume pv-e2e-tz49n-4364e670df6e4e1ea1a355b80e6275d7-67e3c736-dce7-4fc9-8d21-0cf166637595 is still attached to node kks-e2e-tz49n-small-rsx28

     There were no nodes in the cluster and no volumes remaining in the project, so this is also puzzling.
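For reference, the manual workaround from step 1 could be automated roughly like this. It is only a sketch, not something kubernikus does today: it takes a PVC or PV object as a plain dict (as returned by the API in JSON), decides whether it has been Terminating for too long, and builds a JSON merge patch that drops the protection finalizer. The helper names, the 10 minute grace period and the example object are all assumptions made up for illustration; only the two finalizer names and the metadata fields are real.

```python
import json
from datetime import datetime, timedelta, timezone

# Real Kubernetes finalizer names for storage object protection.
PVC_PROTECTION = "kubernetes.io/pvc-protection"
PV_PROTECTION = "kubernetes.io/pv-protection"


def parse_timestamp(value):
    """Parse a second-precision UTC timestamp such as 2021-01-01T11:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_stuck(obj, now, grace=timedelta(minutes=10)):
    """Return True if obj has been Terminating for longer than grace.

    The grace period is an arbitrary assumption, not a kubernikus setting.
    """
    deleted_at = obj.get("metadata", {}).get("deletionTimestamp")
    if deleted_at is None:
        return False  # the object is not being deleted at all
    return now - parse_timestamp(deleted_at) > grace


def finalizer_patch(obj, finalizer):
    """Build a JSON merge patch that drops finalizer, or None if it is absent.

    A merge patch replaces the whole list, so all other finalizers are kept
    by listing them explicitly.
    """
    finalizers = obj.get("metadata", {}).get("finalizers") or []
    if finalizer not in finalizers:
        return None
    remaining = [f for f in finalizers if f != finalizer]
    return {"metadata": {"finalizers": remaining}}


if __name__ == "__main__":
    # Hypothetical PVC, shaped like the stuck one described above.
    pvc = {
        "metadata": {
            "name": "example-pvc",
            "deletionTimestamp": "2021-01-01T11:00:00Z",
            "finalizers": [PVC_PROTECTION],
        }
    }
    now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
    if is_stuck(pvc, now):
        print(json.dumps(finalizer_patch(pvc, PVC_PROTECTION)))
```

The resulting patch is what one would otherwise pass by hand to `kubectl patch pvc <name> --type merge -p '<patch>'`. Note that this only papers over the symptom; it does not explain why the finalizer was left behind.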

databus23 commented 3 years ago

This whole problem seems to be caused by the etcd restore from backup that we do in the e2e test prior to deleting the clusters. The restore sometimes trips up the informer cache in the apiserver, so that resources that are long gone are still returned by a watch. Deleting the apiserver seems to be the only thing that is required to unblock the deletion of the pvc and pv.

Somehow this feels like groundhog day. We already found that out a while back and implemented a liveness probe on the apiserver to force a restart after the etcd is restored from backup.
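To illustrate the idea behind such a probe (this is not the probe we actually ship, just a minimal sketch under assumptions): an etcd restore from a snapshot normally moves the store revision backwards, so a check that remembers the highest revision it has seen can report unhealthy as soon as the current revision is lower. How the revision is obtained is deliberately left out; `check_revision`, the state file and its location are hypothetical.

```python
import json


def check_revision(state_path, current_revision):
    """Return True (healthy) unless the revision moved backwards.

    state_path is a hypothetical file on the container's own filesystem,
    so it disappears when the container is restarted.
    """
    last_seen = None
    try:
        with open(state_path) as fh:
            last_seen = json.load(fh)["revision"]
    except (FileNotFoundError, ValueError, KeyError):
        pass  # first run or unreadable state: nothing to compare against

    if last_seen is not None and current_revision < last_seen:
        # Keep the old state so the check keeps failing on every call
        # until the container, and with it the state file, is replaced.
        return False

    with open(state_path, "w") as fh:
        json.dump({"revision": current_revision}, fh)
    return True
```

Leaving the state untouched on failure matters because a liveness probe only triggers a restart after several consecutive failures. A wrapper script would exit non-zero whenever `check_revision` returns False.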

It seems like this is not working anymore for newer versions of Kubernetes where we changed the liveness probe. I don't see any restarts of the apiserver in the e2e test. We should add an e2e test for that and then of course fix the liveness probe.
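The e2e check could boil down to something like the following sketch: capture the apiserver pod before the etcd restore and again afterwards, then assert that the container was restarted or the pod was replaced. It works on pod objects as plain dicts and relies only on the standard `metadata.uid` and `status.containerStatuses[].restartCount` fields; the function names and the container name `apiserver` are assumptions for illustration.

```python
def restart_count(pod, container_name):
    """Return the restartCount of the named container in a pod dict."""
    for status in pod.get("status", {}).get("containerStatuses", []):
        if status.get("name") == container_name:
            return status.get("restartCount", 0)
    raise KeyError(container_name)


def restarted_since(before, after, container_name):
    """True if the container restarted or the whole pod was replaced."""
    if before["metadata"]["uid"] != after["metadata"]["uid"]:
        # A new pod starts counting from zero again, so compare UIDs first.
        return True
    return restart_count(after, container_name) > restart_count(before, container_name)


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after the etcd restore.
    before = {
        "metadata": {"uid": "a"},
        "status": {"containerStatuses": [{"name": "apiserver", "restartCount": 0}]},
    }
    after = {
        "metadata": {"uid": "a"},
        "status": {"containerStatuses": [{"name": "apiserver", "restartCount": 1}]},
    }
    assert restarted_since(before, after, "apiserver")
```

In the real test the second snapshot would have to be polled with a timeout, since the restart only happens after the probe has failed a few times.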