kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Warn if a new etcd cluster is seeded, move control files, report the last backup snapshot taken #16416

Closed. raffis closed this issue 2 weeks ago.

raffis commented 7 months ago

/kind feature /kind bug

Yesterday we experienced a major outage in our preproduction cluster. To be clear, it was entirely our own fault, but I think we could add various gates to kops to prevent similar things from happening. I will first list my suggestions as bullets and then paste a postmortem I just wrote for our internal docs (there are some hints in there for non platform engineers, sorry about that) which explains in detail what happened.

Some of this might be relevant to https://github.com/kubernetes-sigs/etcdadm rather than kops.

If even one of these suggestions is accepted, I am happy to provide a pull request to implement the changes.

With that said, here is the postmortem, which explains our mistake:

Summary

In the afternoon, a cluster spec change (mainly related to micro cost reductions) as well as instance group changes were applied via kops v1.28.4. The changes were:

The changes were made the usual way using kops edit cluster and kops edit ig <ig-name>. Once edited, they were applied using kops update cluster --yes.
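For reference, that workflow looks roughly like this (a sketch of the commands named above; the angle-bracket names are placeholders):

# Edit the cluster spec and the instance group(s) in the kops state store
kops edit cluster <cluster-name>
kops edit ig <ig-name> --name <cluster-name>

# Preview the pending changes (dry run), then apply them
kops update cluster <cluster-name>
kops update cluster <cluster-name> --yes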

Once this command was applied, the kubernetes-api was unavailable for ~5 minutes. It eventually came back online, but the state was empty, meaning the staging cluster had essentially been created from scratch as a new cluster. It turned out all three master nodes had initialized a new etcd cluster (the database in which the kubernetes-api server stores all resources).

In other words, this update command killed the entire cluster. The outage lasted ~3h until we eventually recovered.

Trigger

A restart of the etcd-manager (issued via a kops update cluster command).

Resolution and root cause

Now, how to resolve this? A backup was required. kops ships etcd-manager, which automatically takes etcd snapshots and stores them in the kops S3 state bucket; see https://github.com/kopeio/etcd-manager.

Usually one can then revert to the last snapshot (taken at a 15 min interval) and recreate the etcd cluster, which resolves the issue.
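For context, the normal restore path would look roughly like this (a sketch based on the kops/etcd-manager backup and restore documentation; verify the flags against your etcd-manager-ctl version):

# List available backups for the main etcd cluster, then request a restore.
etcd-manager-ctl --backup-store=s3://<bucket-name>/<cluster-name>/backups/etcd/main list-backups
etcd-manager-ctl --backup-store=s3://<bucket-name>/<cluster-name>/backups/etcd/main restore-backup <backup-name>
# Repeat for the events cluster, then restart the etcd pods on the control plane
# nodes so the restore command is picked up.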

However, Murphy hit us there. The last backup had been uploaded in January; backups had silently stopped ever since. The reason for this was this PR:

<internal pr>

which is also what caused this outage in the first place. What this PR did was add an S3 lifecycle rule targeting <cluster-name>/backups/etcd/main to delete snapshots older than 90 days. Another micro cost-saving change.
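For illustration, such a rule could have been applied roughly like this (a sketch; only the prefix and the 90-day expiration come from the PR description, the rule ID and exact JSON are made up):

# lifecycle.json (sketch):
# {
#   "Rules": [
#     {
#       "ID": "expire-etcd-backups",
#       "Status": "Enabled",
#       "Filter": { "Prefix": "<cluster-name>/backups/etcd/main/" },
#       "Expiration": { "Days": 90 }
#     }
#   ]
# }
#
# Note: this prefix also matches <cluster-name>/backups/etcd/main/control/,
# which is exactly what made the rule dangerous.
aws s3api put-bucket-lifecycle-configuration \
  --bucket <bucket-name> \
  --lifecycle-configuration file://lifecycle.json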

However, what we were unaware of is that kops/etcd-manager stores control files within the same path, beneath ./control. The lifecycle rule deleted these files alongside the snapshots we actually wanted to delete. etcd-manager detected that the files were gone in its usual reconcile loop, which was visible in /var/log/etcd.log:

I0321 13:02:04.989764    5119 s3fs.go:338] Reading file "s3://<bucket-name>/<cluster-name>/backups/etcd/main/control/etcd-cluster-created"
I0321 13:02:05.382051    5119 controller.go:355] detected that there is no existing cluster
I0321 13:02:05.382063    5119 commands.go:41] refreshing commands
I0321 13:02:05.481669    5119 vfs.go:119] listed commands in s3://<bucket-name>/<cluster-name>/backups/etcd/main/control: 0 commands
I0321 13:02:05.481685    5119 s3fs.go:338] Reading file "s3://<bucket-name>/<cluster-name>/backups/etcd/main/control/etcd-cluster-spec"
I0321 13:02:05.579865    5119 controller.go:388] no cluster spec set - must seed new cluster
I0321 13:02:15.581677    5119 controller.go:185] starting controller iteration
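To check whether a cluster is already in this state, one can look for the control files directly (a sketch; the object names are taken from the log above):

# If these objects are missing, etcd-manager will treat the cluster as new the
# next time it (re)starts.
aws s3 ls s3://<bucket-name>/<cluster-name>/backups/etcd/main/control/
# expected objects: etcd-cluster-created, etcd-cluster-spec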

The relevant part was no cluster spec set - must seed new cluster. This check happens before an etcd snapshot is taken, and since that point no more snapshots were uploaded. We were also unaware of it because there is no monitoring for these snapshots; we only check that our backups are functioning properly before we roll out a kubernetes upgrade.
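A simple freshness check on the backup prefix would have caught this months earlier. A minimal sketch, assuming the AWS CLI, GNU date and the bucket layout shown above (the threshold is arbitrary):

#!/usr/bin/env bash
# Alert if the newest etcd-manager backup object is older than MAX_AGE_MINUTES.
PREFIX="s3://<bucket-name>/<cluster-name>/backups/etcd/main/"
MAX_AGE_MINUTES=60

# Newest object under the prefix, ignoring the control files.
newest=$(aws s3 ls --recursive "$PREFIX" \
  | grep -v '/control/' \
  | sort -k1,2 \
  | tail -n1 \
  | awk '{print $1" "$2}')

if [ -z "$newest" ]; then
  echo "CRITICAL: no etcd backups found under $PREFIX"
  exit 2
fi

age_minutes=$(( ($(date +%s) - $(date -d "$newest" +%s)) / 60 ))
if [ "$age_minutes" -gt "$MAX_AGE_MINUTES" ]; then
  echo "CRITICAL: last etcd backup is ${age_minutes} minutes old"
  exit 2
fi
echo "OK: last etcd backup is ${age_minutes} minutes old"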

And, as the message says, must seed new cluster is exactly what happened when the kops update command restarted the etcd-manager three months later. We were left with a new cluster.

Luckily, etcd-manager archives the old etcd data dir to *-trashcan. The recovery steps were as follows (see the command sketch after the list):

  1. Download the etcd db from the *-trashcan folder directly from the master node to a local machine using scp.
  2. Start a local etcd instance.
  3. Import the db using ETCDCTL_API=3 etcdctl snapshot restore --skip-hash-check=true (this step is necessary because the trashcan is not a snapshot but a plain copy of the etcd data dir).
  4. Restart the instance with the newly created data directory.
  5. Export a db snapshot using ETCDCTL_API=3 etcdctl snapshot save.
  6. Gzip the created snapshot.
  7. Manually create a new backup folder in the state S3 bucket and upload the gzipped snapshot as etcd.backup.gz.
  8. Recreate a _etcd_backup.meta file copied from an old existing backup (or from another cluster).
  9. With both files in place in the new backup folder, it was possible to start etcd-manager-ctl restore-backup with that backup. This command can be executed from a local machine, as the entire restore process is S3-driven.
  10. The etcd pods need to be manually restarted on all control plane nodes using crictl stop <id>, which starts the recovery process and spins up an etcd cluster with the state from before the outage.
  11. kube-apiserver eventually recovered (if I remember correctly, I also killed the processes on all control plane nodes), meaning we had API responses again and our data back.
  12. Some follow-up symptoms: new pods no longer started. The scheduler assigned them but they did not start; apparently kubelet on the nodes did not recover. I simply looped through the node list and issued a kubelet systemd restart. After some pods were restarted (like cluster-autoscaler), everything returned to normal.
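Roughly, in commands (a sketch of the steps above, not a tested runbook; user, host, paths and the backup folder name are placeholders, and the exact trashcan location depends on your setup):

# 1. Copy the archived data dir (*-trashcan) off the control plane node.
scp -r <user>@<control-plane-node>:<path-to>/data-trashcan ./trashcan

# 2./3. Restore the raw db file into a fresh data dir; --skip-hash-check is
#       required because this is a copied data dir, not a real etcdctl snapshot.
ETCDCTL_API=3 etcdctl snapshot restore ./trashcan/member/snap/db \
  --skip-hash-check=true --data-dir ./restored

# 4. Start a local etcd instance on the restored data dir.
etcd --data-dir ./restored &

# 5./6. Take a proper snapshot from the local instance and compress it.
ETCDCTL_API=3 etcdctl snapshot save etcd.backup
gzip etcd.backup

# 7./8. Upload the snapshot plus a copied _etcd_backup.meta into a new backup
#       folder in the kops state bucket.
aws s3 cp etcd.backup.gz    s3://<bucket-name>/<cluster-name>/backups/etcd/main/<backup-folder>/etcd.backup.gz
aws s3 cp _etcd_backup.meta s3://<bucket-name>/<cluster-name>/backups/etcd/main/<backup-folder>/_etcd_backup.meta

# 9. Trigger the restore; the whole process is S3-driven, so this runs locally.
etcd-manager-ctl --backup-store=s3://<bucket-name>/<cluster-name>/backups/etcd/main restore-backup <backup-folder>

# 10. On every control plane node, restart the etcd container so the restore
#     command is processed.
crictl ps --name etcd
crictl stop <container-id>

# 12. If kubelets do not recover, restart them on every node.
systemctl restart kubelet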

Alternative method of disaster recovery

In case there had been no trashcan archive or any other (old) etcd snapshot available, we also back up our cluster with velero. However, restoring from velero means creating a new cluster, and this can have various other implications and is definitely more time consuming.

We are also fully declarative (gitops style) and could reinitialize a new cluster from these specifications. However, there are still some legacy applications which are not yet declarative. Another point I realized after this outage is that I would not be able to restore the sealed secrets, as the private encryption keys would be lost forever (these need to be exported separately in case both velero and the etcd snapshots fail). Some apps also store state directly in kubernetes, which is obviously not declarative and would be lost.
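For the sealed-secrets point: exporting the private keys ahead of time is a one-liner (a sketch assuming the controller lives in kube-system and uses the standard key label from the sealed-secrets documentation; store the output somewhere safe, outside the cluster):

kubectl get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key \
  -o yaml > sealed-secrets-keys.yaml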

Other clusters

This S3 lifecycle change was also introduced to other clusters in January, meaning that since January our clusters have been a ticking time bomb. Any process interruption, node interruption or manual restart would have killed them.

On the etcd-manager leader control plane nodes of these clusters, the same reconcile-loop logs about reseeding the cluster after a restart are present. I recreated these control files manually in the S3 bucket. However, etcd-manager did not pick them up: after analyzing the source code, it turns out it caches the control specs and only reloads them on a restart or a leader change. After some tests in another multi-node cluster, I concluded it would not reseed the cluster once etcd-manager is restarted, as it would find the control files in the bucket again. However, until we restart etcd-manager we are left without backups, because this state is cached.
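Once the control files are confirmed to be back in the bucket, the cached state can be cleared by restarting etcd-manager on the control plane nodes, for example with the same crictl approach used during the recovery (a sketch; do this only after verifying the control files exist, per the tests described above):

# On each control plane node, starting with the etcd-manager leader:
crictl ps --name etcd-manager
crictl stop <container-id>
# kubelet restarts the static pod, which re-reads the control files from S3
# and resumes taking backups.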

Conclusion

The lifecycle rule should never have been created. It was definitely our own fault. The retention should rather have been configured natively via kops, see https://kops.sigs.k8s.io/cluster_spec/#etcd-backups-retention.
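For reference, a sketch of what the native configuration might look like (the backupRetentionDays field name is taken from the linked cluster_spec documentation; verify it against your kops version):

kops edit cluster <cluster-name>
# then set, in the cluster spec:
#   etcdClusters:
#   - name: main
#     manager:
#       backupRetentionDays: 90
#   - name: events
#     manager:
#       backupRetentionDays: 90
kops update cluster <cluster-name> --yes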

That said, I also don't expect such important control files to live beneath a backup path. Also, kops did not warn me that the update would A. kill the cluster and that B. there was no recent backup.

That these files would be recreated was visible in the kops dry run (which I only recognized while writing this document). But even if I had seen this before applying, I would not have realized that it would kill etcd:

Will create resources:
  ManagedFile/etcd-cluster-spec-events
    Base                    s3://<bucket-name>/<cluster-name>/backups/etcd/events
    Location                /control/etcd-cluster-spec

  ManagedFile/etcd-cluster-spec-main
    Base                    s3://<bucket-name>/<cluster-name>/backups/etcd/main
    Location                /control/etcd-cluster-spec

We have been extremely lucky that there was no process interruption in production thus far; the cluster changes I made today happened to trigger this on the staging cluster first.

raffis commented 5 months ago

Any comment on this 🙏🏻 ?

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/kops/issues/16416#issuecomment-2408942252):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.