small-hack / smol-k8s-lab

☁️ CLI & TUI with a smol friendly vibe to get started with Kubernetes on metal, then bootstrap apps using Argo CD 🧸 Great for testing webapps and benchmarking.
https://small-hack.github.io/smol-k8s-lab/
GNU Affero General Public License v3.0

fix up zitadel backups/restores and always store the zitadel admin sa secret in bitwarden if possible #260

Closed jessebot closed 2 weeks ago

jessebot commented 2 weeks ago

So, we hadn't actually tested a full restore of zitadel before, only backups. We discovered that we crucially need to save the zitadel admin service account private key that zitadel generates, because it won't generate another one. Because of that, we now immediately store that key in Bitwarden, so that users can safely destroy the whole namespace and still restore. We also found some other minor bugs with the zitadel restores, including calling the restore functions improperly and using an old feature branch for restoring secrets.

We also now clear the local seaweedfs postgresql bucket after we restore anything using the cnpg operator for postgresql. This affects nextcloud, matrix, zitadel, and mastodon. We do this because if we don't, future backups after that restore will fail, as the backup process expects that bucket to be empty, which is really weird and confusing. You'll still always have remote backups, so it's not a huge deal. This hack is the only way to handle it short of keeping track of the backup bucket name and renaming it, which can get out of hand, as we'd need to store a cache or look up the bucket each time we did a backup, and that feels bad?
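
For anyone curious what "clearing the bucket" amounts to, here is a minimal sketch, not the code from this PR. It assumes the local seaweedfs bucket is reached through an S3-compatible API with a boto3-style client; the function name `clear_bucket_prefix` and the bucket/prefix values are hypothetical and only there for illustration.

```python
from typing import Any


def clear_bucket_prefix(s3_client: Any, bucket: str, prefix: str = "") -> int:
    """Delete every object under `prefix` in `bucket`; return how many were deleted.

    `s3_client` is assumed to expose boto3's `list_objects_v2` and
    `delete_objects` methods (any S3-compatible client shaped like that works).
    """
    deleted = 0
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # list_objects_v2 returns at most 1000 keys per page, which is
            # also the per-request limit for delete_objects
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
        if not page.get("IsTruncated"):
            return deleted
        # keep listing from where the previous page stopped
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

With boto3 this would be called with something like `boto3.client("s3", endpoint_url=...)` pointed at the seaweedfs S3 endpoint, right after the cnpg restore finishes and before the next scheduled backup runs. The endpoint and bucket name depend on your setup and are not taken from this PR.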

We've managed to restore a production cluster, but only from a manual backup that happened on May 10th, which is pretty bad :/ We'll have to figure out why our nightly backups were failing, and then the next task should really be setting up proper alerting to matrix so we get alerts when backups are failing :facepalm: Will report more details after we fix backups.

Update: it was because we didn't clear the cnpg backup bucket properly before, but we've fixed that now :)