siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

[feature] add deduplication, pruning, etc to `etcd` backups #610

Open judahrand opened 1 month ago

judahrand commented 1 month ago

Problem Description

Sometimes cluster state does not change for an extended period. In these cases it is wasteful to store an entire snapshot of etcd in object storage for every backup. Additionally, given that there is no removal mechanism, the amount of storage used is unbounded.

Solution

Use a proper backup tool like restic to handle the backups. This would add deduplication, pruning, expiry, etc. to the backups.

Alternative Solutions

No response

Notes

No response

smira commented 1 month ago

You can control the lifetime of the backup yourself by attaching S3 Bucket Lifecycle policy to your bucket, Omni doesn't enforce that for you, it's totally your choice.
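For illustration, a minimal sketch of what such a lifecycle policy might look like, built as the JSON document that S3 accepts for a bucket lifecycle configuration. The 30-day window, the rule ID, and the helper name `expiration_policy` are assumptions made for this example; nothing here is taken from Omni.

```python
import json


def expiration_policy(days: int, prefix: str = "") -> dict:
    """Build an S3 lifecycle configuration that expires objects after `days`.

    `prefix` limits the rule to keys under that prefix; an empty prefix
    applies the rule to every object in the bucket.
    """
    return {
        "Rules": [
            {
                "ID": f"expire-backups-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Expiration": {"Days": days},
            }
        ]
    }


# Assumed retention window of 30 days, purely for illustration.
policy = expiration_policy(30)
print(json.dumps(policy, indent=2))
```

The printed JSON could be saved to a file and attached with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://policy.json`. Note that a rule like this only expresses "delete after N days", which is the limitation raised in the next comment.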

As each backup is encrypted, you can't use "easy" tools to compare backups.

judahrand commented 1 month ago

> You can control the lifetime of the backup yourself by attaching S3 Bucket Lifecycle policy to your bucket, Omni doesn't enforce that for you, it's totally your choice.

This is true, but it doesn't support the more complex schemes that proper tools do, like "Keep one hourly backup, one daily backup and one weekly backup".
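To make the difference concrete, here is a rough sketch of a bucketed retention scheme of the "keep N hourly, N daily, N weekly" kind. It is a simplified illustration of the general idea, not restic's actual implementation; the function name `select_keep` and the sample timestamps are made up for this example.

```python
from datetime import datetime


def select_keep(snapshots, hourly=1, daily=1, weekly=1):
    """Return the set of snapshot timestamps to keep.

    For each period, walk the snapshots newest-first and keep the newest
    snapshot in each of the most recent N distinct buckets. A snapshot kept
    by one rule is kept overall; everything else is a pruning candidate.
    """
    rules = [
        (hourly, lambda t: (t.year, t.month, t.day, t.hour)),
        (daily, lambda t: (t.year, t.month, t.day)),
        (weekly, lambda t: tuple(t.isocalendar())[:2]),  # (ISO year, ISO week)
    ]
    keep = set()
    ordered = sorted(snapshots, reverse=True)
    for limit, bucket_of in rules:
        seen = set()
        for ts in ordered:
            if len(seen) >= limit:
                break
            bucket = bucket_of(ts)
            if bucket not in seen:
                seen.add(bucket)
                keep.add(ts)
    return keep


# Hypothetical snapshots: two per day on four different days.
snaps = [datetime(2024, 1, d, h) for d in (1, 8, 14, 15) for h in (0, 12)]
for ts in sorted(select_keep(snaps, hourly=2, daily=2, weekly=2)):
    print(ts.isoformat())
```

In this model the newest snapshot satisfies every rule at once, so the limits only start to differ from a plain "delete after N days" rule when they are larger than one or when backups are taken at irregular intervals.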

> As each backup is encrypted, you can't use "easy" tools to compare backups.

Yes, in the current implementation this is true. But backup tools like restic fully support encryption alongside deduplication. I'm not sure why the encryption couldn't be left to the backup tool with the encryption key provided by Omni?
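As a rough sketch of why deduplication helps when cluster state rarely changes, the toy store below keeps each unique chunk once, keyed by its hash. It uses fixed-size chunks for simplicity, whereas restic uses content-defined chunking, and it omits encryption entirely; in a real tool each chunk would be encrypted before upload. The class name `ChunkStore` and the chunk size are assumptions for this example.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # fixed-size chunks; an assumption for this sketch


class ChunkStore:
    """Toy content-addressed store: each unique chunk is stored only once."""

    def __init__(self):
        self.chunks = {}  # sha256 hex digest -> chunk bytes

    def put_snapshot(self, data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
        """Store a snapshot and return its manifest (ordered chunk digests)."""
        manifest = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # already-stored chunks are skipped
            manifest.append(digest)
        return manifest

    def get_snapshot(self, manifest: list) -> bytes:
        """Reassemble a snapshot from its manifest."""
        return b"".join(self.chunks[digest] for digest in manifest)


store = ChunkStore()
snapshot = b"x" * 10 + b"y" * 10
first = store.put_snapshot(snapshot, chunk_size=10)
second = store.put_snapshot(snapshot, chunk_size=10)  # unchanged state, nothing new stored
print(len(store.chunks), first == second)
```

Backing up an unchanged snapshot a second time adds only a new manifest, not new chunk data, which is the saving the original request is after.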

It just strikes me that the right tool for a backup job is a backup tool 🤷‍♂️ If Omni used something like restic, then Omni still wouldn't actually have to handle the expiry. That could be left up to the user to orchestrate on the backup repository. Many of these tools are designed to maintain their state in dumb storage, so all a user needs is the correct client (and the password/encryption key) to interact with a given backup repository. To be clear, other 'enterprise-y' tools do use restic: TrueNAS, for example, uses restic to support their new TrueCloud Backups.

smira commented 1 month ago

There are certainly multiple ways to implement any single feature, and each implementation has its own pros and cons. Omni is not only a homelab solution, so certainly a set of different requirements than TrueNAS is also there.

judahrand commented 1 month ago

It may well be that something is being lost in communication as it is often tricky to convey tone in writing but I feel like your response to this feature request has come off as quite dismissive and superior. I'm not trying to be difficult and as I've tried to make clear by already contributing to siderolabs/pkgs and siderolabs/extensions I'm here to be a contributing member of the community who uses your software at home and, hopefully, encourages employers to do the same!

I think iXSystems would beg to differ that TrueNAS is primarily a homelab solution. That argument is no different than suggesting that Talos/Omni is a homelab solution because some hobbyists use it. Omni even has a 'Hobby' tier, of which I am a customer! I'd agree that Omni is not a NAS solution; however, if you are offering backup functionality, it doesn't seem silly to suggest that a backup tool is used, does it? I also don't think that better backup functionality is either completely out of scope for Omni or something that enterprise customers would be uninterested in.

I can see that it might not make sense for Siderolabs to put development time or effort towards it, as there are probably things which enterprise customers would value more highly. But that doesn't mean that someone from the community might not like to contribute such an improvement. That is surely a big part of the benefit of BSL/Source Available codebases?

If the answer here is "No, that feature isn't something that Siderolabs would be interested in having Omni support" then that's fine of course - your codebase your rules. But I'm not sure you've said that nor given a coherent reason that such a feature either couldn't work or wouldn't be useful.

smira commented 1 month ago

If you scroll back, I offered you potential solutions for the issues you posted:

The backup encryption the way it is implemented has some set of good things - it's zero configuration for the user (except for providing S3 bucket access). Relying on any external tooling means that you need a power user capable of configuring that external tool. That is unlikely to change (zero configuration philosophy).

Omni doesn't limit you in the way you want to do backups - you can do it from within the cluster itself using Talos API via Kubernetes access, or using talosctl externally. You can build your own flow which works for you.
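As a sketch of the external `talosctl` route, the snippet below only builds the command line for taking an etcd snapshot from a control plane node. It assumes the `talosctl etcd snapshot <path>` subcommand described in the Talos disaster recovery documentation; the node address and output path are placeholders, and the helper name `etcd_snapshot_command` is made up for this example.

```python
import shlex


def etcd_snapshot_command(node: str, out_path: str) -> list:
    """Build the argv for snapshotting etcd from one control plane node."""
    return ["talosctl", "--nodes", node, "etcd", "snapshot", out_path]


# Placeholder node IP and file name, for illustration only.
cmd = etcd_snapshot_command("10.0.0.2", "etcd.snapshot")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

A scheduled job could run this and hand the resulting file to whatever backup tool the user prefers.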

smira commented 1 month ago

And one more thing - contributing something upstream doesn't mean it's zero effort for the upstream. Supporting any extra feature, extension, etc. is a huge amount of effort we have to put in, so we are cautious on accepting any new feature or addition.

judahrand commented 1 month ago

> And one more thing - contributing something upstream doesn't mean it's zero effort for the upstream. Supporting any extra feature, extension, etc. is a huge amount of effort we have to put in, so we are cautious on accepting any new feature or addition.

Absolutely, I'm fully aware of this and I agree that merging an unmaintainable feature is worse than no feature at all. Plus, there is even overhead to putting resources towards reviewing contributions. I'm certainly under no illusions that contributing a few lines to enable a kernel module via an extension gives me any authority or is some fantastic, selfless act.

> If you scroll back, I offered you potential solutions for the issues you posted:

Sure - I'm not disagreeing that that solution addresses part of what I'd like to be able to achieve and as I say, perhaps, something is being lost in communication (in both directions).

> The backup encryption the way it is implemented has some set of good things - it's zero configuration for the user (except for providing S3 bucket access). Relying on any external tooling means that you need a power user capable of configuring that external tool. That is unlikely to change (zero configuration philosophy).

I'm not actually sure that that is any different to what I'm suggesting. All that restic needs is a bucket and an encryption key. The encryption key could still be generated within Omni and used to initialize the restic repository. I don't think this is any different to how it works now?
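A minimal sketch of that flow, assuming the restic CLI and its `RESTIC_REPOSITORY` and `RESTIC_PASSWORD` environment variables. The key generation shown is a stand-in and is not how Omni produces its keys; the bucket name and the helper `restic_plan` are hypothetical. The snippet only assembles the commands and does not run them.

```python
import secrets


def restic_plan(repository: str, password: str) -> dict:
    """Return the environment and restic commands for a zero-config backup flow."""
    env = {
        "RESTIC_REPOSITORY": repository,  # e.g. an S3-backed repository
        "RESTIC_PASSWORD": password,      # key generated and held by the orchestrator
    }
    commands = [
        # One-time repository initialisation.
        ["restic", "init"],
        # Stream a snapshot from stdin instead of writing it to disk first.
        ["restic", "backup", "--stdin", "--stdin-filename", "etcd.snapshot"],
        # Optional retention, which could equally be left to the user.
        ["restic", "forget", "--keep-hourly", "1", "--keep-daily", "1",
         "--keep-weekly", "1", "--prune"],
    ]
    return {"env": env, "commands": commands}


key = secrets.token_urlsafe(32)  # stand-in for a key generated by the orchestrator
plan = restic_plan("s3:s3.amazonaws.com/example-bucket", key)
for argv in plan["commands"]:
    print(" ".join(argv))
```

The S3 credentials would still have to be supplied, for instance through `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which matches the "bucket access only" configuration described in the quoted comment.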

The only addition to this that might be nice if restic were used would be to allow the (power) user to extract the encryption key from Omni in order to manage the restic repository externally. And, to be fair, that might be nice to have even now! What if one needs/wants to decrypt an etcd snapshot but for whatever reason they can't access Omni?

A solution which uses a more sophisticated backup tool would:

I realize that I've focused on restic as a specific tool and that is really only because I am familiar with it and it seems to fit the use case quite well.

judahrand commented 1 month ago

If the answer to this feature request is definitely "No, beyond the scope of Omni" maybe it is best that we leave this conversation here and close the issue? I, honestly, can't see a UX or technical reason why what I'm suggesting isn't valid or is a bad idea. But the argument against implementing a feature outside of core functionality is totally reasonable (though one does have to pose the question that if that is the case why are etcd backups implemented at all?)

smira commented 1 month ago

The etcd backup is double-encrypted, by Omni and by Kubernetes, so even if you decrypt the Omni layer, the contents (Secrets) are still encrypted, so it is still not very useful.

The backups serve an exact purpose - to recover from hard controlplane failures, which Omni automates for you by having control of the backups and exactly the way they are encrypted. The second use of backups is restoring to a new cluster to do point-in-time recovery, but that is way less frequent.

So 99% of the time you just need the latest backup to recover the controlplane. If you have some other use cases for backups, let's discuss them; taking those backups yourself or using talos-backup is probably a better idea.