seaweedfs / seaweedfs-csi-driver

SeaweedFS CSI Driver https://github.com/seaweedfs/seaweedfs
Apache License 2.0

VolumeSnapshot support #179

Open Givemeurcookies opened 2 months ago

Givemeurcookies commented 2 months ago

Related to #79. Are there any plans to add VolumeSnapshots, and if so, is there an ETA for it?

Currently there are quite a lot of workarounds needed, e.g. backups via Volsync + Restic and direct copy, which mounts the PVC and copies the data "manually" over to a new PVC or to external storage for backup.

Whenever an application is updated there's always a risk that it breaks something and requires a rollback of both the application and the storage. To guarantee that the backups are consistent, some applications have to be shut down completely, and sometimes root access is even needed to copy all the files.

Compared to how e.g. Restic direct copy works, VolumeSnapshots are faster, more secure (no root pod is needed, which also reduces the chance of permission issues when restoring), give a better consistency guarantee since the whole PVC is copied, and require fewer resources. This becomes increasingly important the more applications you need to back up in the system. Replication is also not a failsafe, and backups are required in a production system, e.g. if the storage system/network gets fully saturated for some reason and data cannot be synchronised.

There's also the fact that most other CSI drivers now support VolumeSnapshots and even PVC cloning. It's overall just easier to work with in a Kubernetes-native way, and a lot of backup tools for Kubernetes have put most of their development effort over the last few years into VolumeSnapshots.
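For reference, this is roughly what the Kubernetes-native workflow would look like if the driver supported snapshots. It's only a sketch of the standard external-snapshotter API, not something the seaweedfs-csi-driver does today; the class and StorageClass names here are made up:

```sh
# Sketch only: assumes the seaweedfs-csi-driver implemented CSI snapshots and a
# VolumeSnapshotClass named "seaweedfs-snapclass" existed (it does not today).

# 1. Take a snapshot of an existing PVC.
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-app-snap
spec:
  volumeSnapshotClassName: seaweedfs-snapclass   # hypothetical class
  source:
    persistentVolumeClaimName: my-app-data       # existing PVC
EOF

# 2. Restore it into a new PVC by using the snapshot as a dataSource.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data-restored
spec:
  storageClassName: seaweedfs-storage            # whatever StorageClass the driver uses
  dataSource:
    name: my-app-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
EOF
```

Backup tools can then drive this flow per PVC without mounting anything or needing root.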

chrislusf commented 2 months ago

You may want to check this feature: https://seaweedfs.com/docs/admin/features/#point-in-time-recovery . It can basically back up data continuously and create a snapshot at any point in time.

Need to investigate how to better fit the Kubernetes model, especially how to restore a snapshot.

Givemeurcookies commented 2 months ago

I've looked at the PIT recovery; however, from what I can see it is not available in the free plan. I believe SeaweedFS needs to make it easier to support a 3-2-1 backup strategy in a Kubernetes-native way. There are countless examples where this has been critical, e.g. recently when Google deleted the Australian pension fund's cloud account - what saved them was that they had a backup at another provider.

Just as a concrete example (and most likely a future ticket) of why VolumeSnapshots or a Kubernetes-native way to back up SeaweedFS is important: We run a Talos cluster where 3 nodes run SeaweedFS with 1 master, 3 filers and 3 volume servers, plus an external S3 storage tier. We had an issue just yesterday where only one node was wiped during an upgrade (the default for Talos unless you set a flag to preserve the filesystem during upgrades), and it seems like this put SeaweedFS into a bad state that we weren't able to recover from, even after a full reinstall of SeaweedFS. We use the Rancher local storage provisioner, and I don't believe this is something it is meant to handle, since the node state stays the same while the whole underlying storage is gone.

I believe the issue is caused by a mix of several components. One is that the Rancher Local Storage Provisioner isn't able to delete the storage (even after deleting the PVC+PV) because its helper needs privileged access to do that on Talos (which uses strict pod admission policies by default) - the helper had privileged access before the Rancher Local Storage Provisioner 0.26.0 update. So I believe SeaweedFS partially recovered some data after the wipe, which was then still present during the reinstall (we let SeaweedFS run 5-6 hours before doing the reinstall). Secondly, we have an S3 storage tier that we didn't wipe after the reinstall, which might also cause issues (edit: wiping it still didn't fix the issue). Thirdly, we had only set defaultReplication 003 for the master but not for the filer (I'd really appreciate an example in the helm chart of a production-ready setup, since these things are easy to miss the first time around; the settings I mean are sketched below), so data loss should be expected.
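For reference, these are roughly the replication knobs involved, shown as weed command-line flags; how the helm chart values map onto them is an assumption on my part:

```sh
# Rough sketch of the replication settings in question (flag names as listed by
# `weed master -h` and `weed filer -h`; the helm value names that set these are assumed).

# Master: default replication used when assigning volumes (003 is what we had set).
weed master -defaultReplication=003

# Filer: replication used for files written through the filer; if left unset it
# falls back to the master setting, which is the part that is easy to miss.
weed filer -defaultReplicaPlacement=003
```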

My biggest concern, however, is that SeaweedFS didn't report any errors and seemed to work fine based on the logs; we only knew there was an issue because we saw a mix of socket not connected, permission denied and transport endpoint is not connected errors on the pods trying to use the PVCs (both cluster.check and cluster.ps don't report any oddities). This was with "fresh" PVCs that shouldn't have been affected by the old PVCs. This was also on the node that we had run the upgrade command on; I sadly can't recall if we had similar issues with the other nodes, but I'll keep it in mind when debugging. We checked the network and the number of sockets and neither was saturated, Cilium was able to connect to all the other nodes, no network policies were set up, and we turned off the firewall. We also reduced the number of workloads using SeaweedFS so that only one application, with little disk activity, was left. The SeaweedFS command to check volume integrity (volume.check.disk) simply output an empty response. Eventually we got the error lookup error nil <somestorageid> on one of the volumes, but we suspect this is irrelevant as we don't get the same error after the reinstall.
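For completeness, the checks mentioned above were all run from weed shell, roughly like this; everything came back clean or empty for us:

```sh
# Run the cluster checks non-interactively through weed shell (against the default master).
weed shell <<'EOF'
cluster.check        # verifies connectivity between masters, filers and volume servers
cluster.ps           # lists the cluster processes the master knows about
volume.check.disk    # compares replicated volumes against each other (returned nothing for us)
EOF
```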

A Kubernetes-native method of backing up and restoring the whole SeaweedFS system would most likely have gotten us out of the bad state (I believe this is on the roadmap for the operator), and restoring with a VolumeSnapshot would weed out the permission issues, since it would restore the whole PVC at the block level with the correct permissions (which the direct-copy method of e.g. Restic can't guarantee).

I'll continue to debug the issue: we have wiped the S3 storage tier, will try a different storage backend than the Local Storage Provisioner, and I plan to reset the other nodes that SeaweedFS runs on today to see if we're able to recover it. We have a unique opportunity to do some chaos testing and debug the issue, since the cluster is set up declaratively and is built to be prod ready, with logging and metrics on every system, but doesn't run any critical workloads yet. If I get enough time to figure it out, I will create a separate ticket (I feel like there are too many variables I need to weed out before I can create a decent ticket) and hopefully produce something reproducible, plus figure out how to recover from it. You can contact me by email (listed in my profile) if you want more information or to assist over a chat like Signal/Discord/Matrix etc.

edit: After a bit of thinking, I see I mixed up Restic/S3 external backup with snapshots/cloning, which was inaccurate on my part - they're complementary to each other and serve different purposes. SeaweedFS also has other backup methods that should work for Kubernetes. I've revised the original comment a bit to be more accurate.

chrislusf commented 2 months ago

First, thanks for the detailed information! There have been many issues created with just an error message, which gives no context at all.

It'll be nice to have a reproducible case. Usually for SeaweedFS cluster problems, I would recommend using docker compose. But for this CSI driver, I am not sure what the best approach is.

For a reliable backup, you can use: SeaweedFS_cluster == weed filer.sync ==> SeaweedFS_backup_cluster
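A minimal sketch of that setup, assuming the primary cluster's filer listens on filer-primary:8888 and the backup cluster's filer on filer-backup:8888 (placeholder hostnames):

```sh
# Continuously replicate data and metadata from the primary cluster's filer to a
# second SeaweedFS cluster; -a is the source filer, -b is the target filer.
weed filer.sync -a filer-primary:8888 -b filer-backup:8888
```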

Givemeurcookies commented 1 week ago

To follow up, the issue might have been caused by Local Storage Provisioner v0.29.0: it assumed that a node's name == its hostname, which is not true for all k8s distributions (such as Talos before v1.8.0); this was reverted in v0.30.0. From my understanding, this meant pods could be rescheduled onto nodes that didn't have the data/PVC available because PVC node affinity wasn't working properly; the PVC would still be "mounted", so any rescheduled workload would start but wouldn't have access to any data. I later encountered similar errors with some of our Postgres clusters using CloudNativePG (which also does software replication), causing partial PG cluster failure and requiring us to delete the PVCs to fully recover the PG cluster. The SeaweedFS and Postgres errors seemed rooted in roughly the same issue; however, we couldn't recover SeaweedFS the way we did Postgres. I don't know enough about SeaweedFS to tell, but I assume it could be due to the master/slave architecture, where some of the metadata still existed and the workload/pod associated with that metadata kept reporting fine, not understanding that the underlying PVC was "corrupt". I'm not sure why this would cause socket not connected and other more network-related errors though; that was unique to SeaweedFS, and Postgres did not report any network issues.
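As a quick way to see whether a cluster is exposed to that assumption, you can compare the node name known to the API server with the kubernetes.io/hostname label that local-PV node affinity typically matches on (just an illustrative one-liner):

```sh
# Compare each node's API name with its hostname label; on Talos < v1.8.0 these can
# differ, which is exactly the mismatch described above.
kubectl get nodes -o custom-columns='NAME:.metadata.name,HOSTNAME:.metadata.labels.kubernetes\.io/hostname'
```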

Why reinstalling the whole SeaweedFS cluster didn't work is also a bit weird. I know that Postgres has a "join" job that creates the PVC and prepares it before starting the database, which meant that deleting the PVC wouldn't recover the cluster if the join job and the main workload were scheduled on different nodes. Could SeaweedFS have something similar?

I believe you could replicate it using Talos v1.7.4 and Local Storage Provisioner v0.29.0. Talos also has a command to spin up a cluster locally, but it doesn't support the upgrade API when run in a container, which would be needed to test whether the upgrade caused the issue. Testing upgrades locally would require a more complex setup using either QEMU or VirtualBox; Talos has a doc on how to set it up locally.
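The local-cluster command I mean is roughly this; the QEMU provisioner flag is from the Talos docs, and the exact invocation may differ per version:

```sh
# Docker-based local cluster: quick to spin up, but the upgrade API is not available.
talosctl cluster create

# QEMU-based local cluster: needed to exercise `talosctl upgrade`; requires root and
# the extra host setup described in the Talos "local platforms" documentation.
sudo talosctl cluster create --provisioner qemu
```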

Either way, it seems like the issue we experienced might not have been a bug in SeaweedFS specifically, but rather a problem with the underlying system. We will most likely give SeaweedFS another try at some point during the next 6-12 months, and I will probably have more information on it then. If you're able to replicate it in the meantime, I believe it could help make the system more robust by adding some sanity checks for these types of edge cases.