vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Investigate bringing the CSI plugin to GA #4649

Closed eleanor-millman closed 8 months ago

eleanor-millman commented 2 years ago

Describe the problem/challenge you have

Currently the Velero CSI plugin is beta. This is for at least two reasons:

1) We don't want customers to accidentally rely on non-durable snapshots. Some snapshots are durable and some are not. For example, the current vSphere snapshotting mechanism takes the snapshot but then leaves it wherever it was taken. This is non-durable: if we don't move the snapshot and the user doesn't realize it stays in the same place, then a user who loses their primary storage loses their backup as well. So the Velero Plugin for vSphere has a data mover component that moves the backup to a different storage location (provided the user has set it up correctly). On the other hand, when an AWS EBS snapshot is triggered, either through the Velero CSI plugin or the Velero AWS plugin, EBS moves that snapshot to different storage behind the scenes and just returns a snapshot ID to Velero, so Velero doesn't have to do any data movement in that case. If customers use the CSI plugin on a system that doesn't move the snapshot after it is taken, they may think they have a durable backup when they actually do not. We are working towards a solution: moving the Velero Plugin for vSphere data mover into Velero proper, so that eventually the Velero CSI plugin can move the snapshot as well as trigger it if needed. Completing this work will require more design (and then, of course, actual implementation of the data mover bits).

2) There may be parts of the CSI plugin that need to be improved or fixed. Since no one on the current Velero team has worked on the CSI plugin, we don't have the background knowledge that past engineers had.

There are several reasons to take CSI to GA:

A) More and more systems will start implementing CSI snapshotting and we would like to be able to support them.

B) Allowing Velero to handle credentials to clouds like AWS and Azure is a security risk. If the CSI plugin is used instead, it uses the cloud provider's identity management solution to handle credential rotation automatically in most environments, including EKS, Tanzu, and OpenShift.

Describe the solution you'd like

An investigation of both points 1 and 2 above. For 1, validating that the data mover can indeed help make snapshots durable, and for 2, diving into the CSI plugin, testing it, looking through filed bugs, etc., to get to know the plugin better.

Suggested roadmap:

a. Do investigation (detailed in this issue)
b. Bring CSI plugin to GA for AWS and Azure (where we know snapshot movement happens under the covers)
c. Work on attaching data movement to CSI plugin for other platforms so we can GA the plugin for them

Note that this issue focuses on a and b above. The data mover work needs to be completed before we do c above and we will open a separate issue when we are ready to do that work.
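The durability distinction that separates roadmap steps b and c can be sketched as a small decision rule. This is only an illustrative model, not Velero code: the `SnapshotBackend` type, its `moves_data_off_primary` field, and the `needs_data_mover` helper are hypothetical names, and the two backend values simply restate the vSphere and EBS examples from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotBackend:
    """Hypothetical description of a snapshot backend (not a Velero type)."""
    name: str
    # True when the storage system itself copies the snapshot away from
    # primary storage, as this issue describes for AWS EBS.
    moves_data_off_primary: bool


def needs_data_mover(backend: SnapshotBackend) -> bool:
    """A snapshot left on primary storage is non-durable, so it must be moved."""
    return not backend.moves_data_off_primary


# Values restate the examples given in the issue description.
EBS = SnapshotBackend("aws-ebs", moves_data_off_primary=True)
VSPHERE = SnapshotBackend("vsphere", moves_data_off_primary=False)

if __name__ == "__main__":
    for backend in (EBS, VSPHERE):
        print(backend.name, "needs data mover:", needs_data_mover(backend))
```

Under this model, step b covers backends where `needs_data_mover` is false, and step c covers the rest once data movement is available.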

pradeepkchaturvedi commented 2 years ago

👍

eleanor-millman commented 2 years ago

@dsu-igeek noted that this will remove Velero needing volume snapshot location (block storage) creds, but Velero will still need to have backup storage location (object storage) creds.

eleanor-millman commented 2 years ago

@shawn-hurley notes that this may not automatically enable credential rotation, but at least it doesn't force a platform that already has automatic credential rotation to step away from it in order to use Velero, as Velero does in its current state.

eleanor-millman commented 2 years ago

@dsu-igeek notes that CSI snapshotting leaves records in the K8s cluster, which is messy, so we need to stop doing that. There is also an issue if we take a CSI snapshot in a namespace and then delete the namespace: when you later want to delete the backup that contains the CSI snapshots, there is no place for the snapshot records to be created, since the namespace no longer exists. Another issue to worry about.
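The deleted-namespace problem can be illustrated with a toy model. This is a sketch under a stated assumption, not the plugin's actual logic: it assumes, as the comment above implies, that deleting a backup requires re-creating the snapshot record in its original namespace. `FakeCluster`, `NamespaceNotFound`, and `delete_backup_snapshot` are hypothetical names invented for the illustration.

```python
class NamespaceNotFound(Exception):
    """Raised when a namespaced record targets a namespace that is gone."""


class FakeCluster:
    """Toy stand-in for a cluster: maps namespace name to snapshot records."""

    def __init__(self):
        self.namespaces = {}

    def create_namespace(self, ns):
        self.namespaces.setdefault(ns, set())

    def delete_namespace(self, ns):
        # Deleting a namespace removes every record stored inside it.
        self.namespaces.pop(ns, None)

    def create_snapshot_record(self, ns, name):
        if ns not in self.namespaces:
            raise NamespaceNotFound(ns)
        self.namespaces[ns].add(name)


def delete_backup_snapshot(cluster, ns, name):
    """Assumed flow: re-create the record, then delete through it.

    Returns True if the snapshot could be cleaned up, False otherwise.
    """
    try:
        cluster.create_snapshot_record(ns, name)
    except NamespaceNotFound:
        return False  # nowhere to put the record, so cleanup cannot proceed
    cluster.namespaces[ns].discard(name)
    return True
```

In this model, cleanup succeeds while the namespace exists and fails once it has been deleted, which is the gap the comment describes.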

shawn-hurley commented 2 years ago

I think it makes a lot of sense to move this to GA!

eleanor-millman commented 2 years ago

A follow-up regarding which creds this removes the need for (from an AWS expert):

We should ideally remove both the object store and block storage creds, but removing EBS snapshot creds is a good first step, as it's a potentially dangerous permission. Object storage policies can be tailored much more easily and blast radius limited. For snapshots, it's more "all-or-nothing".

reasonerjt commented 2 years ago

Because the work for data movement is tracked in a separate issue, #3229, the focus of this issue should be to understand the gap to bringing the CSI plugin to GA in environments that DO NOT require data movement.

@eleanor-millman does that sound correct to you?

gman0 commented 2 years ago

I believe https://github.com/vmware-tanzu/velero/issues/3544 to be relevant for this as well.

eleanor-millman commented 2 years ago

@reasonerjt Absolutely correct. I will add this in the description. Thanks for calling this out!

eleanor-millman commented 2 years ago

@gman0 Thanks for calling that issue out! In general, one of my asks for this investigation is to go through all open CSI plugin issues (both feature requests and bugs) and relate them as needed. I appreciate you highlighting that issue in this one.

reasonerjt commented 2 years ago

@gman0 Thanks for chiming in. Since issue #3544 requires data movement (bare-metal cluster to S3) it is out of the scope of this issue.

blackpiglet commented 2 years ago

I have put the potential issues to resolve into 1.9, targeting GA of the CSI plugin for public cloud providers, e.g. EKS, AKS and GKE.

blackpiglet commented 8 months ago

Closing this issue because the data mover function is ready.