vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Using CMEK on GKE, Velero does not seem to encrypt the restored volumes with the correct key - same for other cloud providers #6982

Open smoms opened 10 months ago

smoms commented 10 months ago

I am running Velero on GKE.

I have two clusters and I am using a separate StorageClass on each one. This is the one from the first cluster (redacted):

  name: csi-gce-pd-cmek
  resourceVersion: "39305929"
  uid: ba3246c2-ead1-4acf-94a5-560b76167b80
parameters:
  disk-encryption-kms-key: projects/myprj/locations/europe-west1/keyRings/keyring-test/cryptoKeys/key-keyring-test
  type: pd-standard
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

and this is the one on the second cluster:

  name: csi-gce-pd-cmek-secondary
  resourceVersion: "2157798"
  uid: 17bb8a72-fe80-44c9-b2c3-f1002db72270
parameters:
  disk-encryption-kms-key: projects/myprj/locations/europe-west1/keyRings/keyring-secondary-test/cryptoKeys/key-secondary-test
  type: pd-standard
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

I have created two GCP CMEK keys, as you can see above, and passed them to the StorageClasses (parameter: disk-encryption-kms-key).

I run a simple workload (nginx) on the first cluster using a PVC. The PVC is bound to a PV with the first StorageClass. I can write data to the PV.

Then I take a Velero backup on the primary cluster and restore it on the secondary one. Before the restore on the second cluster I also apply this ConfigMap to map the two StorageClasses:

apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
data:
  csi-gce-pd-cmek: csi-gce-pd-cmek-secondary

At the end of the restore process I can correctly see data restored on my secondary cluster.

My problem is the GCP CMEK though!

When I look at the disk on the node where nginx runs on the first cluster, I see that the disk is encrypted with a customer-managed key (screenshot).

Whereas when I look at the disk on the node where nginx runs on the second cluster, I see that the disk is encrypted with a Google-managed key (not CMEK!) (screenshot).

I am surprised, since the PVC, PV, and StorageClasses all seem correct. Their YAML and status all look fine. The data is available, but I suspect something is wrong because GCP does not show the second cluster's node disk as encrypted with the CMEK (key-secondary-test).

Another parameter supporting my analysis is the following:

Protected Resources is 0 (screenshot),

whereas on the first cluster this metric is not 0 (screenshot).

Is this a Velero bug, or why is this happening?

Environment:

smoms commented 10 months ago

any update on this pls?

reasonerjt commented 10 months ago

Since the YAML of the resource after restoration looks correct, I think Velero is handling it correctly. We'll need to look into this to clarify whether there's any setting at the GCP level that is confusing it.

smoms commented 10 months ago

In the meantime, is there a way I can check whether the data is actually encrypted with the CMEK?

blackpiglet commented 10 months ago

I didn't find a way to directly verify whether the data on the disk is encrypted with the CMEK. Maybe you can try to delete the CMEK first, then try to mount the disk; if that fails, it means the data is protected by the CMEK.
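
For reference, here is a minimal sketch (not from this thread; project, zone, and disk name are placeholders) that reads which encryption key GCP reports for a given disk, using the same google.golang.org/api/compute/v1 package the GCP plugin relies on. It only shows the key the disk resource reports, i.e. the same information the Cloud Console shows, not a cryptographic proof that the data is protected by that key:

package main

import (
    "context"
    "fmt"
    "log"

    compute "google.golang.org/api/compute/v1"
)

func main() {
    ctx := context.Background()
    // Uses Application Default Credentials; the caller needs the compute.disks.get permission.
    svc, err := compute.NewService(ctx)
    if err != nil {
        log.Fatal(err)
    }
    // Placeholder project, zone, and disk name.
    disk, err := svc.Disks.Get("myprj", "europe-west1-b", "restored-disk").Context(ctx).Do()
    if err != nil {
        log.Fatal(err)
    }
    if disk.DiskEncryptionKey != nil && disk.DiskEncryptionKey.KmsKeyName != "" {
        fmt.Println("disk reports CMEK:", disk.DiskEncryptionKey.KmsKeyName)
    } else {
        fmt.Println("disk reports no customer-managed key (Google-managed encryption)")
    }
}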

I also reproduced this issue in my environment, but I didn't find an easy way to fix it.

First, I think this is GCP's default behavior: if a new encryption key is not specified in the API call, creating a disk from an encrypted snapshot will not use any new CMEK. https://cloud.google.com/compute/docs/disks/customer-managed-encryption#remove_encryption

Second, if we want to fix this issue by specifying the CMEK while creating the volume from the snapshot, we need to modify the VolumeSnapshotter interface: an encryptionKey parameter would have to be added to the CreateVolumeFromSnapshot input parameter list. This may be a common issue for all Velero cloud provider plugins.

type VolumeSnapshotter interface {
    // Init prepares the VolumeSnapshotter for usage using the provided map of
    // configuration key-value pairs. It returns an error if the VolumeSnapshotter
    // cannot be initialized from the provided config.
    Init(config map[string]string) error

    // CreateVolumeFromSnapshot creates a new volume in the specified
    // availability zone, initialized from the provided snapshot,
    // and with the specified type and IOPS (if using provisioned IOPS).
    CreateVolumeFromSnapshot(snapshotID, volumeType, volumeAZ string, iops *int64) (volumeID string, err error)

    // GetVolumeID returns the cloud provider specific identifier for the PersistentVolume.
    GetVolumeID(pv runtime.Unstructured) (string, error)

    // SetVolumeID sets the cloud provider specific identifier for the PersistentVolume.
    SetVolumeID(pv runtime.Unstructured, volumeID string) (runtime.Unstructured, error)

    // GetVolumeInfo returns the type and IOPS (if using provisioned IOPS) for
    // the specified volume in the given availability zone.
    GetVolumeInfo(volumeID, volumeAZ string) (string, *int64, error)

    // CreateSnapshot creates a snapshot of the specified volume, and applies the provided
    // set of tags to the snapshot.
    CreateSnapshot(volumeID, volumeAZ string, tags map[string]string) (snapshotID string, err error)

    // DeleteSnapshot deletes the specified volume snapshot.
    DeleteSnapshot(snapshotID string) error
}
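
For illustration only, a hypothetical sketch of the kind of interface change being discussed; the V2 shape, method name, and parameter below are assumptions for discussion, not an existing or agreed Velero API:

// Hypothetical sketch, not an existing Velero interface: one way to pass a
// target encryption key to the plugin without changing the v1 methods.
type VolumeSnapshotterV2 interface {
    VolumeSnapshotter

    // CreateVolumeFromSnapshotWithKey works like CreateVolumeFromSnapshot but also
    // receives the target encryption key (e.g. a KMS key resource name) that the
    // plugin should apply to the newly created volume.
    CreateVolumeFromSnapshotWithKey(snapshotID, volumeType, volumeAZ string, iops *int64, encryptionKey string) (volumeID string, err error)
}
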
smoms commented 10 months ago

Hi @blackpiglet, thanks for your analysis. I follow your logic. What surprises me, though, is that Velero is supposed to create a volume from a snapshot. That snapshot (I assume) is no longer encrypted with the primary disk's CMEK, because the data has been read/extracted and stored somewhere else (e.g. in GCS). At that point, when I restore the data I use a new StorageClass with another CMEK, so all in all the restored data/PV - which appears to correctly use the new StorageClass - should be created as a newly encrypted disk. This is how K8s and GKE should work: the key should be present and passed to the API on the secondary cluster during a restore as well. Why this does not work is not clear to me.

blackpiglet commented 10 months ago

@smoms https://cloud.google.com/docs/security/encryption/default-encryption#what_is_customer_data I think the snapshot encryption is reasonable: if the data on the disk is encrypted with the CMEK but the snapshot is not, and the snapshot is shared, then others can read the data by creating a disk from the snapshot and mounting it somewhere.

Velero's methods for backing up volume data fall into two categories: snapshots (Velero native snapshots taken through the cloud provider plugin, or CSI snapshots) and file system backup (copying the files in the volume to the backup repository).

I forgot to ask how you back up volume data. If nothing is specified, the default is a Velero native snapshot, which uses the cloud provider plugin's VolumeSnapshotter action to take snapshots by calling the cloud provider API.

smoms commented 10 months ago

@blackpiglet I am using velero-plugin-for-gcp; I am not using file system backup. About the snapshot process, I do not want to insist, but I think that once taken, a snapshot is extracted from the disk, i.e. read by Velero, and therefore it must be decrypted by the driver. After all, it is data read from the disk, and like any other read operation it gets transparently decrypted by the driver (using the CMEK in this case). Of course Velero needs permissions to access the disk. If this were not the case, I would not be able to see the same data correctly restored on the other cluster. I am also sure that the data on the first cluster is encrypted, because I see the disk encrypted with the CMEK there.

blackpiglet commented 10 months ago

@smoms I want to clarify that whether a snapshot created from a CMEK-protected disk is encrypted at rest is not decided by the Velero code.

It's GCP's decision. Please check the following GCP document: https://cloud.google.com/compute/docs/disks/customer-managed-encryption#create_snapshot

About your concern that data cannot be restored correctly in other clusters, I think that will not happen. I don't know how GCP makes it work in detail, but I think GCP should decrypt the snapshot with the CMEK first, then do the rest of the job: https://cloud.google.com/compute/docs/disks/customer-managed-encryption#remove_encryption

smoms commented 10 months ago

@blackpiglet OK, then assuming GCP creates a CMEK-encrypted snapshot and the restore process applies default encryption, I still think there is a bug somewhere, because GKE on the other cluster shows that the restored PV uses the new StorageClass (linked to another CMEK), yet GCP shows that disk using a Google-managed key (which I assume is the same as default encryption). You would agree with me that, at a minimum, this is wrong or confusing information.

smoms commented 10 months ago

Just one more piece of information: if I create a pod directly (not with Velero) on the secondary cluster, using the same StorageClass that Velero is supposed to use in the restore process, I indeed get a CMEK-encrypted disk as expected. So it seems to me that Velero - with the GCP plugin - does not work as expected when restoring with a different CMEK key.

blackpiglet commented 10 months ago

@smoms Yes, I agree that the restored volumes not using the CMEK specified by the StorageClass is unexpected. Please see my previous comment: adding new parameters to the Velero plugin interface will make the new Velero server version incompatible with older versions of the Velero plugins. I think we need the other maintainers' opinions on how to fix it, and on whether the breaking change is acceptable.

Second, if we want to fix this issue by specifying the CMEK while creating the volume from the snapshot, we need to modify the VolumeSnapshotter interface: an encryptionKey parameter would have to be added to the CreateVolumeFromSnapshot input parameter list. This may be a common issue for all Velero cloud provider plugins.

smoms commented 10 months ago

I think we can close it, hoping for a fix in Velero then.

yanggangtony commented 10 months ago

@blackpiglet

I think we need the other maintainers' opinions on how to fix it, and on whether the breaking change is acceptable.

Is it possible to add a v2 of CreateVolumeFromSnapshot, so that it does not affect users?

blackpiglet commented 10 months ago

Yes, I think that is the way to fix this issue, but introducing a new plugin version is not a small change, so we may need to do more than just add a parameter.

smoms commented 10 months ago

What could be a workaround in the meantime? It does not seem so simple.

blackpiglet commented 10 months ago

@sseago @reasonerjt @Lyndon-Li Could you share some insight on this issue?

The scenario: a volume protected by a CMEK is backed up, then restored with a StorageClass that specifies a different CMEK. Unexpectedly, the restored volume is not protected by the CMEK specified by the restore StorageClass; it is protected by a Google-managed key. The reason is analyzed here: https://github.com/vmware-tanzu/velero/issues/6982#issuecomment-1782578207.

Looks like this is a common issue for all Velero cloud-provider plugins.

Is it possible to have a temporary solution without modifying the VolumeSnapshotter plugin interface?

smoms commented 9 months ago

Hi again, is there any ETA I can rely on?

blackpiglet commented 9 months ago

We will discuss it and reply to you.

blackpiglet commented 9 months ago

@smoms After discussion, there is no workaround for it yet. The plan is to add a generic way to configure the VolumeSnapshotter behavior in the v1.14 release.

smoms commented 9 months ago

@blackpiglet Thanks. So I assume it will then be around Q1 or Q2 2024.

reasonerjt commented 6 months ago

When Velero restores a PV from a native snapshot, it is done via static provisioning, and the settings in the StorageClass won't take effect.

I think the gap is how to allow the user to replace the encryption key during restore.

This is a valid requirement, but we want to keep it in the backlog and handle it when there is more demand.

blackpiglet commented 6 months ago

I checked AWS, GCP, and Azure. Mapping the customer-managed key of the snapshot's parent volume to the target customer-managed key should work.

The GCP API's snapshot resource exposes the source volume's encryption key and the snapshot's encryption key. We could create a ConfigMap to map the source encryption key to the target encryption key.

// Snapshot: Represents a Persistent Disk Snapshot resource. You can use
// snapshots to back up data on a regular interval. For more
// information, read Creating persistent disk snapshots.
type Snapshot struct {
    // SourceSnapshotEncryptionKey: The customer-supplied encryption key of
    // the source snapshot.
    SourceSnapshotEncryptionKey *CustomerEncryptionKey `json:"sourceSnapshotEncryptionKey,omitempty"`

    // SourceImageEncryptionKey: The customer-supplied encryption key of the
    // source image. Required if the source image is protected by a
    // customer-supplied encryption key. InstanceTemplate and
    // InstancePropertiesPatch do not store customer-supplied encryption
    // keys, so you cannot create disks for instances in a managed instance
    // group if the source images are encrypted with your own keys.
    SourceImageEncryptionKey *CustomerEncryptionKey `json:"sourceImageEncryptionKey,omitempty"`
}

AWS basically works the same as GCP. Azure works a bit differently, but the workflow should also apply.
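
To make the GCP side of this proposal concrete, here is a minimal sketch (an assumption for discussion, not current plugin code) of creating a disk from a snapshot while explicitly passing a target CMEK, which is roughly what such a key mapping would feed into. Project, zone, resource names, and the key path are placeholders, and svc is a *compute.Service client like the one in the earlier sketch:

// Sketch only: create a disk from a snapshot and explicitly request the target
// CMEK. Without DiskEncryptionKey, GCP falls back to a Google-managed key,
// which is the behavior reported in this issue.
func createDiskWithTargetKey(ctx context.Context, svc *compute.Service) error {
    disk := &compute.Disk{
        Name:           "restored-disk",                                      // placeholder
        SourceSnapshot: "projects/myprj/global/snapshots/velero-snapshot-id", // placeholder
        Type:           "projects/myprj/zones/europe-west1-b/diskTypes/pd-standard",
        DiskEncryptionKey: &compute.CustomerEncryptionKey{
            // Target key, e.g. looked up from a source-to-target key mapping.
            KmsKeyName: "projects/myprj/locations/europe-west1/keyRings/keyring-secondary-test/cryptoKeys/key-secondary-test",
        },
    }
    _, err := svc.Disks.Insert("myprj", "europe-west1-b", disk).Context(ctx).Do()
    return err
}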

reasonerjt commented 5 months ago

We may use a similar ConfigMap approach or introduce a "restore policy" to make this modification. Leaving it in the backlog until we see more use cases.

pschmidbauer commented 4 months ago

Hi, we also have this issue. All our PVCs use a CMEK in Google Cloud and hence we cannot restore anything. Currently we are planning a manual workaround: we will probably use gcloud commands to create a disk from the snapshot and then bring it into our Kubernetes cluster manually with kubectl commands. This could affect our RTO, though. We would highly appreciate a fix. Are there any plans for that already? Thank you!

blackpiglet commented 4 months ago

There is a proposed solution: https://github.com/vmware-tanzu/velero/issues/6982#issuecomment-1925642580 However, the implementation is not planned yet. One of the reasons is that this issue only affects Velero native snapshots; switching to the CSI plugin supports this scenario.

MannyA2k commented 4 months ago

Hi, we also have this issue. I was really surprised to find it, given the popularity of Velero.
Waiting for more complaints to come in before fixing this doesn't feel right, as it is something you're likely not to even notice until you need it, as in our case: we realised weeks later that we had not been using our CMEKs to encrypt our PVs anymore, after having to rebuild from backups due to a big outage.

It may seem OK because we can fall back to Google's default encryption keys, but this is a serious matter: we are now inadvertently failing to meet our data security standards, compliance requirements, and key management policies, because our data is protected by a key not managed by us - a cloud vendor instead has access to and control over our encryption keys and uses a KEK shared with other cloud customers, which is the case with a Google-managed key.

Also, as stated previously, any workaround will cause a massive delay to our RTOs; I've had to use GCP snapshot restores, passing in the CMEK I need.
It would be great if this were given the priority it deserves. Thanks

DasLeo commented 4 months ago

As we're currently migrating more and more environments into the new GCP Sovereign Cloud, we heavily rely on external CMEK. This is why we'll evaluate alternatives to Velero soon to reduce our RTO.

smoms commented 4 months ago

If it helps, there is a GCP-native backup solution for K8s.

pschmidbauer commented 4 months ago

@blackpiglet Is there any way we can upvote this topic and get your proposed solution implemented? Velero is basically not usable for us, as Manny properly described.

DasLeo commented 4 months ago

Isn't Velero only creating a PVC with the reference to the VolumeSnapshot?

If that's the case and the correct StorageClass (e.g. the default one) is used (with CMEK enabled and set), I would not expect any issues.

blackpiglet commented 4 months ago

@DasLeo Exactly. For the CSI case, there should be no problem. The reported issue is related to the Velero GCP plugin scenario: the GCP plugin calls the GCP API to create the volume, which cannot use the StorageClass configuration.

@pschmidbauer I labeled this issue 1.15-candidate. The Velero team will go through issues with this label during the next release planning.

DasLeo commented 4 months ago

@DasLeo Exactly. For the CSI case, there should be no problem. The reported issue is related to the Velero GCP plugin scenario: the GCP plugin calls the GCP API to create the volume, which cannot use the StorageClass configuration.

Thanks @blackpiglet for the clarification. I already found the relevant part in velero-plugin-for-gcp that lacks the CMEK definition.

Is this GitHub issue only about CSI? If so, I will create a new issue in velero-plugin-for-gcp for our problem, since it may not be related to this one.

blackpiglet commented 4 months ago

No need to create a new issue. All Velero-related project issues are tracked in this repository.

reasonerjt commented 2 months ago

@blackpiglet Let's see if we have a way to implement this without having to update the snapshotter interface. For example, is there a way to get the key information via Google's API, or we could ask the user to pass the key information via the BSL/VSL or a ConfigMap.

blackpiglet commented 2 months ago

We can read the snapshot encryption key from the GCP snapshot response body; creating the disk with that snapshot encryption key should resolve this issue.

https://github.com/googleapis/google-api-go-client/blob/main/compute/v1/compute-gen.go

// Snapshot: Represents a Persistent Disk Snapshot resource. You can use
// snapshots to back up data on a regular interval. For more
// information, read Creating persistent disk snapshots.
type Snapshot struct {
    ......
    // SnapshotEncryptionKey: Encrypts the snapshot using a
    // customer-supplied encryption key. After you encrypt a snapshot using
    // a customer-supplied key, you must provide the same key if you use the
    // snapshot later. For example, you must provide the encryption key when
    // you create a disk from the encrypted snapshot in a future request.
    // Customer-supplied encryption keys do not protect access to metadata
    // of the snapshot. If you do not provide an encryption key when
    // creating the snapshot, then the snapshot will be encrypted using an
    // automatically generated key and you do not need to provide a key to
    // use the snapshot later.
    SnapshotEncryptionKey *CustomerEncryptionKey `json:"snapshotEncryptionKey,omitempty"`
    ......
}
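
A minimal sketch of reading the SnapshotEncryptionKey field mentioned above (an illustration only; it assumes the same svc client and ctx as the earlier sketches, and the project and snapshot name are placeholders):

// Sketch only: fetch the snapshot and print the encryption key it reports.
func printSnapshotKey(ctx context.Context, svc *compute.Service) error {
    snap, err := svc.Snapshots.Get("myprj", "velero-snapshot-id").Context(ctx).Do()
    if err != nil {
        return err
    }
    if snap.SnapshotEncryptionKey != nil && snap.SnapshotEncryptionKey.KmsKeyName != "" {
        fmt.Println("snapshot encryption key:", snap.SnapshotEncryptionKey.KmsKeyName)
    } else {
        fmt.Println("snapshot reports no customer-managed encryption key")
    }
    return nil
}
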
blackpiglet commented 1 week ago

During implementation, I found that this comment is not accurate, because the snapshot's encryption key is the original disk's encryption key. There is no way to derive the key required by the second StorageClass from it.

There are two ways to fix this issue: modify the VolumeSnapshotter interface so the target encryption key can be passed to CreateVolumeFromSnapshot, or ask the user to provide a ConfigMap that maps the source encryption key to the target one.

Neither is a good solution. Modifying the interface means older plugin versions cannot work with the new version of Velero, and the ConfigMap is not user-friendly enough.

Another thing worth considering is that native snapshots do not use a StorageClass to provision volumes, so there is a gap between the StorageClass mapping and the VolumeSnapshotter plugins. Leaving things as they are is another option, because the user cannot expect the StorageClass mapping to work here.

We can read the snapshot encryption key from the GCP snapshot response body; creating the disk with that snapshot encryption key should resolve this issue.

https://github.com/googleapis/google-api-go-client/blob/main/compute/v1/compute-gen.go

// Snapshot: Represents a Persistent Disk Snapshot resource. You can use
// snapshots to back up data on a regular interval. For more
// information, read Creating persistent disk snapshots.
type Snapshot struct {
    ......
    // SnapshotEncryptionKey: Encrypts the snapshot using a
    // customer-supplied encryption key. After you encrypt a snapshot using
    // a customer-supplied key, you must provide the same key if you use the
    // snapshot later. For example, you must provide the encryption key when
    // you create a disk from the encrypted snapshot in a future request.
    // Customer-supplied encryption keys do not protect access to metadata
    // of the snapshot. If you do not provide an encryption key when
    // creating the snapshot, then the snapshot will be encrypted using an
    // automatically generated key and you do not need to provide a key to
    // use the snapshot later.
    SnapshotEncryptionKey *CustomerEncryptionKey `json:"snapshotEncryptionKey,omitempty"`
    ......
}