vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.69k stars 1.4k forks source link

JSON patch of Restore resource modifiers failing when referencing fields from `status` #8094

Open BaudouinH opened 2 months ago

BaudouinH commented 2 months ago

What steps did you take and what happened:

I am trying to restore volumesnapshots and volumesnapshotcontents from a backup from another GKE Kubernetes cluster using Velero.

Following GKE's documentation to import a pre-existing snapshot, some fields of the resources need to be modified to be successfully imported.

For example, for the initial VolumeSnapshotContent

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: restored-snapshot-content
spec:
  source:
    volumeHandle: <pvc reference>
status:
  snapshotHandle: <gcp reference for the snapshot>
(...)

The correct VolumeSnapshotContent definition, to import it on another cluster, would be:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: <name>
spec:
  source:
    snapshotHandle: <gcp reference for the snapshot>
status:
  snapshotHandle: <gcp reference for the snapshot>
(...)

So I need to add the field spec.source.snapshotHandle, set it to the value of status.snapshotHandle and remove the field spec.source.volumeHandle.

To do so, I tried using Restore Resource Modifiers, with the following JSON patches:

version: v1
resourceModifierRules:
  - conditions:
      groupResource: volumesnapshotcontents.snapshot.storage.k8s.io
      resourceNameRegex: "^snapcontent-.*$"
    patches:
      - operation: replace
        path: "/spec/deletionPolicy"
        value: "Retain"
      - operation: copy
        from: "/status/snapshotHandle"
        path: "/spec/source/snapshotHandle"
      - operation: remove
        path: "/spec/source/volumeHandle"

Which I put in a configmap, restore-modifier as shown in the doc.

I then triggered a restore by creating the following Restore resource through the Kubernetes API:

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: <name>
  namespace: velero
spec:
  backupName: <backup name>
  includedNamespaces:
  - <namespace>
  includedResources:
  - volumesnapshotcontents.snapshot.storage.k8s.io
  includeClusterResources: true
  restoreStatus:
    includedResources:
    - volumesnapshotcontents.snapshot.storage.k8s.io
  existingResourcePolicy: update
  resourceModifier:
    kind: configmap
    name: restore-modifier

You can note that I specifically set spec.restoreStatus to have the status on my resources, as well as spec.includeClusterResources to be able to successfully restore VolumeSnapshotContent.

When doing so, I get a partial failure of the restore: The resource is created but not modified according to the patch. The following error can be seen in Velero's logs:

time="2024-08-07T13:43:49Z" level=error msg="Cluster resource restore error: error in applying patch error in applying JSON Patch copy operation does not apply: doc is missing from path: \"/status/snapshotHandle\": missing value" logSource="pkg/controller/restore_controller.go:587" restore=velero/<name of the restore>

And with the debug command, for each resource that I try to patch:

Cluster:  error in applying patch error in applying JSON Patch copy operation does not apply: doc is missing from path: "/status/snapshotHandle": missing value

Even though the resource created by Velero has the value I'm trying to reference in the patch:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: <name of the restored volumeSnapshotContent>
spec:
  source:
    volumeHandle: <pvc reference> # Patch has not been applied
(...)
status:
  snapshotHandle: <gcp volume snapshot reference>

So the patch should work, if I'm not mistaken.

I took a quick look at the code, in https://github.com/vmware-tanzu/velero/blob/main/pkg/restore/restore.go and, if I understand correctly, it seems the resourceModifiers are applied before the logic adding the status to the object being restored.

Since I'm lacking some context on the implementation of the restore feature, I don't know if this behavior is expected or not.

I'm running into the same issue, while restoring VolumeSnapshots, with the same type of JSON patchs.

What did you expect to happen:

Being able to use JSON patches on the status field in restore resource modifiers to modify a restored resource field by using values from the status field.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add:

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

reasonerjt commented 2 months ago

This happens b/c the json patches were applied after removing the status field. https://github.com/vmware-tanzu/velero/blob/cd601cafdf5d62633e252b13853b79a57138ab61/pkg/restore/restore.go#L1299

Maybe we can resetMetadata here and resetStatus after the jsonpatches are applied? I assume no RIA will try to modify the status field??

Above all, we need to avoid causing regression...

jacksgt commented 2 months ago

Hello, thank you for opening this issue - I was just debugging the same behavior for a few hours! I would definitely be in favor of allowing the use of status fields in Restore Resource Modifiers.

BaudouinH commented 2 months ago

Maybe we can resetMetadata here and resetStatus after the jsonpatches are applied? I assume no RIA will try to modify the status field?

I think this would be the most straightforward implementation, and as long as this limitation is highlighted in the documentation it is fine from my end user point of view. I can open a PR if needed.

Making it possible to modify the status as well would be more complex but may be useful for some end users?

Above all, we need to avoid causing regression...

Which type of regression do you have in mind? Since we would still remove this status field by default, just later in the process

reasonerjt commented 2 months ago

@BaudouinH

Above all, we need to avoid causing regression...

Which type of regression do you have in mind? Since we would still remove this status field by default, just later in the process

This is not likely to happen, but an extreme example is that there could be an RIA which panics if the status field is not nil.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.