vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Post hook command is executed before the Kopia snapshot is completed #8159

Open QuentinBtd opened 2 months ago

QuentinBtd commented 2 months ago

What steps did you take and what happened:

I am using Velero to back up a PostgreSQL dump: a pre-hook command creates the dump so that Kopia can back it up, and a post-hook command then deletes the file. However, I noticed that the post-hook was executed before Kopia had the chance to back up the dump.

I conducted several tests:

What did you expect to happen:

The dump file should be in the Kopia snapshot.

Anything else you would like to add:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: my-app-pg-app
  namespace: velero
spec:
  schedule: "0 4 * * *"
  useOwnerReferencesInBackup: false
  template:
    includedNamespaces:
      - my-app
    labelSelector:
      matchLabels:
        cnpg.io/cluster: pg-app-16
    metadata: {}
    ttl: 720h
    hooks:
      resources:
        - name: dump
          pre:
            - exec:
                container: postgres
                command:
                  - /bin/bash
                  - -c
                  - |
                    set -e
                    echo "/pgdata/*" > /var/lib/postgresql/data/.kopiaignore &&
                    pg_dump -d app -f /var/lib/postgresql/data/dump_app-$(hostname)-$(date '+%Y-%m-%d-%H-%M-%S')
                onError: Fail
                timeout: 5m
          post:
            - exec:
                container: postgres
                command:
                  - /bin/bash
                  - -c
                  - |
                    rm -f /var/lib/postgresql/data/dump_app-$(hostname)-*
                onError: Fail

Environment:

Lyndon-Li commented 2 months ago

Are you using data mover backup?

QuentinBtd commented 2 months ago

> Are you using data mover backup?

Nope.

reasonerjt commented 1 month ago

Please double check if the issue still exists after the introduction of itemblock in v1.15

thommeo commented 1 month ago

Seeing this issue as well, same use case

varac commented 1 month ago

Same here using Velero 1.14.1 (with restic instead of Kopia). The backup pre-hook is configured to scale down the clickhouse statefulset, and the post-hook is executed right after the pre-hook, while the clickhouse backup job is still running. This seems to be a regression in 1.14, because it has been failing consistently since the upgrade from Velero 1.13 to 1.14.

sseago commented 1 month ago

@reasonerjt

> Please double check if the issue still exists after the introduction of itemblock in v1.15

I don't think that will change anything. Especially with an itemblock that does not have multiple pods, the post hook will run as soon as the pod backup is completed (which happens after the pvc backup is completed). If this was datamover, I'd understand to a degree -- we may have a bug similar to the one fixed on the restore side where we moved post hooks to happen in finalize when there were async actions involved, since the snapshot (or data movement) might complete after the backup/restore of kube metadata is done. But for fs-backup, I thought we blocked on completion of it before declaring the pod backup/restore done, so this shouldn't be happening here. I guess we'll need to look at what changed between 1.13 and 1.14 in terms of PVB processing.

ywk253100 commented 1 month ago

This issue is caused by https://github.com/vmware-tanzu/velero/pull/7571. Before that change, pods were backed up in sequence: the backup process did not move on to the next pod until the previous one was fully processed (all of its PVBs were processed).

PR 7571 made the processing of pods parallel, so the hook can now be executed before the PVBs are handled.

sseago commented 1 month ago

@ywk253100 On the restore side, we made changes to make sure hooks happened after volume restore was done which involved some of the processing moving to the finalizing phase. I wonder whether we need similar changes on the backup side.

reasonerjt commented 2 weeks ago

@ywk253100 Agreed that the behavior of post-hooks is inconsistent. Please open new issues to track the unclear cases, and double-check for the CSI scenarios whether the order of snapshot execution is as expected. This issue is specifically about the fs-backup scenario, and it should be fixed.