vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Presuming User Issue - Can't Debug Data Mover Not Working #7734

Closed: B1ue-W01f closed this issue 2 months ago

B1ue-W01f commented 2 months ago

What steps did you take and what happened:

Configured Rook snapshotting. Tested and confirmed that a snapshot was created.

Configured Velero for Kopia and Data Mover backups. Tested, and the backup completes but doesn't appear to carry out the data move.

velero backup create heimdall-backup --from-schedule heimdall-daily-aurora-backup
velero backup describe heimdall-backup

The output is below:

Name:         heimdall-backup
Namespace:    velero
Labels:       velero.io/schedule-name=heimdall-daily-aurora-backup
              velero.io/storage-location=default
Annotations:  argocd.argoproj.io/tracking-id=heimdall:velero.io/Schedule:velero/heimdall-daily-aurora-backup
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"velero.io/v1","kind":"Schedule","metadata":{"annotations":{"argocd.argoproj.io/tracking-id":"heimdall:velero.io/Schedule:velero/heimdall-daily-aurora-backup"},"name":"heimdall-daily-aurora-backup","namespace":"velero"},"spec":{"schedule":"0 2 * * *","template":{"csiSnapshotTimeout":"10m","includeClusterResources":false,"includedNamespaces":["heimdall"],"includedResources":["*"],"snapshotMoveData":true,"snapshotVolumes":true,"storageLocation":"default","ttl":"170h"},"useOwnerReferencesInBackup":false}}

  velero.io/resource-timeout=10m0s
  velero.io/source-cluster-k8s-gitversion=v1.25.5
  velero.io/source-cluster-k8s-major-version=1
  velero.io/source-cluster-k8s-minor-version=25

Phase:  Completed

Namespaces:
  Included:  heimdall
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  excluded

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  true
Snapshot Move Data:          true
Data Mover:                  velero

TTL:  170h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-04-24 12:03:50 +0100 BST
Completed:  2024-04-24 12:04:02 +0100 BST

Expiration:  2024-05-01 14:03:50 +0100 BST

Total items to be backed up:  51
Items backed up:              51

Backup Volumes:
  <error getting backup volume info: gzip: invalid header>

HooksAttempted:  0
HooksFailed:     0

I'm not sure, but this portion of the deployment logs may indicate the issue, where it claims to be unable to find the PVC; however, when restoring, it does actually restore the PVC. It's unclear to me whether a data move has occurred at all, and no DataUploads appear to be created:

time="2024-04-24T11:04:02Z" level=info msg="hookTracker: map[], hookAttempted: 0, hookFailed: 0" backup=velero/heimdall-backup logSource="pkg/backup/backup.go:436"
time="2024-04-24T11:04:02Z" level=info msg="Summary for skipped PVs: []" backup=velero/heimdall-backup logSource="pkg/backup/backup.go:445"
time="2024-04-24T11:04:02Z" level=info msg="Backed up a total of 51 items" backup=velero/heimdall-backup logSource="pkg/backup/backup.go:449" progress=
time="2024-04-24T11:04:02Z" level=info msg="backup SnapshotMoveData is set to true, skip VolumeSnapshot resource persistence." backup=velero/heimdall-backup logSource="pkg/backup/snapshots.go:46"
time="2024-04-24T11:04:02Z" level=info msg="Setting up backup store to persist the backup" backup=velero/heimdall-backup logSource="pkg/controller/backup_controller.go:729"
time="2024-04-24T11:04:02Z" level=warning msg="Cannot find info for PVC heimdall/ext-heimdall-config" backup=velero/heimdall-backup logSource="internal/volume/volumes_information.go:450"
time="2024-04-24T11:04:02Z" level=warning msg="Cannot find info for PVC heimdall/int-heimdall-config" backup=velero/heimdall-backup logSource="internal/volume/volumes_information.go:450"
time="2024-04-24T11:04:02Z" level=info msg="Initial backup processing complete, moving to Finalizing" backup=velero/heimdall-backup logSource="pkg/controller/backup_controller.go:743"
time="2024-04-24T11:04:02Z" level=info msg="Updating backup's final status" backuprequest=velero/heimdall-backup controller=backup logSource="pkg/controller/backup_controller.go:307"
time="2024-04-24T11:04:57Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:141"
time="2024-04-24T11:04:57Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"
time="2024-04-24T11:06:57Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:141"

What did you expect to happen:

DataUploads should be created, with evidence in the logs of the move occurring.
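As a sanity check, this is roughly how I'd expect to confirm whether any data movement happened (a sketch only; it assumes the DataUpload objects live in the velero namespace and that this Velero version's describe command supports --details):

kubectl -n velero get datauploads
velero backup describe heimdall-backup --details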

The following information will help us better understand what's going on:

bundle-2024-04-24-12-13-46.tar.gz

Anything else you would like to add:

Rook storage config

---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
  # annotations:
  #   storageclass.kubernetes.io/is-default-class: "true"
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbdplugin-snapclass
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
  velero.io/csi-volumesnapshot-class: "true"
deletionPolicy: Retain

Velero Helm Values

---
resources:
  requests:
    cpu: 500m
    memory: 128Mi
  limits:
    memory: 512Mi
initContainers:
  - name: velero-plugin-for-csi
    image: velero/velero-plugin-for-csi:v0.7.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.9.2
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
metrics:
  enabled: true
  scrapeInterval: 30s
  scrapeTimeout: 10s
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8085"
    prometheus.io/path: "/metrics"
  serviceMonitor:
    enabled: true
    additionalLabels: {}
    # ServiceMonitor namespace. Default to Velero namespace.
    namespace: velero
  nodeAgentPodMonitor:
    autodetect: true
    enabled: false
    annotations: {}
    additionalLabels: {}

upgradeCRDs: true
cleanUpCRDs: false

configuration:
  backupStorageLocation:
    # name is the name of the backup storage location where backups should be stored. If a name is not provided,
    # a backup storage location will be created with the name "default". Optional.
    - name: default
      provider: velero.io/aws
      bucket: velero
      credential:
        name: aurora-cloud-provider-credentials
        key: cloud
      backupSyncPeriod: 2m0s
      validationFrequency: 2m0s
      config:
        region: aurora-minio
        s3ForcePathStyle: "true"
        s3Url: "http://IPADDRESS:9000"
        publicUrl: "https://URL"
        profile: "aurora"
        # insecureSkipTLSVerify: "true"
  volumeSnapshotLocation:
    - name: rook-ceph
      provider: csi
  uploaderType: kopia
  defaultBackupStorageLocation: default
  defaultVolumeSnapshotLocations: csi:rook-ceph
  defaultSnapshotMoveData: true
  features: EnableCSI

rbac:
  create: true
  clusterAdministrator: true
  clusterAdministratorName: cluster-admin

serviceAccount:
  server:
    create: true

credentials:
  useSecret: true
  existingSecret: aurora-cloud-provider-credentials

# Whether to create backupstoragelocation crd, if false => do not create a default backup location
backupsEnabled: true
# Whether to create volumesnapshotlocation crd, if false => disable snapshot feature
snapshotsEnabled: true

deployNodeAgent: true
nodeAgent:
  podVolumePath: /var/lib/kubelet/pods
  privileged: true
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      memory: 1024Mi

Velero Backup Schedule used as template

---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: heimdall-daily-aurora-backup
  namespace: velero
spec:
  schedule: 0 2 * * *
  useOwnerReferencesInBackup: false
  template:
    csiSnapshotTimeout: 10m
    includedNamespaces:
    - 'heimdall'
    includedResources:
    - '*'
    excludedResources:
    - ingress
    - replicaset
    includeClusterResources: false
    snapshotVolumes: true
    storageLocation: default
    # The list of locations in which to store volume snapshots created for backups under this schedule.
    # volumeSnapshotLocations:
    # - 'csi:rook-ceph'
    ttl: 170h
    snapshotMoveData: true

Environment:

Velero client: v1.13.1 (Git commit: ea5a89f83b89b2cb7a27f54148683c1ee8d57a37)
Velero server: v1.13.0 (features: EnableCSI)
kubectl client: v1.29.2 (Kustomize v5.0.4-0.20230601165947-6ce0bf390ce3)
Kubernetes server: v1.25.5
Self-hosted on metal
Rook-Ceph v1.14.2

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

allenxu404 commented 2 months ago

It seems that the dataupload objects were created and deleted along with the backup when it reached its expiration:

time="2024-04-24T11:02:19Z" level=info msg="Removing local datauploads" backup=heimdall-backup controller=backup-deletion deletebackuprequest=velero/heimdall-backup-rq7qn logSource="pkg/controller/backup_deletion_controller.go:327"
B1ue-W01f commented 2 months ago

It seems that the dataupload objects were created and deleted along with the backup when it reached its expiration:

time="2024-04-24T11:02:19Z" level=info msg="Removing local datauploads" backup=heimdall-backup controller=backup-deletion deletebackuprequest=velero/heimdall-backup-rq7qn logSource="pkg/controller/backup_deletion_controller.go:327"

Interesting, thanks! I guess my watch command on them had the wrong label then.

Do you by any chance know what the warning about not finding PVC info might indicate?
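For reference, the watch I probably should have been running (a sketch, assuming DataUpload objects carry the velero.io/backup-name label described in the CSI snapshot data movement docs):

kubectl -n velero get datauploads -l velero.io/backup-name=heimdall-backup -w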

blackpiglet commented 2 months ago

I added some comments to explain why there are logs complaining that no volume was found for the PVC.

The backup excludes the cluster-scoped resources explicitly.

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  excluded

As a result, the PV resources are not included.

level=warning msg="Cannot find info for PVC heimdall/ext-heimdall-config" backup=velero/heimdall-backup logSource="internal/volume/volumes_information.go:450"

This line means that PV information was not found for the specified PVC. That is expected, given this backup's filter settings.
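If the PV objects themselves should be included in the backup (so the volume info can be resolved and the warning avoided), a hypothetical variant of the schedule template would leave includeClusterResources unset, or set it to true, for example:

---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: heimdall-daily-aurora-backup
  namespace: velero
spec:
  schedule: 0 2 * * *
  template:
    includedNamespaces:
    - 'heimdall'
    # Leaving includeClusterResources unset (rather than false) lets Velero
    # pick up related cluster-scoped resources, such as the PVs bound to the
    # namespace's PVCs; includeClusterResources: true would include them all.
    snapshotVolumes: true
    snapshotMoveData: true
    storageLocation: default
    csiSnapshotTimeout: 10m
    ttl: 170h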

B1ue-W01f commented 2 months ago

Ah, thanks @blackpiglet, that makes a lot of sense now. Presumably that's fine, as the PV data is effectively backed up through Kopia anyway, hence why the issue was reported as a warning rather than an error.

Cheers all.