OADP Operator - Backups and CSI Volume snapshots running a long time to complete the Scheduled backups

Naveen-Kamagani commented 3 months ago

We use the OADP operator to back up Kubernetes resources and EBS volumes using the CSI snapshot feature. We have created a schedule to trigger backups daily, but each backup runs for nearly 2 hours. Is there any way to reduce the backup time to about 30 minutes, or at least to under 1 hour? Is there any way to trigger the CSI snapshots in parallel so that they complete sooner and the backup runs faster? We are open to different options.

DataProtectionApplication -

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: bcdr-data-protection-app
  namespace: oadp-velero
spec:
  backupLocations:
    - velero:
        config:
          profile: default
          region: us-east-1
        credential:
          key: cloud
          name: cloud-credentials
        default: true
        objectStorage:
          bucket: bcdr-stg-cp-01-bcdr-us-east-1-s3
          prefix: bcdr-stg-cp-01
        provider: aws
  configuration:
    restic:
      enable: false
    velero:
      defaultPlugins:
        - openshift
        - aws
        - csi
      featureFlags:
        - EnableCSI
      podConfig:
        resourceAllocations:
          limits:
            cpu: '3'
            memory: 3Gi
          requests:
            cpu: 500m
            memory: 512Mi
  features: {}
  podDnsConfig: {}
  snapshotLocations:
    - velero:
        config:
          profile: default
          region: us-east-1
        provider: aws
status:
  conditions:
    - lastTransitionTime: '2024-06-12T05:22:58Z'
      message: Reconcile complete
      reason: Complete
      status: 'True'
      type: Reconciled

Scheduled Backup -

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: bcdr-scheduler-20240611140020
  namespace: oadp-velero
  labels:
    by-squad: mcsp-controlplane
    for-product: trbo
    velero.io/schedule-name: bcdr-scheduler
    velero.io/storage-location: bcdr-s3-location
spec:
  volumeSnapshotLocations:
    - bcdr-volumesnapshot-location
  defaultVolumesToFsBackup: false
  excludedNamespaces:
    - openshift
    - openshift-apiserver
    - openshift-apiserver-operator
    - openshift-authentication
    - openshift-authentication-operator
    - openshift-cloud-credential-operator
    - openshift-cluster-machine-approver
    - openshift-cluster-node-tuning-operator
    - openshift-cluster-samples-operator
    - openshift-cluster-storage-operator
    - openshift-cluster-version
    - openshift-config
    - openshift-config-managed
    - openshift-console
    - openshift-console-operator
    - openshift-controller-manager
    - openshift-controller-manager-operator
    - openshift-dns
    - openshift-dns-operator
    - openshift-etcd
    - openshift-image-registry
    - openshift-infra
    - openshift-ingress
    - openshift-ingress-operator
    - openshift-insights
    - openshift-kni-infra
    - openshift-kube-apiserver
    - openshift-kube-apiserver-operator
    - openshift-kube-controller-manager
    - openshift-kube-controller-manager-operator
    - openshift-kube-proxy
    - openshift-kube-scheduler
    - openshift-kube-scheduler-operator
    - openshift-machine-api
    - openshift-machine-config-operator
    - openshift-monitoring
    - openshift-multus
    - openshift-network-operator
    - openshift-node
    - openshift-openstack-infra
    - openshift-operator-lifecycle-manager
    - openshift-ovirt-infra
    - openshift-service-ca
    - openshift-service-ca-operator
    - openshift-service-catalog-apiserver-operator
    - openshift-service-catalog-controller-manager-operator
    - openshift-user-workload-monitoring
    - velero
  csiSnapshotTimeout: 10m0s
  includedResources:
    - '*'
  ttl: 504h0m0s
  itemOperationTimeout: 4h0m0s
  metadata:
    labels:
      by-squad: mcsp-controlplane
      for-product: trbo
      velero.io/schedule-name: bcdr-scheduler
  storageLocation: bcdr-s3-location
  hooks: {}
  includeClusterResources: true
  includedNamespaces:
    - '*'
  snapshotVolumes: true
  excludedResources:
    - storageclasses.storage.k8s.io
    - imagestreams.image.openshift.io
  snapshotMoveData: false
status:
  formatVersion: 1.1.0
  backupItemOperationsCompleted: 890
  backupItemOperationsAttempted: 890
  progress:
    itemsBackedUp: 189113
    totalItems: 189113
  expiration: '2024-07-02T14:00:20Z'
  csiVolumeSnapshotsCompleted: 445
  csiVolumeSnapshotsAttempted: 445
  startTimestamp: '2024-06-11T14:00:21Z'
  version: 1
  completionTimestamp: '2024-06-11T15:47:48Z'
  phase: Completed

VolumeSnapshotLocation -

apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: bcdr-volumesnapshot-location
  namespace: oadp-velero
spec:
  config:
    profile: default
    region: us-east-1
  provider: aws

VolumeSnapshotClass -

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: bcdr-csi-vsc
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Retain

BackupStorageLocation -

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: bcdr-data-protection-app-1
  namespace: oadp-velero
  ownerReferences:
    - apiVersion: oadp.openshift.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: DataProtectionApplication
      name: bcdr-data-protection-app
      uid: f38b3e0f-6016-4fba-8ceb-21a9bd31f325
  labels:
    app.kubernetes.io/component: bsl
    app.kubernetes.io/instance: bcdr-data-protection-app-1
    app.kubernetes.io/managed-by: oadp-operator
    app.kubernetes.io/name: oadp-operator-velero
    openshift.io/oadp: 'True'
    openshift.io/oadp-registry: 'True'
spec:
  config:
    profile: default
    region: us-east-1
  credential:
    key: cloud
    name: cloud-credentials
  default: true
  objectStorage:
    bucket: bcdr-stg-cp-01-bcdr-us-east-1-s3
    prefix: bcdr-stg-cp-01
  provider: aws
status:
  lastSyncedTime: '2024-06-12T06:43:43Z'
  lastValidationTime: '2024-06-12T06:43:43Z'
  phase: Available

We do not want to use restic. Please suggest a way to make this more efficient, because we have another cluster with 300 namespaces, each with 12 volumes. The shared example has 40 namespaces and runs for nearly 4 hours; for the 300-namespace cluster, the backup runs for nearly 20 hours.
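As a rough cross-check of these figures, the run time can be derived from the `startTimestamp` and `completionTimestamp` in the Backup status above and scaled to the larger cluster. This is only a sketch: it assumes backup time grows linearly with the number of CSI snapshots, which the thread does not confirm, and the function names are illustrative.

```python
from datetime import datetime

# Values copied from the Backup status shown above.
START = "2024-06-11T14:00:21Z"
END = "2024-06-11T15:47:48Z"
SNAPSHOTS_COMPLETED = 445  # csiVolumeSnapshotsCompleted

# From the description of the larger cluster: 300 namespaces x 12 volumes.
TARGET_VOLUMES = 300 * 12


def parse_utc(ts: str) -> datetime:
    """Parse the timestamp format used in the Backup status."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def observed_seconds(start: str, end: str) -> float:
    """Wall-clock duration of the backup in seconds."""
    return (parse_utc(end) - parse_utc(start)).total_seconds()


def projected_hours(duration_s: float, snapshots: int, target_volumes: int) -> float:
    """Scale the observed run to another volume count.

    Assumption (not from the thread): time is proportional to snapshot count.
    """
    per_snapshot = duration_s / snapshots
    return per_snapshot * target_volumes / 3600


duration = observed_seconds(START, END)
print(f"observed: {duration:.0f} s, {duration / SNAPSHOTS_COMPLETED:.1f} s per snapshot")
print(f"projected: {projected_hours(duration, SNAPSHOTS_COMPLETED, TARGET_VOLUMES):.1f} h")
```

Under that linear assumption, the posted run (1 h 47 min for 445 snapshots) works out to roughly 14.5 seconds per snapshot and about 14.5 hours for 3,600 volumes.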

Naveen-Kamagani commented 3 months ago

Can anyone help with this issue?

sseago commented 3 months ago

It looks like you have almost 900 snapshots to be taken. While most of the snapshot+datamover work can be done in parallel (spread across the nodes that the associated pods are running on), there is some initial work at the start of each snapshot that must be done synchronously, and it takes approximately 7-10 seconds per snapshot. This accounts for the bulk of your 2 hours.

We are working on a feature for the future which will allow the entire snapshot/pv backup process to happen in parallel via several controller threads, but that is not available today. Once that enhancement is implemented, you will see a significant reduction in backup times for this use case.
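The serial startup cost described above can be estimated with simple arithmetic. The sketch below takes the "almost 900" figure to be 890, the `backupItemOperationsAttempted` value from the posted Backup status; that mapping is an assumption, as is the helper name.

```python
def serialized_minutes(operations: int, seconds_each: float) -> float:
    """Total time if every operation waits for the previous one to start."""
    return operations * seconds_each / 60


# Assumption: "almost 900" corresponds to backupItemOperationsAttempted: 890.
OPERATIONS = 890

# 7-10 seconds is the per-snapshot synchronous cost quoted in the comment above.
for seconds_each in (7, 10):
    minutes = serialized_minutes(OPERATIONS, seconds_each)
    print(f"{seconds_each} s each -> about {minutes:.0f} minutes")
```

This gives roughly 104 to 148 minutes, which brackets the observed run of about 107 minutes and is consistent with the serial step dominating the backup time.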

Naveen-Kamagani commented 3 months ago

@sseago Backups are triggered using the Velero schedule. How do we know the snapshots triggered daily are incremental by default?

sseago commented 3 months ago

@Naveen-Kamagani Whether a CSI snapshot is incremental is determined by the CSI driver, not by Velero. If you're using data movement (although from the backup configuration above, it looks like you are not), backup content is always stored incrementally in the Backup Storage Location if there is a prior backup for that volume.

sseago commented 3 months ago

@Naveen-Kamagani But note that even with incremental backups, the 7-10 seconds at the beginning of each snapshot, before we can move on to the next one, will still be there. So the 2-hour time for this backup will only improve once the parallel backup design is implemented.

Naveen-Kamagani commented 3 months ago

@sseago Is there a design in place for triggering snapshot backups of EBS volumes in parallel through the CSI driver? Could you please let us know how many months it will take to implement this feature?

sseago commented 3 months ago

@Naveen-Kamagani There is an open design PR for this: https://github.com/vmware-tanzu/velero/issues/7474

My current expectation is that Phase 1 will be implemented in Velero 1.15, and Phase 2 (which actually puts parallel item backup in place) in Velero 1.16. Velero 1.14 was just released today, so we're talking two releases in the future. There's no release date for 1.16 yet, but I imagine it will be at some point during the first half of 2025. Maybe the first quarter, but that's not certain.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been stalled for 14 days with no activity.

vmware-tanzu / velero

OADP Operator - Backups and CSI Volume snapshots running a long time to complete the Scheduled backups #7887