vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

OADP Operator - All scheduled backups are failing #7886

Closed: Naveen-Kamagani closed this issue 1 week ago

Naveen-Kamagani commented 3 months ago
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    velero.io/resource-timeout: 10m0s
    velero.io/source-cluster-k8s-gitversion: v1.26.15+4818370
    velero.io/source-cluster-k8s-major-version: '1'
    velero.io/source-cluster-k8s-minor-version: '26'
  resourceVersion: '822753579'
  name: bcdr-scheduler-20240611110024
  uid: b2f10f1c-2b86-4361-953e-95f514d09ebe
  creationTimestamp: '2024-06-11T11:00:24Z'
  generation: 3
  managedFields:
    - apiVersion: velero.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:velero.io/resource-timeout': {}
            'f:velero.io/source-cluster-k8s-gitversion': {}
            'f:velero.io/source-cluster-k8s-major-version': {}
            'f:velero.io/source-cluster-k8s-minor-version': {}
          'f:labels':
            .: {}
            'f:by-squad': {}
            'f:for-product': {}
            'f:velero.io/schedule-name': {}
            'f:velero.io/storage-location': {}
        'f:spec':
          'f:itemOperationTimeout': {}
          'f:snapshotVolumes': {}
          'f:metadata':
            .: {}
            'f:labels':
              .: {}
              'f:by-squad': {}
              'f:for-product': {}
              'f:velero.io/schedule-name': {}
          'f:excludedResources': {}
          'f:includedNamespaces': {}
          'f:volumeSnapshotLocations': {}
          .: {}
          'f:snapshotMoveData': {}
          'f:includedResources': {}
          'f:defaultVolumesToFsBackup': {}
          'f:excludedNamespaces': {}
          'f:ttl': {}
          'f:csiSnapshotTimeout': {}
          'f:storageLocation': {}
          'f:hooks': {}
          'f:includeClusterResources': {}
        'f:status':
          .: {}
          'f:completionTimestamp': {}
          'f:expiration': {}
          'f:failureReason': {}
          'f:formatVersion': {}
          'f:phase': {}
          'f:startTimestamp': {}
          'f:version': {}
      manager: velero-server
      operation: Update
      time: '2024-06-11T11:03:26Z'
  namespace: oadp-velero
  labels:
    by-squad: mcsp-controlplane
    for-product: trbo
    velero.io/schedule-name: bcdr-scheduler
    velero.io/storage-location: bcdr-s3-location
spec:
  volumeSnapshotLocations:
    - bcdr-volumesnapshot-location
  defaultVolumesToFsBackup: false
  excludedNamespaces:
    - openshift
    - openshift-apiserver
    - openshift-apiserver-operator
    - openshift-authentication
    - openshift-authentication-operator
    - openshift-cloud-credential-operator
    - openshift-cluster-machine-approver
    - openshift-cluster-node-tuning-operator
    - openshift-cluster-samples-operator
    - openshift-cluster-storage-operator
    - openshift-cluster-version
    - openshift-config
    - openshift-config-managed
    - openshift-console
    - openshift-console-operator
    - openshift-controller-manager
    - openshift-controller-manager-operator
    - openshift-dns
    - openshift-dns-operator
    - openshift-etcd
    - openshift-image-registry
    - openshift-infra
    - openshift-ingress
    - openshift-ingress-operator
    - openshift-insights
    - openshift-kni-infra
    - openshift-kube-apiserver
    - openshift-kube-apiserver-operator
    - openshift-kube-controller-manager
    - openshift-kube-controller-manager-operator
    - openshift-kube-proxy
    - openshift-kube-scheduler
    - openshift-kube-scheduler-operator
    - openshift-machine-api
    - openshift-machine-config-operator
    - openshift-monitoring
    - openshift-multus
    - openshift-network-operator
    - openshift-node
    - openshift-openstack-infra
    - openshift-operator-lifecycle-manager
    - openshift-ovirt-infra
    - openshift-service-ca
    - openshift-service-ca-operator
    - openshift-service-catalog-apiserver-operator
    - openshift-service-catalog-controller-manager-operator
    - openshift-user-workload-monitoring
    - velero
  csiSnapshotTimeout: 10m0s
  includedResources:
    - '*'
  ttl: 504h0m0s
  itemOperationTimeout: 4h0m0s
  metadata:
    labels:
      by-squad: mcsp-controlplane
      for-product: trbo
      velero.io/schedule-name: bcdr-scheduler
  storageLocation: bcdr-s3-location
  hooks: {}
  includeClusterResources: true
  includedNamespaces:
    - '*'
  snapshotVolumes: true
  excludedResources:
    - storageclasses.storage.k8s.io
    - imagestreams.image.openshift.io
  snapshotMoveData: false
status:
  completionTimestamp: '2024-06-11T11:03:26Z'
  expiration: '2024-07-02T11:00:24Z'
  failureReason: >-
    found a backup with status "InProgress" during the server starting, mark it
    as "Failed"
  formatVersion: 1.1.0
  phase: Failed
  startTimestamp: '2024-06-11T11:00:24Z'
  version: 1

OADP Operator version is v1.3.2

oc -n oadp-velero exec deployment/velero -c velero -- ./velero version
Client:
        Version: v1.12.4-OADP
        Git commit: -
Server:
        Version: v1.12.4-OADP

Velero log:

time="2024-06-11T11:03:22Z" level=info msg="All Velero custom resource definitions exist" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:513"
time="2024-06-11T11:03:22Z" level=warning msg="Velero node agent not found; pod volume backups/restores will not work until it's created" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:585"
I0611 11:03:25.942975 1 request.go:690] Waited for 3.696252465s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
time="2024-06-11T11:03:26Z" level=warning msg="found a backup with status \"InProgress\" during the server starting, mark it as \"Failed\"" backup=bcdr-scheduler-20240611110024 logSource="/remote-source/velero/app/pkg/cmd/server/server.go:1077"

No backup was triggered manually, yet the scheduled backup still ends up in a Failed state.

Naveen-Kamagani commented 3 months ago

Can someone help me with this issue?

time="2024-06-13T12:03:42Z" level=info msg="All Velero custom resource definitions exist" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:513"
time="2024-06-13T12:03:42Z" level=warning msg="Velero node agent not found; pod volume backups/restores will not work until it's created" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:585"
time="2024-06-13T12:03:46Z" level=warning msg="found a backup with status \"InProgress\" during the server starting, mark it as \"Failed\"" backup=bcdr-scheduler-20240613120039 logSource="/remote-source/velero/app/pkg/cmd/server/server.go:1077"
time="2024-06-13T12:03:46Z" level=info msg="Starting controllers" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:640"
sseago commented 3 months ago

The backup that was marked failed was InProgress when the velero pod crashed or was killed. If you didn't shut velero down on purpose, then the next thing to do would be to figure out why velero crashed -- the failed backup is a symptom of the crash. That's what will happen if velero restarts when a backup or restore is running.
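A minimal way to confirm that the Velero pod actually restarted, and to capture the log of the crashed container instance, assuming the oadp-velero namespace from the manifests above (if the pod was replaced rather than restarted in place, the old log is gone and tailing during the next run, as suggested below, is the only option):

# Check the RESTARTS column and the pod age
oc -n oadp-velero get pods

# Logs from the previous container instance; available only if the container
# restarted inside the same pod rather than being rescheduled into a new pod
oc -n oadp-velero logs deployment/velero -c velero --previous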

Naveen-Kamagani commented 3 months ago

@sseago how can we debug this from the Velero pod logs or the OADP controller logs? There is no trace of the old backup that failed. I have deleted all the Failed backups and then tried a manual backup, but I am still not able to run a backup successfully.

Expected resolution:

  1. Velero backups created by the Velero schedule run to completion without failures.
sseago commented 3 months ago

@Naveen-Kamagani if the velero pod is crashing, you would probably need to look at the pod logs (although those could be lost when the pod crashes -- if this happens consistently, perhaps tailing the logs during the backup would help), and there may be pod events related to the crash.
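A sketch of what that tailing could look like, reusing the in-pod velero binary and the bcdr-scheduler schedule that appear earlier in this thread:

# Terminal 1: stream the Velero server log to a local file so it survives a pod replacement
oc -n oadp-velero logs -f deployment/velero -c velero | tee velero-run.log

# Terminal 2: trigger a backup from the existing schedule and watch it progress
oc -n oadp-velero exec deployment/velero -c velero -- ./velero backup create --from-schedule bcdr-scheduler
oc -n oadp-velero exec deployment/velero -c velero -- ./velero backup get
oc -n oadp-velero get events --watch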

Naveen-Kamagani commented 3 months ago

@sseago there is nothing in the velero pod logs. Latest log:

time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemAction logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/21-role-bindings-restore-plugin
time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemAction logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/22-cluster-role-bindings-restore-plugin
time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemAction logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/23-imagetag-restore-plugin
time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemAction logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/24-horizontalpodautoscaler-restore-plugin
time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemActionV2 logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/04-imagestreamtag-restore-plugin
I0614 12:02:56.572830       1 request.go:690] Waited for 1.025915565s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/k8s.cni.cncf.io/v1?timeout=32s
time="2024-06-14T12:02:59Z" level=info msg="Metrics server is starting to listen" addr=":8080" logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" logger=controller-runtime.metrics
time="2024-06-14T12:02:59Z" level=info msg="Waiting for informer caches to sync" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:621"
time="2024-06-14T12:03:00Z" level=info msg="Done waiting for informer caches to sync" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:623"
time="2024-06-14T12:03:00Z" level=info msg="Informer cache synced" informer="*v1.VolumeSnapshot" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:630"
time="2024-06-14T12:03:00Z" level=info msg="Checking existence of namespace." logSource="/remote-source/velero/app/pkg/cmd/server/server.go:449" namespace=oadp-velero
time="2024-06-14T12:03:00Z" level=info msg="Namespace exists" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:455" namespace=oadp-velero
I0614 12:03:06.585463       1 request.go:690] Waited for 1.34602859s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/resolution.tekton.dev/v1beta1?timeout=32s
time="2024-06-14T12:03:11Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:484"
I0614 12:03:16.585549       1 request.go:690] Waited for 5.395845137s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v1?timeout=32s
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=BackupRepository logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=Backup logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=VolumeSnapshotLocation logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=DownloadRequest logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=ResticRepository logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=PodVolumeRestore logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=Restore logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=DeleteBackupRequest logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=Schedule logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=BackupStorageLocation logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=PodVolumeBackup logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=ServerStatusRequest logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=DataDownload logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=DataUpload logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="All Velero custom resource definitions exist" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:513"
time="2024-06-14T12:03:22Z" level=warning msg="Velero node agent not found; pod volume backups/restores will not work until it's created" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:585"
I0614 12:03:26.594544       1 request.go:690] Waited for 3.744737234s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/pipelinesascode.tekton.dev/v1alpha1?timeout=32s
time="2024-06-14T12:03:27Z" level=warning msg="found a backup with status \"InProgress\" during the server starting, mark it as \"Failed\"" backup=bcdr-scheduler-20240614120012 logSource="/remote-source/velero/app/pkg/cmd/server/server.go:1077"
time="2024-06-14T12:03:27Z" level=info msg="Starting controllers" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:640"
time="2024-06-14T12:03:27Z" level=info msg="Starting metric server at address [:8085]" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:645"
time="2024-06-14T12:03:27Z" level=info msg="Server starting..." logSource="/remote-source/velero/app/pkg/cmd/server/server.go:969"
time="2024-06-14T12:03:27Z" level=info msg="Starting server" addr="{\"IP\":\"::\",\"Port\":8080,\"Zone\":\"\"}" kind=metrics logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" path=/metrics
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backupstoragelocation controllerGroup=velero.io controllerKind=BackupStorageLocation logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.BackupStorageLocation"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backupstoragelocation controllerGroup=velero.io controllerKind=BackupStorageLocation logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.BackupStorageLocationList"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backup controllerGroup=velero.io controllerKind=Backup logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.Backup"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=deletebackuprequest controllerGroup=velero.io controllerKind=DeleteBackupRequest logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.DeleteBackupRequest"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backupstoragelocation controllerGroup=velero.io controllerKind=BackupStorageLocation logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.BackupStorageLocation"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backup controllerGroup=velero.io controllerKind=Backup logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.Backup"
time="2024-06-14T12:03:27Z" level=info msg="Starting Controller" controller=backup controllerGroup=velero.io controllerKind=Backup logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108"
Naveen-Kamagani commented 3 months ago

These are the events from the namespace (nothing useful in the events either):

oc get events -n oadp-velero
LAST SEEN   TYPE      REASON                    OBJECT                                               MESSAGE
2m29s       Warning   DeprecationResticConfig   dataprotectionapplication/bcdr-data-protection-app   (Deprecation Warning) Use nodeAgent instead of restic, which is deprecated and will be removed with the OADP 1.4
naveenkamagani@Naveens-MacBook-Pro Test % oc get events -n oadp-velero --watch
LAST SEEN   TYPE      REASON                    OBJECT                                               MESSAGE
4m49s       Warning   DeprecationResticConfig   dataprotectionapplication/bcdr-data-protection-app   (Deprecation Warning) Use nodeAgent instead of restic, which is deprecated and will be removed with the OADP 1.4
0s          Warning   DeprecationResticConfig   dataprotectionapplication/bcdr-data-protection-app   (Deprecation Warning) Use nodeAgent instead of restic, which is deprecated and will be removed with the OADP 1.4
0s          Normal    Pulling                   pod/velero-685dc7b97f-b9546                          Pulling image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:9ab6bd1c70cb1fb4a641ba977e9b8ebfa727109756091e2e74026f5ecdc9841f"
0s          Normal    Pulled                    pod/velero-685dc7b97f-b9546                          Successfully pulled image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:9ab6bd1c70cb1fb4a641ba977e9b8ebfa727109756091e2e74026f5ecdc9841f" in 3.638604661s (3.638613735s including waiting)
0s          Normal    Created                   pod/velero-685dc7b97f-b9546                          Created container velero
0s          Normal    Started                   pod/velero-685dc7b97f-b9546                          Started container velero
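As an aside, the DeprecationResticConfig warning above, together with the "Velero node agent not found" warning in the server log, suggests the DataProtectionApplication still uses the deprecated restic configuration. A sketch of the nodeAgent form for OADP 1.3, based on the documented DPA schema rather than on anything shown in this thread, would be:

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: bcdr-data-protection-app
  namespace: oadp-velero
spec:
  configuration:
    nodeAgent:
      enable: true        # replaces the deprecated spec.configuration.restic.enable
      uploaderType: kopia # or restic

This is unrelated to the backup being marked Failed, but it would remove the recurring warning and deploy the node agent needed for pod volume backups.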
Naveen-Kamagani commented 3 months ago

@sseago any thoughts on how to fix this issue?

shubham-pampattiwar commented 3 months ago

@Naveen-Kamagani How did you delete your failed Velero backups? Did you use the velero CLI, or did you use the oc CLI? The velero CLI removes them from object storage, but the oc CLI does not. If the failed backups are not removed from object storage, Velero re-syncs them onto the cluster. Please remove the backups manually from object storage or use the velero CLI to remove them.
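For reference, a sketch of the two deletion paths described above, reusing the in-pod velero binary and one of the backup names from this thread:

# velero CLI: creates a DeleteBackupRequest, which also removes the backup from object storage
oc -n oadp-velero exec deployment/velero -c velero -- ./velero backup delete bcdr-scheduler-20240611110024 --confirm

# oc/kubectl: deletes only the in-cluster Backup resource; a copy left in object storage
# will be re-synced onto the cluster by Velero
oc -n oadp-velero delete backup bcdr-scheduler-20240611110024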

sseago commented 3 months ago

@Naveen-Kamagani those pod logs are from velero restarting after it crashed. Logs happening before/during the crash would be helpful -- post-crash is just normal velero startup.

sseago commented 3 months ago

@shubham-pampattiwar they may not be in object storage, depending on where in the process they were when Velero crashed.

sseago commented 3 months ago

In any case, if this is happening predictably, the next thing to do would be to watch the pod logs as you start the backup. If Velero crashes every time, then maybe there's a stack trace or message upon crash.
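One additional check worth making here, offered as an assumption on my part rather than anything confirmed in the thread, is whether the container is being OOM-killed partway through the backup; the container's last termination state records this (the pod name below is a placeholder):

# List pods, then inspect the last termination state of the velero container;
# a Reason of OOMKilled would explain a mid-backup restart
oc -n oadp-velero get pods
oc -n oadp-velero describe pod <velero-pod-name> | grep -A 5 'Last State'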

v1nsai commented 2 months ago

I'm not using the OADP operator, but I am getting the same error with file system backups: they are marked "Failed" because they were "InProgress", likely due to node-agent crashes. It's always the same pods, but I haven't been able to find any similarities between them.

I dumped the logs for velero and all the node agents; the last message the node agents log before crashing is "Searching for parent snapshot". I've completely disabled volume snapshots, so I'm not sure why this happens.

For now, simply excluding the 5 pods I'm getting the error on has worked. Backups now complete successfully and no other pods have started having issues.
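The comment doesn't say how the pods were excluded; two common ways to do it in Velero, with hypothetical pod and volume names, are:

# Exclude the pod (and anything else carrying this label) from backups entirely
kubectl label pod my-noisy-pod velero.io/exclude-from-backup=true

# Or keep the pod in the backup but skip specific volumes during file system backup
kubectl annotate pod my-noisy-pod backup.velero.io/backup-volumes-excludes=data,cache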

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

github-actions[bot] commented 1 week ago

This issue was closed because it has been stalled for 14 days with no activity.