Closed. Naveen-Kamagani closed this issue 1 week ago.
Can someone help me with this issue?
time="2024-06-13T12:03:42Z" level=info msg="All Velero custom resource definitions exist" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:513"
time="2024-06-13T12:03:42Z" level=warning msg="Velero node agent not found; pod volume backups/restores will not work until it's created" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:585"
time="2024-06-13T12:03:46Z" level=warning msg="found a backup with status \"InProgress\" during the server starting, mark it as \"Failed\"" backup=bcdr-scheduler-20240613120039 logSource="/remote-source/velero/app/pkg/cmd/server/server.go:1077"
time="2024-06-13T12:03:46Z" level=info msg="Starting controllers" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:640"
The backup that was marked Failed was InProgress when the Velero pod crashed or was killed. If you didn't shut Velero down on purpose, the next step is to figure out why it crashed; the failed backup is a symptom of the crash. This is what happens whenever Velero restarts while a backup or restore is running.
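A quick way to check why the pod restarted is to look at the previous container's termination state and logs. A minimal sketch, assuming the `oadp-velero` namespace from the logs above and using the pod name that appears in the events later in this thread (substitute your own pod name):

```shell
# Show the previous termination reason (e.g. OOMKilled) and exit code.
# lastState.terminated is populated only if the container restarted.
oc get pod velero-685dc7b97f-b9546 -n oadp-velero \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\t"}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'

# Logs from the previous (crashed) container instance, if still on the node:
oc logs -n oadp-velero deployment/velero --previous
```

An `OOMKilled` reason with exit code 137 would point at memory limits rather than a Velero bug.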
@sseago how can we debug this from the Velero pod logs or the OADP controller logs? There is no trace of the old backup that failed. I deleted all the Failed backups and then tried a manual backup, but it still does not complete successfully.
Expected resolution -
@Naveen-Kamagani if the Velero pod is crashing, you would probably need to look at the pod logs (though those can be lost when the pod crashes; if this happens consistently, tailing the logs during the backup may help), and there may be pod events related to the crash.
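The two checks above can be sketched as follows (the pod name is taken from the events output later in this thread and is only an example; adjust namespace and names to your deployment):

```shell
# Crash-related events for the Velero pod, newest last:
oc get events -n oadp-velero \
  --field-selector involvedObject.name=velero-685dc7b97f-b9546 \
  --sort-by=.lastTimestamp

# Tail the live log into a local file so it survives a pod restart:
oc logs -n oadp-velero -f deployment/velero | tee velero-live.log
```

Writing the stream through `tee` means the tail of the log is preserved on your workstation even if the container is killed and its log is rotated away.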
@sseago there is nothing in the Velero pod logs. Latest log -
time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemAction logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/21-role-bindings-restore-plugin
time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemAction logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/22-cluster-role-bindings-restore-plugin
time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemAction logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/23-imagetag-restore-plugin
time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemAction logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/24-horizontalpodautoscaler-restore-plugin
time="2024-06-14T12:02:55Z" level=info msg="registering plugin" command=/plugins/velero-plugins kind=RestoreItemActionV2 logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/registry.go:101" name=openshift.io/04-imagestreamtag-restore-plugin
I0614 12:02:56.572830 1 request.go:690] Waited for 1.025915565s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/k8s.cni.cncf.io/v1?timeout=32s
time="2024-06-14T12:02:59Z" level=info msg="Metrics server is starting to listen" addr=":8080" logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" logger=controller-runtime.metrics
time="2024-06-14T12:02:59Z" level=info msg="Waiting for informer caches to sync" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:621"
time="2024-06-14T12:03:00Z" level=info msg="Done waiting for informer caches to sync" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:623"
time="2024-06-14T12:03:00Z" level=info msg="Informer cache synced" informer="*v1.VolumeSnapshot" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:630"
time="2024-06-14T12:03:00Z" level=info msg="Checking existence of namespace." logSource="/remote-source/velero/app/pkg/cmd/server/server.go:449" namespace=oadp-velero
time="2024-06-14T12:03:00Z" level=info msg="Namespace exists" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:455" namespace=oadp-velero
I0614 12:03:06.585463 1 request.go:690] Waited for 1.34602859s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/resolution.tekton.dev/v1beta1?timeout=32s
time="2024-06-14T12:03:11Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:484"
I0614 12:03:16.585549 1 request.go:690] Waited for 5.395845137s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v1?timeout=32s
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=BackupRepository logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=Backup logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=VolumeSnapshotLocation logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=DownloadRequest logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=ResticRepository logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=PodVolumeRestore logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=Restore logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=DeleteBackupRequest logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=Schedule logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=BackupStorageLocation logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=PodVolumeBackup logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=ServerStatusRequest logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=DataDownload logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="Found custom resource" kind=DataUpload logSource="/remote-source/velero/app/pkg/cmd/server/server.go:495"
time="2024-06-14T12:03:22Z" level=info msg="All Velero custom resource definitions exist" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:513"
time="2024-06-14T12:03:22Z" level=warning msg="Velero node agent not found; pod volume backups/restores will not work until it's created" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:585"
I0614 12:03:26.594544 1 request.go:690] Waited for 3.744737234s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/pipelinesascode.tekton.dev/v1alpha1?timeout=32s
time="2024-06-14T12:03:27Z" level=warning msg="found a backup with status \"InProgress\" during the server starting, mark it as \"Failed\"" backup=bcdr-scheduler-20240614120012 logSource="/remote-source/velero/app/pkg/cmd/server/server.go:1077"
time="2024-06-14T12:03:27Z" level=info msg="Starting controllers" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:640"
time="2024-06-14T12:03:27Z" level=info msg="Starting metric server at address [:8085]" logSource="/remote-source/velero/app/pkg/cmd/server/server.go:645"
time="2024-06-14T12:03:27Z" level=info msg="Server starting..." logSource="/remote-source/velero/app/pkg/cmd/server/server.go:969"
time="2024-06-14T12:03:27Z" level=info msg="Starting server" addr="{\"IP\":\"::\",\"Port\":8080,\"Zone\":\"\"}" kind=metrics logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" path=/metrics
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backupstoragelocation controllerGroup=velero.io controllerKind=BackupStorageLocation logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.BackupStorageLocation"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backupstoragelocation controllerGroup=velero.io controllerKind=BackupStorageLocation logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.BackupStorageLocationList"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backup controllerGroup=velero.io controllerKind=Backup logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.Backup"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=deletebackuprequest controllerGroup=velero.io controllerKind=DeleteBackupRequest logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.DeleteBackupRequest"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backupstoragelocation controllerGroup=velero.io controllerKind=BackupStorageLocation logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.BackupStorageLocation"
time="2024-06-14T12:03:27Z" level=info msg="Starting EventSource" controller=backup controllerGroup=velero.io controllerKind=Backup logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108" source="kind source: *v1.Backup"
time="2024-06-14T12:03:27Z" level=info msg="Starting Controller" controller=backup controllerGroup=velero.io controllerKind=Backup logSource="/remote-source/velero/deps/gomod/pkg/mod/github.com/bombsimon/logrusr/v3@v3.0.0/logrusr.go:108"
These are the events from the namespace (nothing relevant in the events either) -
oc get events -n oadp-velero
LAST SEEN TYPE REASON OBJECT MESSAGE
2m29s Warning DeprecationResticConfig dataprotectionapplication/bcdr-data-protection-app (Deprecation Warning) Use nodeAgent instead of restic, which is deprecated and will be removed with the OADP 1.4
naveenkamagani@Naveens-MacBook-Pro Test % oc get events -n oadp-velero --watch
LAST SEEN TYPE REASON OBJECT MESSAGE
4m49s Warning DeprecationResticConfig dataprotectionapplication/bcdr-data-protection-app (Deprecation Warning) Use nodeAgent instead of restic, which is deprecated and will be removed with the OADP 1.4
0s Warning DeprecationResticConfig dataprotectionapplication/bcdr-data-protection-app (Deprecation Warning) Use nodeAgent instead of restic, which is deprecated and will be removed with the OADP 1.4
0s Normal Pulling pod/velero-685dc7b97f-b9546 Pulling image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:9ab6bd1c70cb1fb4a641ba977e9b8ebfa727109756091e2e74026f5ecdc9841f"
0s Normal Pulled pod/velero-685dc7b97f-b9546 Successfully pulled image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:9ab6bd1c70cb1fb4a641ba977e9b8ebfa727109756091e2e74026f5ecdc9841f" in 3.638604661s (3.638613735s including waiting)
0s Normal Created pod/velero-685dc7b97f-b9546 Created container velero
0s Normal Started pod/velero-685dc7b97f-b9546 Started container velero
@sseago any thoughts on how to fix this issue ?
@Naveen-Kamagani How did you delete your failed Velero backups? Did you use the velero CLI or the oc CLI? The velero CLI removes them from object storage, but the oc CLI does not. If the failed backups are not removed from object storage, Velero re-syncs them onto the cluster. Please remove the backups manually from object storage, or use the velero CLI to remove them.
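The difference between the two deletion paths can be sketched like this (backup name and namespace are placeholders):

```shell
# Removes only the Backup CR from the cluster; Velero will re-sync the
# backup from object storage at the next backup-sync interval:
oc delete backup <backup-name> -n oadp-velero

# Creates a DeleteBackupRequest, so Velero also removes the backup data
# from object storage; --confirm skips the interactive prompt:
velero backup delete <backup-name> -n oadp-velero --confirm
```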
@Naveen-Kamagani those pod logs are from Velero restarting after it crashed. Logs from before or during the crash would be helpful; everything post-crash is just normal Velero startup.
@shubham-pampattiwar they may not be in object storage, depending on where in the process they were when Velero crashed.
In any case, if this happens predictably, the next step is to watch the pod logs as you start the backup. If Velero crashes every time, there may be a stack trace or error message at the moment of the crash.
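One way to reproduce under observation is to trigger a backup by hand and follow it; a sketch, with illustrative backup and namespace names:

```shell
# Kick off a manual backup and block until it finishes (or fails):
velero backup create debug-backup -n oadp-velero --wait

# Inspect status, errors, and per-resource details afterwards:
velero backup describe debug-backup -n oadp-velero --details
velero backup logs debug-backup -n oadp-velero
```

Run the log tail from the earlier comment in a second terminal while this is in flight, so the last lines before any crash are captured.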
I'm not using the OADP operator, but I am getting the same error when doing FS backups: the backup is marked "Failed" because it was "InProgress", likely due to node-agent crashes. It's always the same pods, but I haven't been able to find any similarities between them.
I dumped the logs for Velero and all the node agents; the last message logged by the node agents before crashing is "Searching for parent snapshot". I've completely disabled volume snapshots, so I'm not sure why this happens.
For now, simply excluding the five pods the error occurs on has worked. Backups now complete successfully, and no other pods have started having issues.
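For reference, Velero offers two standard exclusion mechanisms that could be used for this workaround; pod, namespace, and volume names below are placeholders:

```shell
# Exclude a pod from backups entirely via the exclude-from-backup label:
oc label pod <pod-name> -n <namespace> velero.io/exclude-from-backup=true

# Or keep the pod in the backup but opt specific volumes out of
# file-system (pod volume) backup:
oc annotate pod <pod-name> -n <namespace> \
  backup.velero.io/backup-volumes-excludes=<volume-name>
```

The volume-level annotation is the narrower option if only the FS backup of certain volumes is triggering the node-agent crash.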
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
This issue was closed because it has been stalled for 14 days with no activity.
OADP Operator version is v1.3.2.
velero log -
No backups were triggered, yet a backup still goes into the Failed state.