vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

error getting restic backup progress #5937

Closed lklkxcxc closed 1 year ago

lklkxcxc commented 1 year ago

What steps did you take and what happened:

The backup of a volume hung, and the "Items backed up" count stopped progressing.

Checking the restic log shows this error:

time="2023-03-01T11:21:02Z" level=error msg="error getting restic backup progress" backup=velero/harbor-20230301184815 controller=pod-volume-backup error="unable to decode backup JSON line: {\"message_type\":\"status\",\"seconds_elapsed\":1856,\"percent_done\":0.03797256558287685,\"total_files\":28583,\"files_done\":1129,\"total_bytes\":742600171549,\"bytes_done\":28198433716,\"current_files\":[\"/docker/registry/v2/blobs/sha256/02/02f8a685e66f6cbd60f0ff2b11: unexpected end of JSON input" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:149" error.function=github.com/vmware-tanzu/velero/pkg/restic.decodeBackupStatusLine logSource="pkg/restic/exec_commands.go:100" name=harbor-20230301184815-chhfg namespace=velero

What did you expect to happen:

The backup job completed.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use `velero debug --backup <backupname> --restore <restorename>` to generate the support bundle and attach it to this issue. For more options, please refer to `velero debug --help`.

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Phase: InProgress

Errors: 0 Warnings: 0

Namespaces: Included: * Excluded:

Resources: Included: * Excluded: Cluster-scoped: auto

Label selector:

Storage Location: default

Velero-Native Snapshot PVs: auto

TTL: 720h0m0s

Hooks:

Backup Format Version: 1.1.0

Started: 2023-03-01 18:48:15 +0800 CST Completed: <n/a>

Expiration: 2023-03-31 18:48:15 +0800 CST

Estimated total items to be backed up: 1669 Items backed up so far: 82

Velero-Native Snapshots:

Restic Backups (specify --details for more information): Completed: 3 In Progress: 1



**Environment:**

- Velero version (use `velero version`): 1.7.1
- Kubernetes version (use `kubectl version`): v1.18.10
- Kubernetes installer & version: v1.18.10
- OS (e.g. from `/etc/os-release`): CentOS Linux 7.7

lklkxcxc commented 1 year ago

time="2023-03-01T10:50:01Z" level=info msg="Backup starting" backup=velero/harbor-20230301184815 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:191" name=harbor-20230301184815-chhfg namespace=velero
time="2023-03-01T10:50:01Z" level=info msg="Looking for most recent completed pod volume backup for this PVC" backup=velero/harbor-20230301184815 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:340" name=harbor-20230301184815-chhfg namespace=velero pvcUID=2dcb3d8b-2023-4600-8b74-2942ddd04259
time="2023-03-01T10:50:01Z" level=info msg="No completed pod volume backup found for PVC" backup=velero/harbor-20230301184815 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:370" name=harbor-20230301184815-chhfg namespace=velero pvcUID=2dcb3d8b-2023-4600-8b74-2942ddd04259
time="2023-03-01T10:50:01Z" level=info msg="No parent snapshot found for PVC, not using --parent flag for this backup" backup=velero/harbor-20230301184815 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:277" name=harbor-20230301184815-chhfg namespace=velero

Lyndon-Li commented 1 year ago

It looks like the stdout from the restic command was truncated somehow: {\"message_type\":\"status\",\"seconds_elapsed\":1856,\"percent_done\":0.03797256558287685,\"total_files\":28583,\"files_done\":1129,\"total_bytes\":742600171549,\"bytes_done\":28198433716,\"current_files\":[\"/docker/registry/v2/blobs/sha256/02/02f8a685e66f6cbd60f0ff2b11
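For context, restic's `--json` mode emits one JSON object per line on stdout, and Velero parses each line individually (in `decodeBackupStatusLine`). If the stream is cut off mid-line, the partial line is not valid JSON, which produces exactly the "unexpected end of JSON input" error above. A minimal sketch of that failure mode, using an abbreviated sample line and `python3` only as a convenient stand-in JSON parser:

```shell
# A complete restic status line parses; a line cut off mid-write does not.
full='{"message_type":"status","seconds_elapsed":1856,"percent_done":0.037}'

echo "$full" | python3 -m json.tool >/dev/null && echo "complete line: ok"

# Keep only the first 40 characters, simulating truncated stdout.
echo "${full:0:40}" | python3 -m json.tool >/dev/null 2>&1 \
  || echo "truncated line: parse failed (unexpected end of input)"
```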

lklkxcxc commented 1 year ago

@Lyndon-Li I am backing up an 800 GB Harbor registry; the error above was caused by a restic timeout. The problem was resolved once I increased the restic timeout. But the restore job still has not completed; it is in progress, having shown 100% and now 98%.
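For reference, the timeout in question is a Velero server flag. A hedged sketch of raising it on an existing install follows; `--restic-timeout` (default 1h) is the flag name in the Velero 1.7 era, and later releases replace it with `--fs-backup-timeout`, so check `velero server --help` for your version. The 12h value is only an example.

```shell
# Hypothetical example: append --restic-timeout to the Velero server args so
# long-running pod volume backups are not killed after the default 1h.
kubectl -n velero patch deployment velero --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/args/-",
   "value": "--restic-timeout=12h"}
]'
```

This is a cluster configuration change, so apply and verify it in your own environment.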

Lyndon-Li commented 1 year ago

For large data sizes, the file system backup (formerly known as restic backup) takes time and resources, especially memory. If there is not enough memory, the restic pod will be killed by Kubernetes due to OOM. I am not sure whether this happened in the current environment, but if it did, it could be the cause of the error below: `level=error msg="error getting restic backup progress"`. The sequence would be that the restic process was killed first, so Velero received incomplete output.
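One way to check for a past OOM kill is to inspect the last termination reason of the restic pods. This sketch assumes a default install (namespace `velero`, daemonset pods labeled `name=restic`); adjust to your environment.

```shell
# List restic pods with the reason their containers last terminated;
# "OOMKilled" here indicates the kill described above.
kubectl -n velero get pods -l name=restic \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# Node-level evidence, if your kubelet reports OOM events:
kubectl get events -A --field-selector reason=OOMKilling
```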

Lyndon-Li commented 1 year ago

@lklkxcxc Could you help confirm whether an OOM kill ever happened in the environment? Meanwhile, I see you are using Velero 1.7. I suggest you upgrade to 1.10, which ships the Kopia path; it performs better when backing up large data sizes. For how to use the Kopia path for file system backup in 1.10, refer to this doc

lklkxcxc commented 1 year ago

I did not find that an OOM kill ever happened in the environment. Thanks!