vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Some Restic Backups timeout #2226

Closed: koehn closed this issue 4 years ago

koehn commented 4 years ago

What steps did you take and what happened: I have velero 1.2 and restic configured to run backups. I recently switched S3 providers (to minio) and decided to start over with new backups. I ran velero backup delete --all and waited for the backups to be deleted. I removed all the resticrepositories, and podvolumebackups. I changed the backupstoragelocation and cloud-credentials for my new S3 provider. I then ran the same velero backup create full command I usually do.
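For reference, the reset looked roughly like this (endpoint, bucket, and credential details omitted):

velero backup delete --all --confirm
kubectl -n velero delete resticrepositories.velero.io --all
kubectl -n velero delete podvolumebackups.velero.io --all
kubectl -n velero edit backupstoragelocation default       # point it at the new minio endpoint and bucket
kubectl -n velero delete secret cloud-credentials
kubectl -n velero create secret generic cloud-credentials --from-file=cloud=credentials-velero
velero backup create full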

What did you expect to happen: The cluster and volumes should get backed up. Some of them were (and are on my S3), and others not so much (although the restic repos did get pushed to S3).

The output of the following commands will help us better understand what's going on:

Anything else you would like to add:

Good podvolumebackup:

apiVersion: velero.io/v1
kind: PodVolumeBackup
metadata:
  annotations:
    velero.io/pvc-name: gitlab-do
  creationTimestamp: "2020-01-28T20:38:20Z"
  generateName: full-
  generation: 16
  labels:
    velero.io/backup-name: full
    velero.io/backup-uid: 741c8130-0be3-4b03-9a48-968dd69d928f
    velero.io/pvc-uid: fc0a9771-9de8-4c17-b5ce-7a220c005fea
  name: full-6svfm
  namespace: velero
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Backup
    name: full
    uid: 741c8130-0be3-4b03-9a48-968dd69d928f
  resourceVersion: "19626082"
  selfLink: /apis/velero.io/v1/namespaces/velero/podvolumebackups/full-6svfm
  uid: 806804fc-a054-4358-91bd-0878713b6969
spec:
  backupStorageLocation: default
  node: k8s-htz-worker-04
  pod:
    kind: Pod
    name: gitlab-596c566875-t7brs
    namespace: gitlab
    uid: 066a2068-7ba7-496a-8d12-a34e194f8f9c
  repoIdentifier: s3:https://some-s3-server/some-bucket/restic/gitlab
  tags:
    backup: full
    backup-uid: 741c8130-0be3-4b03-9a48-968dd69d928f
    ns: gitlab
    pod: gitlab-596c566875-t7brs
    pod-uid: 066a2068-7ba7-496a-8d12-a34e194f8f9c
    pvc-uid: fc0a9771-9de8-4c17-b5ce-7a220c005fea
    volume: gitlab-do
  volume: gitlab-do
status:
  completionTimestamp: "2020-01-28T20:40:39Z"
  path: /host_pods/066a2068-7ba7-496a-8d12-a34e194f8f9c/volumes/kubernetes.io~csi/pvc-fc0a9771-9de8-4c17-b5ce-7a220c005fea/mount
  phase: Completed
  progress:
    bytesDone: 652131419
    totalBytes: 652131419
  snapshotID: 99b42b1d
  startTimestamp: "2020-01-28T20:38:20Z"

Bad podvolumebackup:

apiVersion: velero.io/v1
kind: PodVolumeBackup
metadata:
  annotations:
    velero.io/pvc-name: rainloop
  creationTimestamp: "2020-01-28T21:12:58Z"
  generateName: full-
  generation: 2
  labels:
    velero.io/backup-name: full
    velero.io/backup-uid: 741c8130-0be3-4b03-9a48-968dd69d928f
    velero.io/pvc-uid: 5418f32e-03cb-40e5-8870-943b9b5bad00
  name: full-2rjr5
  namespace: velero
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Backup
    name: full
    uid: 741c8130-0be3-4b03-9a48-968dd69d928f
  resourceVersion: "19634604"
  selfLink: /apis/velero.io/v1/namespaces/velero/podvolumebackups/full-2rjr5
  uid: 7c4c9f3c-6226-4e9a-b862-a17edf6df65a
spec:
  backupStorageLocation: default
  node: k8s-htz-worker-03
  pod:
    kind: Pod
    name: rainloop-68fd4ffb69-q5qzs
    namespace: mail
    uid: 03407647-5f4d-4bd1-8f12-b71e1daa2c6c
  repoIdentifier: s3:https://some-s3-server/some-bucket/restic/mail
  tags:
    backup: full
    backup-uid: 741c8130-0be3-4b03-9a48-968dd69d928f
    ns: mail
    pod: rainloop-68fd4ffb69-q5qzs
    pod-uid: 03407647-5f4d-4bd1-8f12-b71e1daa2c6c
    pvc-uid: 5418f32e-03cb-40e5-8870-943b9b5bad00
    volume: rainloop
  volume: rainloop
status:
  phase: InProgress
  progress: {}
  startTimestamp: "2020-01-28T21:12:58Z"

Environment:

koehn commented 4 years ago

Update: a subsequent backup attempt generated a panic.

time="2020-01-29T05:16:06Z" level=info msg="Skipping snapshot of persistent volume because volume is being backed up with restic." backup=velero/full2 group=v1 logSource="pkg/backup/item_backupper.go:413" name=pvc-5418f32e-03cb-40e5-8870-943b9b5bad00 namespace= persistentVolume=pvc-5418f32e-03cb-40e5-8870-943b9b5bad00 resource=persistentvolumes
time="2020-01-29T05:16:09Z" level=info msg="1 errors encountered backup up item" backup=velero/full2 group=v1 logSource="pkg/backup/resource_backupper.go:284" name=rainloop-68fd4ffb69-q5qzs namespace= resource=pods
time="2020-01-29T05:16:09Z" level=error msg="Error backing up item" backup=velero/full2 error="pod volume backup failed: error running restic backup, stderr=panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7b82c8]\n\ngoroutine 47 [running]:\ngithub.com/restic/restic/internal/repository.(*Index).ID(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\n\t/restic/internal/repository/index.go:377 +0x38\ngithub.com/restic/restic/internal/repository.(*Repository).LoadIndex.func5(0x8, 0xe6c288)\n\t/restic/internal/repository/repository.go:467 +0xce\ngolang.org/x/sync/errgroup.(*Group).Go.func1(0xc000554990, 0xc0005c3300)\n\t/restic/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x57\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\t/restic/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66\n: unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:182" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:288" name=rainloop-68fd4ffb69-q5qzs namespace= resource=pods

Full log: https://cloud.koehn.com/s/3d7SS56cs5ywFdE

ashish-amarnath commented 4 years ago

@koehn Thanks for the details on the issue. Would you be able to tell us the podVolumeOperationTimeout you are using? By default it is set to 1h: https://github.com/vmware-tanzu/velero/blob/master/pkg/cmd/server/server.go#L72. On the node k8s-htz-worker-03, were there other volumes that needed restic backups? I suspect the restic pod on k8s-htz-worker-03 was overwhelmed and couldn't finish the restic backups within the timeout. I also suggest running both the velero server and the restic daemon set with debug-level logging, which might give us more insight.
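If you want to experiment with those, here is a rough sketch against a default install (deployment and daemon set names are from the standard manifests, and the timeout flag name is from memory for 1.2, so double-check against server.go):

kubectl -n velero edit deployment/velero
# in the velero server container args, add or adjust, e.g.:
#   - --restic-timeout=4h      # this is podVolumeOperationTimeout (default 1h)
#   - --log-level=debug

kubectl -n velero edit daemonset/restic
# in the restic server container args, add:
#   - --log-level=debug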

koehn commented 4 years ago

@ashish-amarnath Thanks for getting back to me.

I haven't changed the podVolumeOperationTimeout at all; I didn't know it existed. I added a volume on another pod on k8s-htz-worker-03 and started a backup. It still hangs on the rainloop pod/volume, and the load average is pretty reasonable (0.68).

I'll try another backup with debug turned on if I can figure out how…

koehn commented 4 years ago

BTW, no problems backing up another volume from another pod on k8s-htz-worker-03, so it's something specific to either the rainloop pod or its volume. It's still hitting the nil pointer panic in restic, which indicates to me that it's at least trying to run the backup.

time="2020-01-29T05:16:09Z" level=error msg="Error running command=restic backup --repo=s3:https://s3.home.koehn.com/k8s-hetzner-velero/restic/mail --password-file=/tmp/velero-restic-credentials-mail910920840 --cache-dir=/scratch/.cache/restic . --tag=backup-uid=f3563d7a-4ecd-4ad7-8aca-fec231114c9a --tag=ns=mail --tag=pod=rainloop-68fd4ffb69-q5qzs --tag=pod-uid=03407647-5f4d-4bd1-8f12-b71e1daa2c6c --tag=pvc-uid=5418f32e-03cb-40e5-8870-943b9b5bad00 --tag=volume=rainloop --tag=backup=full2 --host=velero --json, stdout=, stderr=panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7b82c8]\n\ngoroutine 47 [running]:\ngithub.com/restic/restic/internal/repository.(*Index).ID(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\n\t/restic/internal/repository/index.go:377 +0x38\ngithub.com/restic/restic/internal/repository.(*Repository).LoadIndex.func5(0x8, 0xe6c288)\n\t/restic/internal/repository/repository.go:467 +0xce\ngolang.org/x/sync/errgroup.(*Group).Go.func1(0xc000554990, 0xc0005c3300)\n\t/restic/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x57\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\t/restic/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66\n" backup=velero/full2 controller=pod-volume-backup error="unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:169" error.function=github.com/vmware-tanzu/velero/pkg/restic.getSummaryLine logSource="pkg/controller/pod_volume_backup_controller.go:261" name=full2-vjldx namespace=velero

Is there a way to delete the rainloop volume from the mail restic repository? I wonder if it got corrupted somehow.

ashish-amarnath commented 4 years ago

Digging into the panic a little more, it looks like the panic is from running the restic backup command. Could you please share logs from the restic daemon set pods to confirm whether the root cause is in Velero?

ashish-amarnath commented 4 years ago

About deleting the rainloop volume from the restic repository: looking at https://github.com/vmware-tanzu/velero/blob/master/pkg/controller/restic_repository_controller.go, I don't see Velero handling restic repository deletes. I wonder if you can just blow away the restic repository in the volume backup store to accomplish this. @skriss, thoughts?

skriss commented 4 years ago

I see the panic is coming from within the restic binary itself, so it does sound like something may have gotten corrupted there; perhaps a previous restic backup operation terminated abnormally, hence the hang from the Velero perspective and now the panic from restic. You could try connecting to the underlying restic repo directly, using the restic CLI, and running some combination of restic check (to verify the repo's integrity), restic rebuild-index (since the panic seems to be related to accessing the index), and/or restic forget (to remove particular snapshot(s) from the repo, probably with the --prune flag). You can see from the error output above what some of the other necessary flags are, e.g. the --repo flag. That may get you back to a working state.
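Roughly, something like this from any machine with the restic CLI installed (the repo URL is from your logs above; the repo password lives in the velero-restic-credentials secret in the velero namespace, if I remember the key name right):

export AWS_ACCESS_KEY_ID=<same access key velero uses>
export AWS_SECRET_ACCESS_KEY=<same secret key velero uses>
export RESTIC_REPOSITORY=s3:https://some-s3-server/some-bucket/restic/mail
export RESTIC_PASSWORD=$(kubectl -n velero get secret velero-restic-credentials -o jsonpath='{.data.repository-password}' | base64 -d)

restic check                                  # verify repo integrity
restic rebuild-index                          # rebuild the index the panic points at
restic snapshots --tag volume=rainloop        # list snapshots for just that volume
restic forget <snapshot-id> --prune           # drop a specific snapshot and prune unreferenced data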

koehn commented 4 years ago

I deleted all the resticrepositories, backups, deletebackuprequests, downloadrequests, podvolumebackups, etc. from k8s, emptied the S3 bucket, deleted every restic and velero pod in an attempt to purge any leftover data from my system and get to a clean slate.

When I re-ran the backup I still get the exact same results: the first time it hangs on rainloop, after that it backs up all volumes except rainloop. I'm going to manually recreate that volume and see what's up.

koehn commented 4 years ago

OK, more tinkering and more data:

  1. I ran restic check, restic rebuild-index and restic forget --prune. No issues anywhere.

  2. I ran another backup, and for whatever reason it backed up rainloop before dovecot this time (every other time dovecot ran first). This time rainloop backed up successfully and dovecot failed. These are the only two pods in that namespace (and thus they share the same restic repo). My guess is that either velero or restic has a problem with two volumes sharing a repo like this.

koehn commented 4 years ago

Relevant restic pod log including velero crash:

time="2020-01-30T00:53:29Z" level=info msg="Setting log-level to INFO"
time="2020-01-30T00:53:29Z" level=info msg="Starting Velero restic server v1.2.0 (5d008491bbf681658d3e372da1a9d3a21ca4c03c)" logSource="pkg/cmd/cli/restic/server.go:62"
time="2020-01-30T00:53:29Z" level=info msg="Starting controllers" logSource="pkg/cmd/cli/restic/server.go:156"
time="2020-01-30T00:53:29Z" level=info msg="Controllers started successfully" logSource="pkg/cmd/cli/restic/server.go:199"
time="2020-01-30T00:53:29Z" level=info msg="Starting controller" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:76"
time="2020-01-30T00:53:29Z" level=info msg="Waiting for caches to sync" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:79"
time="2020-01-30T00:53:29Z" level=info msg="Starting controller" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:76"
time="2020-01-30T00:53:29Z" level=info msg="Waiting for caches to sync" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:79"
time="2020-01-30T00:53:29Z" level=info msg="Caches are synced" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:83"
time="2020-01-30T00:53:29Z" level=info msg="Caches are synced" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:83"
time="2020-01-30T01:22:24Z" level=info msg="Backup starting" backup=velero/full2 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:178" name=full2-jl6g8 namespace=velero
time="2020-01-30T01:22:24Z" level=info msg="Looking for most recent completed pod volume backup for this PVC" backup=velero/full2 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:301" name=full2-jl6g8 namespace=velero pvcUID=5418f32e-03cb-40e5-8870-943b9b5bad00
time="2020-01-30T01:22:24Z" level=info msg="No completed pod volume backup found for PVC" backup=velero/full2 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:323" name=full2-jl6g8 namespace=velero pvcUID=5418f32e-03cb-40e5-8870-943b9b5bad00
time="2020-01-30T01:22:24Z" level=info msg="No parent snapshot found for PVC, not using --parent flag for this backup" backup=velero/full2 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:247" name=full2-jl6g8 namespace=velero
time="2020-01-30T01:22:27Z" level=error msg="Error running command=restic backup --repo=s3:https://s3.home.koehn.com/k8s-hetzner-velero/restic/mail --password-file=/tmp/velero-restic-credentials-mail113980895 --cache-dir=/scratch/.cache/restic . --tag=pod-uid=03407647-5f4d-4bd1-8f12-b71e1daa2c6c --tag=pvc-uid=5418f32e-03cb-40e5-8870-943b9b5bad00 --tag=volume=rainloop --tag=backup=full2 --tag=backup-uid=95503bbe-51b0-4b2f-b69d-05f0dac0197d --tag=ns=mail --tag=pod=rainloop-68fd4ffb69-q5qzs --host=velero --json, stdout=, stderr=panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7b82c8]\n\ngoroutine 60 [running]:\ngithub.com/restic/restic/internal/repository.(*Index).ID(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\n\t/restic/internal/repository/index.go:377 +0x38\ngithub.com/restic/restic/internal/repository.(*Repository).LoadIndex.func5(0x8, 0xe6c288)\n\t/restic/internal/repository/repository.go:467 +0xce\ngolang.org/x/sync/errgroup.(*Group).Go.func1(0xc0003de120, 0xc0001f4f60)\n\t/restic/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x57\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\t/restic/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66\n" backup=velero/full2 controller=pod-volume-backup error="unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:169" error.function=github.com/vmware-tanzu/velero/pkg/restic.getSummaryLine logSource="pkg/controller/pod_volume_backup_controller.go:261" name=full2-jl6g8 namespace=velero
time="2020-01-30T01:22:27Z" level=info msg="Backup starting" backup=velero/full2 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:178" name=full2-vfrrk namespace=velero
time="2020-01-30T01:22:27Z" level=info msg="Looking for most recent completed pod volume backup for this PVC" backup=velero/full2 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:301" name=full2-vfrrk namespace=velero pvcUID=8bf40c15-9ca8-47b8-8c17-864f01c3bd18
time="2020-01-30T01:22:27Z" level=info msg="No completed pod volume backup found for PVC" backup=velero/full2 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:323" name=full2-vfrrk namespace=velero pvcUID=8bf40c15-9ca8-47b8-8c17-864f01c3bd18
time="2020-01-30T01:22:27Z" level=info msg="No parent snapshot found for PVC, not using --parent flag for this backup" backup=velero/full2 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:247" name=full2-vfrrk namespace=velero
time="2020-01-30T01:23:46Z" level=info msg="Backup completed" backup=velero/full2 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:291" name=full2-vfrrk namespace=velero
time="2020-01-30T14:08:47Z" level=info msg="Backup starting" backup=velero/full3 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:178" name=full3-8knk9 namespace=velero
time="2020-01-30T14:08:47Z" level=info msg="Looking for most recent completed pod volume backup for this PVC" backup=velero/full3 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:301" name=full3-8knk9 namespace=velero pvcUID=3937ca6a-012d-4f5f-9519-abc240199b89
time="2020-01-30T14:08:47Z" level=info msg="No completed pod volume backup found for PVC" backup=velero/full3 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:323" name=full3-8knk9 namespace=velero pvcUID=3937ca6a-012d-4f5f-9519-abc240199b89
time="2020-01-30T14:08:47Z" level=info msg="No parent snapshot found for PVC, not using --parent flag for this backup" backup=velero/full3 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:247" name=full3-8knk9 namespace=velero
time="2020-01-30T14:08:50Z" level=error msg="Error running command=restic backup --repo=s3:https://s3.home.koehn.com/k8s-hetzner-velero/restic/mail --password-file=/tmp/velero-restic-credentials-mail328325481 --cache-dir=/scratch/.cache/restic . --tag=backup-uid=9d36aaf2-6414-4bc6-a9c8-b14cdac57151 --tag=ns=mail --tag=pod=rainloop-7d459d59bc-wc9wt --tag=pod-uid=01227643-5630-4038-8972-67d0887b8216 --tag=pvc-uid=3937ca6a-012d-4f5f-9519-abc240199b89 --tag=volume=rainloop --tag=backup=full3 --host=velero --json, stdout=, stderr=panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7b82c8]\n\ngoroutine 46 [running]:\ngithub.com/restic/restic/internal/repository.(*Index).ID(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\n\t/restic/internal/repository/index.go:377 +0x38\ngithub.com/restic/restic/internal/repository.(*Repository).LoadIndex.func5(0x8, 0xe6c288)\n\t/restic/internal/repository/repository.go:467 +0xce\ngolang.org/x/sync/errgroup.(*Group).Go.func1(0xc00038ac60, 0xc000561400)\n\t/restic/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x57\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\t/restic/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66\n" backup=velero/full3 controller=pod-volume-backup error="unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:169" error.function=github.com/vmware-tanzu/velero/pkg/restic.getSummaryLine logSource="pkg/controller/pod_volume_backup_controller.go:261" name=full3-8knk9 namespace=velero
time="2020-01-30T14:08:50Z" level=info msg="Backup starting" backup=velero/full3 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:178" name=full3-7nrp7 namespace=velero
time="2020-01-30T14:08:50Z" level=info msg="Looking for most recent completed pod volume backup for this PVC" backup=velero/full3 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:301" name=full3-7nrp7 namespace=velero pvcUID=8bf40c15-9ca8-47b8-8c17-864f01c3bd18
time="2020-01-30T14:08:50Z" level=info msg="Found most recent completed pod volume backup for PVC" backup=velero/full3 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:330" name=full3-7nrp7 namespace=velero parentPodVolumeBackup=full2-vfrrk parentSnapshotID=b5ba6d7f pvcUID=8bf40c15-9ca8-47b8-8c17-864f01c3bd18
time="2020-01-30T14:08:50Z" level=info msg="Setting --parent flag for this backup" backup=velero/full3 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:249" name=full3-7nrp7 namespace=velero parentSnapshotID=b5ba6d7f
time="2020-01-30T15:57:19Z" level=info msg="Backup completed" backup=velero/full3 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:291" name=full3-7nrp7 namespace=velero
time="2020-01-30T15:59:46Z" level=info msg="Backup starting" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:178" name=full4-mcrpj namespace=velero
time="2020-01-30T15:59:46Z" level=info msg="Looking for most recent completed pod volume backup for this PVC" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:301" name=full4-mcrpj namespace=velero pvcUID=3937ca6a-012d-4f5f-9519-abc240199b89
time="2020-01-30T15:59:46Z" level=info msg="No completed pod volume backup found for PVC" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:323" name=full4-mcrpj namespace=velero pvcUID=3937ca6a-012d-4f5f-9519-abc240199b89
time="2020-01-30T15:59:46Z" level=info msg="No parent snapshot found for PVC, not using --parent flag for this backup" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:247" name=full4-mcrpj namespace=velero
time="2020-01-30T16:00:01Z" level=info msg="Backup completed" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:291" name=full4-mcrpj namespace=velero
time="2020-01-30T16:00:01Z" level=info msg="Backup starting" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:178" name=full4-d47kq namespace=velero
time="2020-01-30T16:00:01Z" level=info msg="Looking for most recent completed pod volume backup for this PVC" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:301" name=full4-d47kq namespace=velero pvcUID=8bf40c15-9ca8-47b8-8c17-864f01c3bd18
time="2020-01-30T16:00:01Z" level=info msg="Found most recent completed pod volume backup for PVC" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:330" name=full4-d47kq namespace=velero parentPodVolumeBackup=full3-7nrp7 parentSnapshotID=0cfafb63 pvcUID=8bf40c15-9ca8-47b8-8c17-864f01c3bd18
time="2020-01-30T16:00:01Z" level=info msg="Setting --parent flag for this backup" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:249" name=full4-d47kq namespace=velero parentSnapshotID=0cfafb63
time="2020-01-30T16:01:11Z" level=info msg="Backup completed" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:291" name=full4-d47kq namespace=velero
time="2020-01-30T16:01:11Z" level=info msg="Backup starting" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:178" name=full4-bw2q6 namespace=velero
time="2020-01-30T16:01:11Z" level=info msg="Looking for most recent completed pod volume backup for this PVC" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:301" name=full4-bw2q6 namespace=velero pvcUID=6624c6c5-2ad4-4ec3-a487-b79005b1607b
time="2020-01-30T16:01:11Z" level=info msg="Found most recent completed pod volume backup for PVC" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:330" name=full4-bw2q6 namespace=velero parentPodVolumeBackup=full2-lm94w parentSnapshotID=bc24a8c4 pvcUID=6624c6c5-2ad4-4ec3-a487-b79005b1607b
time="2020-01-30T16:01:11Z" level=info msg="Setting --parent flag for this backup" backup=velero/full4 controller=pod-volume-backup logSource="pkg/controller/pod_volume_backup_controller.go:249" name=full4-bw2q6 namespace=velero parentSnapshotID=bc24a8c4
panic: runtime error: slice bounds out of range

goroutine 3208 [running]:
github.com/vmware-tanzu/velero/pkg/restic.getLastLine(0xc000323000, 0x0, 0x200, 0x0, 0xc000096001, 0x1846220)
    /go/src/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:157 +0xcf
github.com/vmware-tanzu/velero/pkg/restic.RunBackup.func1(0xc000863320, 0x1d3e940, 0xc0006c65f0, 0xc0008632f0, 0xc0006244e0)
    /go/src/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:99 +0x129
created by github.com/vmware-tanzu/velero/pkg/restic.RunBackup
    /go/src/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:94 +0x153
skriss commented 4 years ago

alright, well the velero panic is at least something concrete we can look into & fix. That appears to be in the code that's scanning the restic output for progress updates on the backup.

skriss commented 4 years ago

thanks for including all the detailed info!

koehn commented 4 years ago

I noticed that the pod that was supposed to run the backup had failed. A kubectl logs -p restic-12345 was able to recover the crashed pod's logs.
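In case it helps anyone else, the sequence was roughly this (the pod label is from the default install, so it may differ):

kubectl -n velero get pods -l name=restic        # find the restic daemon set pod on the affected node
kubectl -n velero logs -p restic-xxxxx           # -p/--previous shows logs from the crashed container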

skriss commented 4 years ago

yep, I can see that code would definitely panic if restic's stdout is empty when it's called. we'll work on a fix here.

skriss commented 4 years ago

@ashish-amarnath do you want to work on a fix here? it looks like there's a subslicing bug at the specific line referenced for empty input, but we also need to make sure the calling code handles empty input as well (I know there's some JSON decoding going on in there).

koehn commented 4 years ago

In addition, the version of restic included in the velero/velero:1.2.0 image is out of date. Version 0.9.6 is current, and would have shown me a broken index instead of a panic.

But the root cause of my problem appears to be the velero bug killing the container and taking restic along with it.
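A quick way to confirm which restic version the image ships with (pod name is just a placeholder):

kubectl -n velero exec restic-xxxxx -- restic version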

koehn commented 4 years ago

FWIW I've run several dozen backups with this fix in place and experienced no more errors.

skriss commented 4 years ago

fantastic - thanks again for the fix! FYI we are planning to release a v1.3 in the coming weeks that includes this patch.