vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

PartiallyFailed Backup because restic issue #2073

Closed. adawolfs closed this issue 4 years ago.

adawolfs commented 4 years ago

Hello!

I have an issue when I try to create backups of some OpenShift projects (namespaces): it seems like restic is unable to back up some volumes. I'm using Velero + GCP to back up database and application files from GlusterFS volumes in OpenShift 3.11; the pods were created by DeploymentConfigs, Deployments, and StatefulSets.

The annotations were added with something like this: oc patch deployment/mysql -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes": "mysql-data"}}}}}'. After creating backups with velero backup create backup-name-$(date -u +%Y%m%d%H%M%S) --include-namespaces namespace-name, some of them end in a PartiallyFailed (X errors) status, but this happens only in some projects.
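(For readers following along: the oc patch above only sets the per-pod-template annotation that tells Velero which volumes to back up with restic. A minimal sketch of where it ends up, as a hypothetical Deployment excerpt:)

spec:
  template:
    metadata:
      annotations:
        # comma-separated list of volume names (as named in the pod spec) to back up with restic
        backup.velero.io/backup-volumes: mysql-data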

In the restic logs I found:

error controller=pod-volume-backup error="signal: killed" error.file="/go/src/github.com/heptio/velero/pkg/controller/pod_volume_backup_controller.go:245"

time="2019-11-19T23:26:58Z" level=error msg="Error running command=restic backup --repo=gs:pcb-c2151de:/default-location/restic/xxx1801-tst-liferay --password-file=/tmp/velero-restic-credentials-xxx1801-tst-liferay700070883 --cache-dir=/scratch/.cache/restic . --tag=volume=mysql-data --tag=backup=xxx1801-tst-20191119232553 --tag=backup-uid=eecd5863-0b23-11ea-a694-6cb3112359c4 --tag=ns=xxx1801-tst-liferay --tag=pod=xxxst-mysql-59694d4c65-7p8nj --tag=pod-uid=bff7a8c6-0b23-11ea-a877-6cb3112359f0 --host=velero, stdout=open repository\ncreated new cache in /scratch/.cache/restic\n, stderr=" backup=velero/xxx1801-tst-20191119232553 controller=pod-volume-backup error="signal: killed" error.file="/go/src/github.com/heptio/velero/pkg/controller/pod_volume_backup_controller.go:245" error.function="github.com/heptio/velero/pkg/controller.(*podVolumeBackupController).processBackup" logSource="pkg/controller/pod_volume_backup_controller.go:245" name=xxx1801-tst-20191119232553-7sbbq namespace=velero
W1119 23:28:57.299178       1 reflector.go:289] github.com/heptio/velero/pkg/generated/informers/externalversions/factory.go:117: watch of *v1.PodVolumeRestore ended with: The resourceVersion for the provided watch is too old.
W1119 23:29:15.320123       1 reflector.go:289] github.com/heptio/velero/pkg/cmd/cli/restic/server.go:197: watch of *v1.Secret ended with: The resourceVersion for the provided watch is too old.
W1119 23:33:32.371220       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.PersistentVolume ended with: The resourceVersion for the provided watch is too old.
W1119 23:34:37.364517       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.PersistentVolumeClaim ended with: The resourceVersion for the provided watch is too old.
W1119 23:35:57.317734       1 reflector.go:289] github.com/heptio/velero/pkg/generated/informers/externalversions/factory.go:117: watch of *v1.PodVolumeBackup ended with: The resourceVersion for the provided watch is too old.
W1119 23:41:56.450433       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.PersistentVolume ended with: The resourceVersion for the provided watch is too old.

Could you please help me figure out what is actually making the backups fail?

I'm adding all the useful logs here: https://gist.github.com/adawolfs/aff5b78d068bfbcf5396c36c34a65559

Server: openshift v3.11.117, kubernetes v1.11.0+d4cacc0

adawolfs commented 4 years ago

Hi

Reading through the v1.2.0 changelog I found that you added more verbose logs from restic, so after following the guide to update, I ran the backup again. It now ends with unable to find summary in restic backup command output, and indeed a message_type summary line is missing from the output (the backup doesn't even seem to finish).

Here are the last output messages:

{\"message_type\":\"status\",\"seconds_elapsed\":121,\"seconds_remaining\":253,\"percent_done\":0.32345830900368333,\"total_files\":17011,\"files_done\":7182,\"total_bytes\":56567060801,\"bytes_done\":18297085832,\"current_files\":[\"/liferay-data/document_library/20157/11701345/68/92/68928.afsh/68928_1.0.afsh\",\"/liferay-data/document_library/20157/11701345/68/92/68929.afsh/68929_1.0.afsh\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":121,\"seconds_remaining\":253,\"percent_done\":0.3235266029179383,\"total_files\":17011,\"files_done\":7182,\"total_bytes\":56567060801,\"bytes_done\":18300949018,\"current_files\":[\"/liferay-data/document_library/20157/11701345/68/92/68928.afsh/68928_1.0.afsh\",\"/liferay-data/document_library/20157/11701345/68/92/68929.afsh/68929_1.0.afsh\"]}\n, stderr=" backup=velero/bpo1801-tst-20191126035227 controller=pod-volume-backup error="unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:169" error.function=github.com/vmware-tanzu/velero/pkg/restic.getSummaryLine logSource="pkg/controller/pod_volume_backup_controller.go:261" name=bpo1801-tst-20191126035227-ltrt7 namespace=velero

skriss commented 4 years ago

@adawolfs apologies for the delay, we've been kind of swamped with KubeCon and the US holidays.

With the signal: killed error you posted originally, I'm wondering if you're hitting an OOM (out-of-memory) issue in the velero restic daemonset pod. Do you have CPU/mem requests/limits on the daemonset? You might start by increasing those (at a minimum, the mem limit).
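(For anyone else landing here, a minimal sketch of raising those limits; it assumes a default install where the daemonset and its container are both named restic in the velero namespace, so adjust names and values to your setup:)

kubectl -n velero patch daemonset restic --patch \
  '{"spec":{"template":{"spec":{"containers":[{"name":"restic","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"500m","memory":"512Mi"}}}]}}}}'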

skriss commented 4 years ago

@adawolfs just checking to see if you had a chance to look at my previous comment.

skriss commented 4 years ago

Closing this out as inactive -- feel free to reach out again as needed.

andy108369 commented 2 years ago

With the signal: killed error you posted originally, I'm wondering if you're hitting an OOM (out-of-memory) issue in the velero restic daemonset pod. Do you have CPU/mem requests/limits on the daemonset? You might start by increasing those (at a minimum, the mem limit).

Not the original author here, but I have hit the same issue and can confirm that the restic process got killed due to OOM (Velero 1.9.0 & 1.9.1). Fixed by lifting the memory resource limit from 1Gi to 2Gi on the restic DaemonSet.
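(Not from the original report, just a generic way to confirm this: because restic runs as a child process inside the daemonset pod, the pod may keep running while the kernel OOM-kills restic, so the evidence tends to show up in the node's kernel log rather than in the pod status.)

## on the node that was running the restic daemonset pod
dmesg -T | grep -iE 'out of memory|oom-kill|killed process'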


@skriss Since the restic process consumed a little more than 1Gi for at least two users, you might want to consider increasing the default in the Velero helm chart.
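(For reference, a sketch of what that could look like in the chart's values; the exact keys differ between chart versions, so treat the layout below as an assumption and check the chart's values.yaml:)

restic:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 2Gi     # raised from the 1Gi that was not enough here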

FWIW, memory stats for restic when it consumes ~1.1Gi (see RssAnon below):

# cat /proc/2550974/status
Name:   restic
Umask:  0022
State:  S (sleeping)
Tgid:   2550974
Ngid:   2551098
Pid:    2550974
PPid:   2549993
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 128
Groups:  
NStgid: 2550974 44
NSpid:  2550974 44
NSpgid: 2549993 1
NSsid:  2549993 1
VmPeak:  1882924 kB
VmSize:  1882924 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   1137284 kB
VmRSS:   1137284 kB
RssAnon:     1122232 kB
RssFile:       15052 kB
RssShmem:          0 kB
VmData:  1249844 kB
VmStk:       132 kB
VmExe:      8604 kB
VmLib:         4 kB
VmPTE:      2468 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
Threads:    69
SigQ:   1/3090235
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: fffffffd7fc1feff
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp:    0
Speculation_Store_Bypass:   thread vulnerable
Cpus_allowed:   ffffffff,ffffffff
Cpus_allowed_list:  0-63
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:  0-1
voluntary_ctxt_switches:    106658
nonvoluntary_ctxt_switches: 3234

## restic process read 26Gi at this point

# cat /proc/2550974/io | numfmt --to=iec-i --field=2
rchar:        60Gi
wchar:        54Gi
syscr:   1.2Mi
syscw:   3.5Mi
read_bytes:        26Gi
write_bytes:        27Gi
cancelled_write_bytes:        15Gi

## PVC restic is backing up has 48Gi of data in it

# df -Ph |grep pvc
/dev/rbd1        98G   48G   51G  49% /var/lib/kubelet/pods/REDACTED/volumes/kubernetes.io~csi/pvc-REDACTED/mount